Understanding Verity Collections in ColdFusion

By Jeremy Petersen
Senior Developer
TeachStream.com

ColdFusion 4.5 comes bundled with a custom implementation of the Verity97 search engine, which gives ColdFusion users the ability to search database and file content. Even though Verity's power is well known, a surprisingly small number of Web developers take advantage of it. To get the most out of Verity, it helps to have a general understanding of its internal workings and file structure.

This article outlines the basic components of Verity, how it integrates with ColdFusion, and a few best practices for using it. It covers only the fundamentals; please refer to the Verity and ColdFusion documentation for complete instructions.

Verity File Structure

Before it can search, Verity must first index data into a collection. A collection is a set of index files that represents a group of documents, plus metadata about those documents, and is optimized for searching. The information stored in a collection includes various word indexes, an internal documents table containing document field information, and, for file and path indexes, pointers to the actual document files.

When you create a collection, you must first choose its location. The default location for all collections is localdrive:\CFUSION\Verity\Collections\. After choosing a home for your collection, give the collection a name. ColdFusion references the collection by this name, and it also becomes the name of the collection root folder. If you stick with the default collection path, you should end up with localdrive:\CFUSION\Verity\Collections\myCollectionName\.

Each collection root folder contains two subfolders: custom and file. The type of index used dictates which of these folders is populated with index data. The file folder is used for TYPE="File" and TYPE="Path" indexes, and the custom folder is used for TYPE="Custom" indexes.
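The mapping from index type to subfolder can be written down as a small lookup. The following is a minimal Python sketch rather than anything ColdFusion itself runs; the function name index_folder and the drive letter C: (standing in for "localdrive") are assumptions made for illustration.

```python
from pathlib import PureWindowsPath

# Mapping taken from the text: File and Path indexes share the "file"
# subfolder, Custom indexes use the "custom" subfolder.
SUBFOLDER_BY_TYPE = {"file": "file", "path": "file", "custom": "custom"}


def index_folder(collections_dir, collection_name, index_type):
    """Return the subfolder that an index of the given TYPE populates."""
    subfolder = SUBFOLDER_BY_TYPE[index_type.lower()]
    return PureWindowsPath(collections_dir) / collection_name / subfolder


# Assumption: C: stands in for "localdrive" in the default path.
base = r"C:\CFUSION\Verity\Collections"
print(index_folder(base, "myCollectionName", "Path"))
print(index_folder(base, "myCollectionName", "Custom"))
```

PureWindowsPath is used so the sketch handles Windows-style paths even when it is run on another operating system.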

Inside these folders, you will find more folders: ASSISTS, MORGUE, PARTS, PDD, STYLE, TEMP, TOPICIDX, TRANS, and WORK. The data files in these folders are named with an incrementing eight-digit number (e.g., 00000001.ddd, 00000005.ddd, etc.), and the numbers increase with index modification transactions (e.g., update, optimize, purge, etc.). Certain transactions can also reset these numbers (e.g., purge).

ASSISTS, PARTS, and STYLE are the folders most relevant to basic ColdFusion-powered Verity functionality.

ASSISTS

The ASSISTS folder stores transaction log files. Although each of these files is only 2K in size, they can become very numerous and gobble up hard drive space. Optimization, purging, and refreshing can help keep these files to a minimum.

PARTS

The PARTS folder contains the actual data files that Verity performs searches against. Every time an index is updated, a set of DDD and DID files is added. As the number of files grows, searches slow down because Verity must search through the additional files. Optimizing the index, a vital part of good Verity performance, defragments this directory by combining all of the individual files into one set of optimized files.

Once optimized, these data files take up a fraction of the original content file size (depending on file type complexity). Simple files like TXT and HTML take up far less space than more complex file types, such as PDF or GIF files.
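Because every update adds DDD and DID files, the number of files in a PARTS folder gives a rough indication of how fragmented an index has become. The following is a minimal, read-only Python sketch of such a count. The function name count_parts_files is hypothetical, and the demonstration builds a throwaway folder layout rather than touching a real collection.

```python
import tempfile
from pathlib import Path


def count_parts_files(collection_root):
    """Count .ddd and .did files in each PARTS folder of a collection."""
    counts = {}
    for index_kind in ("custom", "file"):
        parts_dir = Path(collection_root) / index_kind / "PARTS"
        tally = {"ddd": 0, "did": 0}
        if parts_dir.is_dir():
            for entry in parts_dir.iterdir():
                ext = entry.suffix.lower().lstrip(".")
                if entry.is_file() and ext in tally:
                    tally[ext] += 1
        counts[index_kind] = tally
    return counts


# Demo on a throwaway layout so the sketch runs anywhere. On a real
# server you would pass the collection root folder instead.
with tempfile.TemporaryDirectory() as tmp:
    parts = Path(tmp) / "file" / "PARTS"
    parts.mkdir(parents=True)
    for n in (1, 2, 3):  # pretend three updates have been run
        (parts / f"{n:08d}.ddd").touch()
        (parts / f"{n:08d}.did").touch()
    demo_counts = count_parts_files(tmp)

print(demo_counts)
```

A count that keeps climbing between optimizations suggests the index is due for one. The sketch only reads the directory listing, but the warning about not holding collection files open during an update still applies.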

STYLE

The STYLE directory holds many files that determine collection setup. By default, you do not need to change any of these files, but you can edit them to tweak certain Verity traits.

The most useful example of this is the STYLE.PRM file, which allows you to control the summary output of Verity search results. Style manipulation is very complicated, so be careful when changing the files, and make sure your indexes are not in use while changes are being made. You will also need to rebuild the index for the changes to take effect. Search the ColdFusion Forums for help.

Verity Functions

ColdFusion gives you two ways to work with your Verity collections: the ColdFusion Administrator or the CFINDEX tag. You can perform the following functions on Verity indexes:

Update

Every time you run an index update, a DID file and a DDD file are added to the applicable PARTS folder. These new files represent the data you just indexed.

Optimize

All DDD and DID files in the PARTS folder will be deleted and a single, optimized set of DDD and DID files will be created. In addition, the ASSISTS folder will be cleaned and all ABT files will be deleted except the latest versions.

Please note that using the CFINDEX tag or the ColdFusion Administrator to optimize a collection cleans up only one ASSISTS folder. If you are optimizing a file index, only the FILE/ASSISTS folder is cleaned. If you are optimizing a custom index, only the CUSTOM/ASSISTS folder is cleaned.

Running a purge or refresh via CFINDEX (not ColdFusion Administrator) adds to both the FILE/ASSISTS and CUSTOM/ASSISTS folders. This can cause unnecessary ABT files to multiply unchecked and unnecessarily increase the total index file size. Over the long run, this "Verity bloating" effect can really add up!

Purge

The purge function adds ABT files to the CUSTOM/ASSISTS and FILE/ASSISTS folders. It also deletes all PDD, DDD, and DID files.

Using the CFINDEX tag to perform a purge results in two ABT files being added to both the FILE/ASSISTS and CUSTOM/ASSISTS folders. If you are working with a FILE or CUSTOM index, running a purge from the ColdFusion Administrator results in only one ABT file being added.

Delete

The delete function removes all collection folders and files.

Refresh

The refresh function cleans out the applicable PARTS folder, adds new DID and DDD files to it, and adds ABT files to the CUSTOM/ASSISTS and FILE/ASSISTS folders.

Much like the purge function, using the CFINDEX tag to perform a refresh adds two ABT files to both the FILE/ASSISTS and CUSTOM/ASSISTS folders.

Verity Collection Best Practices

Verity Bloating

As shown above, using CFINDEX to purge and refresh your indexes can cause ABT files to be placed in both the FILE/ASSISTS and the CUSTOM/ASSISTS folders. Because optimization cleans only one of these folders, the other folder may continue to accrue files. Be sure to keep an eye on this effect and delete the files by hand if required.
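One way to keep an eye on this effect is to report how many ABT files each ASSISTS folder holds and how much space they take. The following is a minimal, read-only Python sketch. The function name abt_usage is hypothetical, the eight-digit ABT file names in the demonstration are an assumption based on the naming scheme described earlier, and the demonstration runs against a throwaway folder layout rather than a real collection.

```python
import tempfile
from pathlib import Path


def abt_usage(collection_root):
    """Return {folder: (file_count, total_bytes)} for both ASSISTS folders."""
    usage = {}
    for index_kind in ("custom", "file"):
        assists_dir = Path(collection_root) / index_kind / "ASSISTS"
        abt_files = []
        if assists_dir.is_dir():
            abt_files = [p for p in assists_dir.iterdir()
                         if p.is_file() and p.suffix.lower() == ".abt"]
        total_bytes = sum(p.stat().st_size for p in abt_files)
        usage[f"{index_kind}/ASSISTS"] = (len(abt_files), total_bytes)
    return usage


# Demo: a file index that was just optimized (one ABT file left) while
# the custom folder has quietly accumulated four 2K files.
with tempfile.TemporaryDirectory() as tmp:
    for index_kind, how_many in (("file", 1), ("custom", 4)):
        assists = Path(tmp) / index_kind / "ASSISTS"
        assists.mkdir(parents=True)
        for n in range(1, how_many + 1):
            (assists / f"{n:08d}.abt").write_bytes(b"\0" * 2048)
    demo_usage = abt_usage(tmp)

for folder, (count, size) in sorted(demo_usage.items()):
    print(f"{folder}: {count} ABT files, {size} bytes")
```

The sketch deliberately reports only. Deciding which files are safe to delete by hand is left to the administrator.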

Minimize Transactions

With each index update, Verity collections become more fragmented. Fragmentation increases the amount of physical disk space the collection uses and slows the speed at which the collection is accessed.

The idea is to minimize update transactions. For example, Allaire Spectra uses multiple Verity indexes for each Allaire Spectra object. The important thing to note is the performance difference between looping over the items and adding them one at a time, and adding all the items in one shot.

In a test, a custom tag that creates random (but realistically sized) dummy objects was used to create 1,000 objects. In the first run, cfa_contentobjectgetmultiple grabbed all 1,000 objects, and a loop over the result set updated them one at a time. The resulting files totaled 157MB, and the build took five hours and 40 minutes. Optimizing the collections took 20 minutes and reduced them to 14.3MB.

In the second run, one result set was built from a delimited list of ObjectIDs, and a single set of index updates was performed on the 1,000 objects. The collections came to a scant 17.9MB and took only 22 minutes to build. After optimization, which lasted 22 seconds, the collections totaled 14.3MB.
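To put the two runs side by side, the figures quoted above can be reduced to ratios. The short Python calculation below uses only the numbers from the text; the variable names are illustrative.

```python
# Figures quoted in the text for indexing 1,000 objects.
one_at_a_time = {"build_min": 5 * 60 + 40, "size_mb": 157.0, "optimize_s": 20 * 60}
single_batch = {"build_min": 22, "size_mb": 17.9, "optimize_s": 22}

# How much faster or smaller the single-batch run was.
build_speedup = one_at_a_time["build_min"] / single_batch["build_min"]
size_ratio = one_at_a_time["size_mb"] / single_batch["size_mb"]
optimize_speedup = one_at_a_time["optimize_s"] / single_batch["optimize_s"]

print(f"build time:    {build_speedup:.1f}x faster")
print(f"size on disk:  {size_ratio:.1f}x smaller before optimization")
print(f"optimize time: {optimize_speedup:.1f}x faster")
```

In round numbers, the single batch built about 15 times faster, produced a collection nearly 9 times smaller before optimization, and optimized about 55 times faster. Both runs ended at the same 14.3MB after optimization.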

Lock Collections During Update, Optimize, Refresh, and Purge

Lock a collection while any of these methods is running against it. Also, do not have a collection or any of its files open (e.g., with Windows Explorer) while it is being updated. This can cause corruption.

Deleting a Collection

The physical directories and files associated with Verity collections also have registry entries. If you choose to remove a collection by hand without using the ColdFusion Administrator or the applicable ColdFusion tag, or if your collection becomes corrupt and cannot be removed by normal means, you will need to remove its registry entry, directories, and files. The registry entry resides in software/Allaire/current version/ collections.

This method for removing collections is not recommended. Do so at your own risk.

Optimize

Optimizing ever-changing indexes consistently improves search performance more than any other Verity method. Unfortunately, optimization consumes a lot of CPU time and can even make your search unavailable while it runs. This is obviously not ideal for a live application.

It quickly becomes a balancing act: weighing the performance gains against slowing (or even stopping) search availability while the optimization runs. Because no two sites are alike, you will have to weigh the number of live updates (i.e., Verity fragmentation) and collection sizes against your search performance times and all other unique factors to formulate a best practice for your Web application.

Ideally, you can perform the optimization, and perhaps even the index updates themselves, during off-peak hours. If you must provide "live" searchable data, this is not always an option, and you will need to find a creative way to sneak in index updates. This topic has no easy answer, especially in a clustered server environment. Search the ColdFusion Forums for help.

Collections Limits

There is no set limit to the size of Verity collections. However, the hardware and RAM of the index server affect how well a collection performs.

CFSEARCH Limits

CFSEARCH cannot return a result set greater than 64K. This simply means that if you have a large enough collection and your search terms are broad enough, you will hit this limit and an error will be thrown. A best practice is to catch the error and inform the user that the search terms need to be narrowed.

Using the MAXROWS and STARTROW parameters of the CFSEARCH tag does not remove data from the result set. Rather, they only filter what data is visible in the result set. Therefore, if you set MAXROWS to one and your result set still breaks the 64K limit, your code will still bomb and produce the 64K limit error message.
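The reasoning above can be illustrated with a toy model. The Python sketch below is not CFSEARCH and does not call Verity; it only models the behavior described in the text, where the size check applies to the full result set and the STARTROW/MAXROWS window is applied afterwards. The names toy_search, search_or_advise, and ResultSetTooLarge are hypothetical, and measuring the limit as the summed byte length of the rows is an assumption.

```python
LIMIT_BYTES = 64 * 1024  # the 64K ceiling described in the text


class ResultSetTooLarge(Exception):
    """Stand-in for the error thrown when the limit is exceeded."""


def toy_search(rows, startrow=1, maxrows=None):
    """Toy model: check the size of the FULL result set first, then
    apply the STARTROW/MAXROWS window to what the caller sees."""
    total = sum(len(row.encode("utf-8")) for row in rows)  # assumed measure
    if total > LIMIT_BYTES:
        raise ResultSetTooLarge(f"result set is {total} bytes")
    end = None if maxrows is None else startrow - 1 + maxrows
    return rows[startrow - 1:end]


def search_or_advise(rows, **window):
    """Catch the limit error and return a message for the user instead."""
    try:
        return toy_search(rows, **window), None
    except ResultSetTooLarge:
        return [], "Too many matches. Please narrow your search terms."


broad = ["x" * 1024] * 100   # 100K of matches: over the limit
narrow = ["x" * 1024] * 10   # 10K of matches: under the limit

# MAXROWS=1 does not rescue the broad search in this model.
broad_rows, broad_message = search_or_advise(broad, maxrows=1)
narrow_rows, narrow_message = search_or_advise(narrow, maxrows=1)
print(len(broad_rows), broad_message)
print(len(narrow_rows), narrow_message)
```

In a real application, the same pattern would be expressed with ColdFusion's own error handling around the CFSEARCH call: catch the error, then show the user a prompt to narrow the search rather than a raw error page.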