Early on, Macromedia understood the need to provide ColdFusion developers with the ability to integrate advanced search features into their applications. Since 1997, it has integrated Verity into ColdFusion to provide that advanced search. Today, hundreds of thousands of developers have taken advantage of this Verity search to enhance the value and functionality of ColdFusion-based sites and applications. The search technology embedded in ColdFusion MX 7 is Verity’s flagship search product, Verity K2, the leading enterprise search software on the market today. With this integration of Verity K2 into ColdFusion, developers have access to the most sophisticated and powerful search capabilities available at a fraction of the cost of acquiring Verity K2 search separately.
This article explains how to use one specific piece of the Verity search functionality: the Verity Spider (also called vspider.exe or vspider). Vspider is a tool that indexes content and builds collections that users can search. An important new feature became available for vspider in ColdFusion MX 6.1 and 7.0: you can now extend vspider to build collections from data stored on a server other than the one hosting your ColdFusion server. This feature lets you implement an enterprise-wide search solution with ColdFusion. Learn more in the Verity white paper, "Understanding Verity’s ColdFusion Search Expansion Pack."
To complete this tutorial you will need to install the following software and files:
The search functionality within ColdFusion performs searches against collections, not against the actual documents and database records within ColdFusion. A Verity collection is a special index that you create with Verity Spider or the ColdFusion tag, cfindex. These functions locate all the searchable documents and/or database content and extract the text and metadata within each document or record and other information, such as document zone and field data, word proximity, and the physical file system address or URL. Verity gathers all of this information together in the Verity collection. By combining this information into one index and running searches against it, rather than having to locate and access the actual documents and databases each time a user searches for information, you dramatically increase the speed and relevancy of your ColdFusion search capabilities. Verity also makes available advanced features, such as document summaries in results lists and the ability to limit searches to specific groups of documents.
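The idea of searching an index rather than the documents themselves can be illustrated with a toy inverted index. This is only a conceptual sketch in Python: it is not Verity's collection format, and the function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document addresses that contain it."""
    index = defaultdict(set)
    for address, text in documents.items():
        for word in text.lower().split():
            index[word].add(address)
    return index


def search(index, word):
    """Answer a query from the index alone; the documents are not reopened."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical documents, keyed by their file system address.
documents = {
    "/docs/memo1.txt": "quarterly sales memo",
    "/docs/plan.txt": "sales plan for next quarter",
}
index = build_index(documents)
print(search(index, "sales"))
```

The index is built once, and every later search is a lookup against it. A real collection also stores metadata, field data, and word proximity, which this sketch leaves out.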
Within ColdFusion, you can build searches of multiple collections, each of which can focus on a specific group of documents or queries, according to subject, document type, location, or any other logical grouping. Because you can perform searches against multiple collections, developers have substantial flexibility in designing a search interface.
You can generate collections with either vspider or cfindex. When should you use one rather than the other? The following table helps you decide.
| Function | cfindex | vspider |
|---|---|---|
| Indexes ColdFusion documents | Yes | Yes |
| Indexes file system documents | Yes | Yes |
| Indexes documents outside of ColdFusion Server | * No | Yes |
| Indexes a wide range of doc types | Yes | Yes |
| Indexes dynamic content | No | Yes |
| Configure and use through CF Administrator | Yes | No |
| Configure and use through command-line interface | No | Yes |
| Schedule indexing jobs | Yes | Yes |
With vspider, you can index web-based and file system documents in over two hundred of the most popular application document formats, including Microsoft Office, WordPerfect, ASCII text, HTML, SGML, XML and PDF (Adobe Acrobat) documents.
Vspider uses HTTP to "crawl" web servers and collect content to index. Vspider starts crawling at a particular web address you specify with the -start parameter value, for example, http://www.macromedia.com. Vspider requests this page and processes it, collecting all the words from the page and adding them to the index. It also collects all the referring links to other pages and adds these pages to a queue to process in a manner similar to the first page.
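The request, index, and queue cycle described above can be sketched in Python as a simple loop over an in-memory link graph. This is an illustration only: it assumes a first-in, first-out queue (the text says only that linked pages are queued), makes no HTTP requests, and uses made-up page names.

```python
from collections import deque


def crawl_order(start, links):
    """Return pages in the order a queue-based crawl would process them.

    links maps each page to the pages it links to.
    """
    queue = deque([start])
    seen = {start}
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # stands in for collecting the page's words into the index
        for target in links.get(page, []):
            if target not in seen:  # queue each linked page only once
                seen.add(target)
                queue.append(target)
    return order


# Hypothetical site structure.
site = {
    "http://www.macromedia.com": ["/products", "/support"],
    "/products": ["/support", "/products/coldfusion"],
}
print(crawl_order("http://www.macromedia.com", site))
```

The seen set keeps a page that is linked from several places from being processed more than once.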
There are two main advantages to using vspider instead of cfindex:
When you index using the standard ColdFusion search tags, ColdFusion MX communicates with a private-branded Verity K2 server, called the ColdFusion MX Search Server, which creates the collection and indexes documents for the tag.
Unlike the cfcollection tag, vspider acts directly on the collection without the use of ColdFusion MX Search Server. Vspider also has the ability to create a collection on its own. However, since vspider acts directly on the collection, the ColdFusion MX Search Server has no knowledge that the collection exists. The collection won’t be available for search through ColdFusion MX unless you specify the collection information to the ColdFusion MX Search Server explicitly. To do so, use ColdFusion Administrator to register the collection with the K2 Server by specifying the "create" option, as follows:
cfcollection (action="create")
If the collection exists, as in this case, ColdFusion Administrator will simply register the collection with the K2 Server.
The following is a simple example of a command line for creating a collection called myCollection:
vspider -collection cf_root\verity\Data\Colls\myCollection -style cf_root\verity\Data\stylesets\ColdFusionVspider
Many customers use ColdFusion as the basis for their enterprise search initiatives. However, the out-of-the-box version of vspider can index and search only content stored on the same machine that hosts your ColdFusion server. Because of the sophistication of the search capabilities in Verity K2, the adoption of ColdFusion within many enterprises, and the desire to reduce and simplify development efforts, many developers want to expand the scope of the built-in vspider search capabilities to include content stored on other machines. This capability is now available through the Verity ColdFusion Search Expansion Pack. Learn more by downloading the white paper, "Understanding Verity’s ColdFusion Search Expansion Pack."
The following section contains examples of indexing using vspider in different scenarios. You can use these examples as a starting point for developing your own scripts. Notice that some of the examples described involve indexing content stored on remote web servers. Since the default vspider license is restricted to localhost indexing, the examples that describe non-localhost indexing require the Verity ColdFusion Search Expansion Pack.
You can find the vspider command-line utility in the cf_root\verity\k2\platform\bin directory of your ColdFusion MX installation. The easiest way to reference vspider is to add the \bin directory to your PATH environment variable. On Unix platforms, add the \bin directory to the LD_LIBRARY_PATH environment variable.
Vspider has a few basic parameters. The following is a simple command-line example of using vspider:
vspider -collection cf_root\verity\Data\Colls\myCollection -style cf_root\verity\Data\stylesets\ColdFusionVspider -cgiok -start http://www.macromedia.com
The definitions of the parameters are as follows:
| Parameter | Specification |
|---|---|
| -collection | Specifies the file system path to the collection. |
| -start | Specifies the starting point for crawling and indexing documents. If your website has multiple starting points, use multiple -start arguments in your command line. |
| -cgiok | Specifies that vspider will index dynamic content. Although the name suggests that vspider will index only content generated by CGI, it actually covers any dynamic content. |
| -style | Specifies the file system path to the style file that defines the schema of the collection. Notice that vspider uses its own style files, which differ from those used by ColdFusion MX. It is important to use these style files with vspider. |
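If you generate vspider command lines from a script, the four parameters above can be assembled programmatically. The following Python sketch is one way to do it. The function name vspider_args is hypothetical, and the sketch uses only the options defined in the table. It builds the argument list but does not run vspider.

```python
def vspider_args(collection, style, starts, cgiok=False):
    """Build a vspider argument list from the basic parameters.

    starts is a list of URLs; each one becomes its own -start argument.
    """
    args = ["vspider", "-collection", collection, "-style", style]
    if cgiok:
        args.append("-cgiok")  # also index dynamic content
    for url in starts:
        args += ["-start", url]
    return args


# Paths copied from the example above; cf_root stands for your install directory.
args = vspider_args(
    collection=r"cf_root\verity\Data\Colls\myCollection",
    style=r"cf_root\verity\Data\stylesets\ColdFusionVspider",
    starts=["http://www.macromedia.com"],
    cgiok=True,
)
print(" ".join(args))
```

Keeping the arguments as a list rather than a single string avoids quoting problems if you later pass them to a process launcher such as Python's subprocess.run.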
To build an index that spiders a single web server, combine the -start, -host, and -cgiok options. The syntax to use is as follows:
vspider -cmdfile /verity/vspider/intra.cmd
The file intra.cmd contains the following specifications:
-collection icd.coll -start http://sigma.macromedia.com -style cf_root\verity\Data\stylesets\ColdFusionVspider -host sigma.macromedia.com -cgiok
To build an index that spiders a single web server but excludes certain pages, combine the -start, -host, -cgiok, and -exclude options. The syntax to use is as follows:
vspider -cmdfile /verity/vspider/intra.cmd
The file intra.cmd contains the following specifications:
-collection icd.coll -start http://sigma.macromedia.com -style cf_root\verity\Data\stylesets\ColdFusionVspider -host sigma.macromedia.com -exclude */underconstruction/* -cgiok
To build an index that spiders an entire intranet, use one -start option for each starting point, together with the -domain and -cgiok options. The syntax to use is as follows:
vspider -cmdfile /verity/vspider/intra.cmd
The file intra.cmd contains the following specifications:
-collection icd.coll -start http://sigma.macromedia.com -start http://colt.macromedia.com -style cf_root\verity\Data\stylesets\ColdFusionVspider -domain macromedia.com -cgiok
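When an intranet has many web servers, you may prefer to generate the command file rather than type it. The following Python sketch builds the option string shown above from a list of host names and writes it to a file. The helper names are hypothetical. The sketch mirrors the one-line layout shown in this article, which is an assumption about what vspider accepts, so check the layout against your vspider documentation. It writes to a temporary directory rather than /verity/vspider.

```python
import os
import tempfile


def intranet_options(collection, style, hosts, domain):
    """Build the option string for an intranet crawl, one -start per host."""
    parts = ["-collection", collection]
    for host in hosts:
        parts += ["-start", f"http://{host}.{domain}"]
    parts += ["-style", style, "-domain", domain, "-cgiok"]
    return " ".join(parts)


def write_cmdfile(path, options):
    """Write the options to a command file and return the matching vspider call."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(options + "\n")
    return ["vspider", "-cmdfile", path]


options = intranet_options(
    collection="icd.coll",
    style=r"cf_root\verity\Data\stylesets\ColdFusionVspider",
    hosts=["sigma", "colt"],
    domain="macromedia.com",
)
cmdfile = os.path.join(tempfile.mkdtemp(), "intra.cmd")
command = write_cmdfile(cmdfile, options)
print(options)
print(" ".join(command))
```

Adding a server to the crawl then means adding one name to the hosts list and regenerating the file.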
The following index crawls an entire website, parsing for links to other documents, but does not index any HTML document that contains the text "welcome" in the <Title> tag. The syntax is as follows:
vspider -cmdfile /verity/spider/skip1.cmd
The file skip1.cmd contains the following specifications:
-collection icd.coll -start http://www.mysite.com -style cf_root\verity\Data\stylesets\ColdFusionVspider -indskip title "welcome"
Use the following syntax to add only Microsoft Word and Excel documents to an existing collection:
vspider -collection icd.coll -start http://www.mysite.com -indmimeinclude application/msword -indmimeinclude application/excel
The -indmimeinclude option tells vspider to index only the specified MIME types. This example contains a second instance of -indmimeinclude, which is necessary to index a second MIME type. Alternatively, you could include all values in a single instance of -indmimeinclude.
To update a large collection, but only with documents that were indexed at least 30 hours ago, use the -refresh and -refreshtime options. You do not need to specify -style, because you are updating an existing collection that already contains style files. You can use the following syntax:
vspider -cmdfile /verity/spider/update.cmd
The file update.cmd contains the following specifications:
-collection icd.coll -refresh -refreshtime 1 day 6 hours
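The value "1 day 6 hours" is how the 30-hour threshold is expressed: 24 + 6 = 30. The following Python sketch works through that arithmetic and shows which documents in a made-up list would be old enough to refresh. It handles only days and hours, the two units used in this example. It is a model of the threshold, not of how vspider parses the option, and the function names and timestamps are invented.

```python
import re
from datetime import datetime, timedelta

UNIT_HOURS = {"day": 24, "hour": 1}


def parse_refreshtime(text):
    """Convert text such as '1 day 6 hours' into a timedelta (days and hours only)."""
    hours = 0
    for amount, unit in re.findall(r"(\d+)\s*(day|hour)s?", text):
        hours += int(amount) * UNIT_HOURS[unit]
    return timedelta(hours=hours)


def due_for_refresh(last_indexed, now, threshold):
    """Return the documents whose last indexing is at least threshold old."""
    return sorted(url for url, when in last_indexed.items() if now - when >= threshold)


threshold = parse_refreshtime("1 day 6 hours")
now = datetime(2005, 1, 3, 12, 0)
last_indexed = {
    "/a.html": datetime(2005, 1, 2, 5, 0),  # indexed 31 hours before now
    "/b.html": datetime(2005, 1, 3, 0, 0),  # indexed 12 hours before now
}
print(threshold)
print(due_for_refresh(last_indexed, now, threshold))
```

Only /a.html passes the 30-hour threshold here, so it is the only document the sketch selects.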
Some of the available command-line options for vspider are as follows: -include/indinclude, -exclude/indexclude, -mimeinclude/indmimeinclude, and -mimeexclude/indmimeexclude.
These command-line options can seem a bit confusing at first. But as you use this tool, you will see that they all have their use. The difficulty is determining which option to use.
Take a look at the options -include and -indinclude. The -include option processes only pages that meet the expression criteria. Processing a page means indexing the page and following the links within the page.
In the following command-line example:
-include '*memo*' -start 'http://web.macromedia.com/docs'
The starting page does not meet the expression criteria (-include '*memo*'); therefore, vspider will not index this page or follow the links within the page. In other words, vspider will exit successfully, having indexed nothing.
On the other hand, if you specify the -indinclude option, vspider indexes only pages that meet the expression criteria. It still reads pages that do not meet the criteria, which makes it possible to follow the links within those pages. If you changed the previous command-line example to use -indinclude '*memo*', vspider would begin at the -start page, read it, and follow links from that page, perhaps to http://web.macromedia.com/docs/memos, to find and index content.
The same logic applies to the other options (-exclude/indexclude and so on). Understanding the nuances of these options will save you many headaches as you try to index your content.
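The difference between -include and -indinclude can be modeled with a small Python simulation over a made-up link graph. This is a toy model of the behavior described in this article, not vspider's implementation. It assumes shell-style wildcard matching (Python's fnmatchcase), and the function name simulate and every URL beyond the -start page are invented.

```python
from collections import deque
from fnmatch import fnmatchcase


def simulate(start, links, include=None, indinclude=None):
    """Return the pages that would be indexed under the given filter."""
    indexed = []
    queue = deque([start])
    seen = {start}
    while queue:
        url = queue.popleft()
        if include is not None and not fnmatchcase(url, include):
            continue  # -include: the page is neither indexed nor followed
        if indinclude is None or fnmatchcase(url, indinclude):
            indexed.append(url)  # -indinclude gates indexing only
        for target in links.get(url, []):  # links are still followed
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed


docs = "http://web.macromedia.com/docs"
site = {
    docs: [docs + "/memos", docs + "/faq"],
    docs + "/memos": [docs + "/memos/jan.html"],
}
print(simulate(docs, site, include="*memo*"))     # prints []
print(simulate(docs, site, indinclude="*memo*"))  # prints the two memo pages
```

With -include, the crawl stops at the start page because that page fails the filter. With -indinclude, the start page is read but not indexed, and the memo pages reached through it are indexed.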
This article explained how you can use Verity search functionality in your ColdFusion applications. The Verity Search Expansion Pack allows you to build collections from data stored on a server other than the one hosting your ColdFusion server. This feature lets you implement an enterprise-wide search solution with ColdFusion. Learn more in the Verity white paper, "Understanding Verity’s ColdFusion Search Expansion Pack."
Joe Cronin is director of Technical Services in Verity, Inc.'s Channel Partners group. He has a Bachelor of Science degree in computer engineering technology from Wentworth Institute of Technology. Verity is recognized by industry analysts such as Gartner, IDC, and the Delphi Group as the market leader in enterprise search software, including text search, classification, recommendation, monitoring, and concept extraction solutions. For more information, write to cfsearch@verity.com.