The RSS Watch Sample App (Part 1): Monitoring RSS Feeds Automatically

Raymond Camden

www.coldfusionjedi.com

Blogs have exploded on the net and are one of the most popular ways to keep up to date with a variety of topics and personalities. However, with this explosion of blogs to read, it's hard to keep up with the amount of generated content. This has given rise to blog readers that you use to get the latest headlines from blogs, and blog aggregators, which bring together multiple blogs into one easy-to-read website.

Even with these tools to help you keep up to date, it still isn’t completely easy to monitor all the blogs you would like to watch. What if ColdFusion could read the blogs for you and let you know when it discovers certain keywords?

That need gave me the idea for the RSS Watch sample application. RSS Watch monitors any RSS feed and returns matches when it finds a keyword. The main functionality is in a ColdFusion component (CFC). The application also uses XML, the CFHTTP tag to download the feeds, and the CFSCHEDULE tag to run the utility hourly, and it sends e-mail when it finds results.

This is part one of a two-part series. After reading part one, check out The RSS Watch Sample App (Part 2): Improving and Enhancing the Application.

Requirements

To complete this tutorial you will need to install the following software and files:

ColdFusion MX 6.1

Dreamweaver MX 2004

Tutorials and sample files:

Designing the CFC

The CFC, named rssWatch, contains the majority of functionality. It has three major tasks or methods:

  1. The getSearches method: This method loads a configuration file that specifies the keywords to look for and the RSS feeds in which to look for them.
  2. The rssParse method: This method takes the XML of a downloaded RSS feed and translates it into a set of data.
  3. The processSearches method: For each search, this method downloads the feeds, passes each one to rssParse, and scans the resulting data to see if a keyword matches. If so, it adds the data to the result set.

If you have downloaded the source code for this application, open rssWatch.cfc. The CFC contains the three methods I described above. I start with the method that specifies the keywords that the CFC searches for, the getSearches method.

The getSearches Method

The first method, getSearches, returns a set of data specifying the words to search for and which RSS feed to examine. The method returns an array. Each item in the array is a structure with two keys: terms and rss. Terms is a string that represents the search terms; RSS is an array of URLs that are valid RSS feeds.

What’s interesting is that when I began designing this CFC, I wasn’t sure how I wanted to store the configuration data. I knew it would be in an XML file, but I wasn’t 100% sure of how I would set it up. For that reason, the original version of the CFC method contains static data, as shown in Listing 1.

Listing 1 : Original version of the getSearches method, which contains static data

<cffunction name="getSearches" returnType="array" 
output="false" access="private"
		hint="Handles getting search data and returning it to the processor">
	
	<cfset var aSearches = arrayNew(1)>
	
	<cfset aSearches[1] = structNew()>
	<cfset aSearches[1].terms = "camden">
	<cfset aSearches[1].rss = arrayNew(1)>
	<cfset aSearches[1].rss[1] = "http://www.fullasagoog.com/xml/ColdFusionMX.xml">

	<cfset aSearches[2] = structNew()>
	<cfset aSearches[2].terms = "blog">
	<cfset aSearches[2].rss = arrayNew(1)>
	<cfset aSearches[2].rss[1] = "http://www.camdenfamily.com/morpheus/blog/rss.cfm?mode=short&">
	
	<cfreturn aSearches>
	
</cffunction>

If you remember, I said the return data is an array of structures. In the code above, I’ve simply hard-coded two items. The first item searches the RSS feed, FullAsAGoog, for any mention of my last name. (What? You mean you didn’t realize I was simply building this as an ego booster?!?) The second item searches for the word "blog" in the RSS feed for my own blog.

Obviously this static data isn’t the final version of the method, but because it already returns data in the structure the rest of the CFC expects, I can leave this code as is and focus on the more difficult aspects of the application.

The processSearches Method

As you can imagine, the processSearches method is the main method for performing the searches. Because it is a complex method, take a look at it line by line with me. Start with part 1 of the method:

Listing 2 : processSearches method (part 1)

<cffunction name="processSearches" returnType="array" output="false"
			hint="Processes all the searches.">
	
	<cfset var x = "">
	<cfset var y = "">
	<cfset var z = "">
	<cfset var mySearches = getSearches()>
	<cfset var rssItems = "">
	<cfset var result = arrayNew(1)>

The cffunction tag declares the method and sets its returnType attribute to array, because the CFC returns any matches as an array of structures. I define the exact structure a bit later. Next comes a set of var-scoped variables. Use the var keyword for any variable that you create inside a method and that should exist only while that method executes. If you omit var, the variable lands in the Variables scope of the CFC, where it persists between calls and is visible to every method of the instance. Note the line that defines the mySearches variable: it is populated by calling the getSearches() method, defined earlier.
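To make the difference concrete, here is a minimal sketch of a throwaway component. The component name scopeDemo and all of its method and variable names are hypothetical and are not part of rssWatch.cfc.

```cfm
<!--- scopeDemo.cfc: hypothetical component, for illustration only --->
<cfcomponent>

	<!--- counter is declared with var, so it is local to each call --->
	<cffunction name="withVar" returnType="string" output="false">
		<cfset var counter = "local">
		<cfreturn counter>
	</cffunction>

	<!--- leaked has no var keyword, so it is written to the CFC's Variables scope --->
	<cffunction name="withoutVar" returnType="string" output="false">
		<cfset leaked = "shared">
		<cfreturn leaked>
	</cffunction>

	<!--- Reports whether the unscoped variable is now part of the instance --->
	<cffunction name="hasLeaked" returnType="boolean" output="false">
		<cfreturn structKeyExists(variables, "leaked")>
	</cffunction>

</cfcomponent>
```

If you create an instance, call withoutVar(), and then call hasLeaked(), the last call should report that the variable is still there. Calling withVar() leaves nothing behind.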

Listing 3 : processSearches (part 2)

<cfloop index="x" from="1" to="#arrayLen(mySearches)#">
		
	<cfloop index="y" from="1" to="#arrayLen(mySearches[x].rss)#">

The main portion of the method has two loops. The first loop iterates over the top-level items in mySearches. If you remember, the getSearches method returns an array of structures. This loop simply iterates over each of those structures. The structure in each array element contains two keys: terms, which contains the terms to search for, and rss, an array of RSS feeds. Therefore, the second loop iterates over the rss array.

Listing 4 : processSearches (part 3)

<!--- See if we have this URL in cache already --->
<cfif not structKeyExists(variables.httpCache, mySearches[x].rss[y])>
	<cfhttp url="#mySearches[x].rss[y]#">
	<cfset variables.httpCache[mySearches[x].rss[y]] = cfhttp.fileContent>
</cfif>

For each RSS feed, you must use the CFHTTP tag to download the result. As you can imagine, this is the slowest part of the method. It is possible, however, that your searches will reuse the same RSS feed. For example, you may search one RSS feed for "Camden" and "ColdFusion." Therefore, in the CFC, I created a cache to store the results of the CFHTTP tags. You define this cache, called variables.httpCache, in the constructor area of the CFC on line 3:

<cfset variables.httpCache = structNew()>

The code in Listing 4 simply says: if the CFC has not yet downloaded this feed, fetch it and add it to the cache.

Listing 5 : processSearches (part 4)

<cfset rssItems = rssParse(variables.httpCache[mySearches[x].rss[y]])>

<!--- check result to see if our term is matched --->
<cfloop index="z" from="1" to="#arrayLen(rssItems)#">
	<cfif findNoCase(mySearches[x].terms, rssItems[z].title) or
         findNoCase(mySearches[x].terms, rssItems[z].description)>
		<cfset result[arrayLen(result)+1] = structNew()>
		<cfset result[arrayLen(result)].terms = mySearches[x].terms>
		<cfset result[arrayLen(result)].rss = mySearches[x].rss[y]>
		<cfset result[arrayLen(result)].matchedItem = rssItems[z]>
	</cfif>
</cfloop>

The first line in Listing 5 calls the rssParse method. I will discuss this method later on. Basically the method converts an XML string into an array of RSS items that you can search. The code then loops over the resulting array and, for each item, checks the title and description to see if either contains the current search terms. If it finds a match, it adds a new element to the result array. The result is an array of structures, where each element contains three keys: terms (the search terms that matched), rss (the URL of the feed in which the match was found), and matchedItem (the matching RSS item, itself a structure holding the title, description, and link).

That’s it! The rest of the method simply closes the loops you opened and returns the result variable.
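For completeness, the closing lines described above would look roughly like this. This is a sketch reconstructed from the description, assuming nothing else happens after the inner loop; the comments are mine.

```cfm
		<!--- closes the loop over mySearches[x].rss (index y) --->
		</cfloop>

	<!--- closes the loop over mySearches (index x) --->
	</cfloop>

	<!--- hand the accumulated matches back to the caller --->
	<cfreturn result>

</cffunction>
```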

The rssParse Method

For the processSearches method to work correctly, it must convert the XML returned from the RSS feed into a simple array of items for the method to search. The problem is that there are different kinds of RSS feeds. The rssParse method must be able to handle any of them. For this article, however, I keep things simple. I tried various feeds, and when the function broke, I modified the method until it worked again. A more scientific approach would have been better of course, but because this method is isolated from the rest of the code, I can add support for additional types of RSS feeds without touching anything else. Take a look at the method in Listing 6.

Listing 6 : rssParse

<cffunction name="rssParse" returnType="array" output="true"
	       hint="Attempts to parse the RSS feed for the items.">
		
	<cfargument name="packet" type="string" required="true">
	<cfset var xmlData = "">
	<cfset var result = arrayNew(1)>
	<cfset var x = "">
	<cfset var items = "">
	<cfset var xPath = "">
	<cfset var node = "">

	<cftry>
		<cfset xmlData = xmlParse(arguments.packet)>
		<cfif xmlData.xmlRoot.xmlName is "rss">
			<cfset xPath = "//item">
		<cfelse>
			<cfset xPath = "//:item">
		</cfif>
			
		<cfset items = xmlSearch(xmlData,xPath)>
		
		<cfloop index="x" from="1" to="#arrayLen(items)#">
			<cfset node = structNew()>
			<cfset node.title = items[x].title.xmlText>
			<cfset node.description = items[x].description.xmlText>
			<cfset node.link = items[x].link.xmlText>
			<cfset result[arrayLen(result)+1] = duplicate(node)>
		</cfloop>
		<cfcatch>
			<cfif isDebugMode()><cfdump var="#cfcatch#"></cfif>
		</cfcatch>
	</cftry>
				
	<cfreturn result>
	
</cffunction>

The method begins with an argument declaration followed by a set of var-scoped variables. I don’t spend any time on most of these as they are self-explanatory. Note the result variable, however. This is the array that the method returns. Each element of the array is a structure containing the title of the RSS item, the description, and the link.

The major part of the function begins with the CFTRY tag. I wrap everything in a CFTRY tag so that the method ignores invalid RSS feeds. It would probably be better to log these bad feeds so that you can double check that you used the right URLs. But for right now, the method simply ignores them.
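If you do want a record of the bad feeds, one option is to add a cflog tag to the cfcatch block. The following is a sketch only: the log file name rssWatch and the message wording are my assumptions. Because rssParse receives only the XML string, the URL of the feed is not available here, so the entry records the error message instead.

```cfm
<cfcatch>
	<!--- Assumption: entries go to a custom log file named rssWatch.log --->
	<cflog file="rssWatch" type="warning"
		text="rssParse could not parse a feed: #cfcatch.message#">
	<cfif isDebugMode()><cfdump var="#cfcatch#"></cfif>
</cfcatch>
```

To log the URL as well, you would need to pass it to rssParse as an extra argument or do the logging in processSearches, where the URL is known.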

The method converts the string passed to the method into a valid XML object using the XMLParse() function. I encountered two types of RSS feeds during my testing. The first type was wrapped in <rss> tags. The second type was wrapped in <rdf:RDF> tags. For each of these, I use the xmlSearch function to retrieve the items. For the feeds using <rss>, I use a value of //item. For the <rdf:RDF> variety I use //:item. You pass this value to the XMLSearch function as the XPath value. XPath is a complex syntax for searching XML documents; I encourage you to do some research on it. For our purposes, it converts the XML document into a very simple array of items. I loop over each item and create a node structure. I populate it with the title, description, and link property for each item. Once complete, I copy the node to the end of my result array. Finally, the method returns the result.
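If you have not used xmlSearch before, the following standalone sketch shows the //item search on a tiny feed of the <rss> variety. The feed content, URLs, and variable names are made up for illustration, and the XML is built inline with cfxml so that nothing has to be downloaded.

```cfm
<!--- Hypothetical two-item feed, built inline for the demo --->
<cfxml variable="feed">
<rss version="2.0">
	<channel>
		<title>Sample feed</title>
		<item>
			<title>First post</title>
			<description>Something about ColdFusion</description>
			<link>http://example.com/1</link>
		</item>
		<item>
			<title>Second post</title>
			<description>Something about blogs</description>
			<link>http://example.com/2</link>
		</item>
	</channel>
</rss>
</cfxml>

<!--- //item selects every item element, no matter how deeply it is nested --->
<cfset items = xmlSearch(feed, "//item")>

<cfoutput>
Found #arrayLen(items)# item(s)<br>
<cfloop index="i" from="1" to="#arrayLen(items)#">
	#items[i].title.xmlText# (#items[i].link.xmlText#)<br>
</cfloop>
</cfoutput>
```

Each element of the returned array is an XML node, which is why rssParse can read items[x].title.xmlText and its siblings directly.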

Note the following code: <cfif isDebugMode()><cfdump var="#cfcatch#"></cfif>. When the current request is in debug mode, I don’t suppress the error; I output it so that I can examine it. I never enable debug mode on a live server, so on the production machine the condition is false and ColdFusion skips the dump.

Testing the CFC

The CFC still has a "fake" method to gather the terms and RSS feeds to search, but you can still test it. Listing 7 simply creates an instance and runs the processSearches method.

Listing 7 : test.cfm

<cfset x = createObject("component","rssWatch")>
<cfset results = x.processSearches()>
<cfdump var="#results#">

If you run this code, your results will vary depending on the current content of the feeds. Figure 1 shows an example of the output.

Figure 1. The RSS search results displayed with the cfdump tag

Now that you know it works, replace the getSearches method with the redesigned method in Listing 8.

Listing 8 : getSearches method

<cffunction name="getSearches" returnType="array" 
output="false" access="private"
		hint="Handles getting search data and returning it to the processor">
	<cfset var data = "">
	<cfset var packet = "">
	<cfset var searches = "">
	<cfset var result = arrayNew(1)>
	<cfset var x = "">
	<cfset var y = "">
		
	<cffile action="read" file="#expandPath("./searches.xml")#" variable="data">
	<cfset packet = xmlParse(data)>
	<cfset searches = xmlSearch(packet,"//search")>

	<cfloop index="x" from="1" to="#arrayLen(searches)#">
		
		<cfset result[arrayLen(result)+1] = structNew()>
		<cfset result[arrayLen(result)].terms = searches[x].terms.xmlText>
		
		<cfset result[arrayLen(result)].rss = arrayNew(1)>
		
		<cfloop index="y" from="1" to="#arrayLen(searches[x].rss)#">
			<cfset result[arrayLen(result)].rss[arrayLen(result[arrayLen(result)].rss)+1] = searches[x].rss[y].xmlText>
		</cfloop>
				
	</cfloop>

		
	<cfreturn result>
</cffunction>

The code begins by reading an XML file. The code assumes that the file exists in the same directory as the CFC and is named searches.xml. This file follows a simple format: A root <searches> node surrounds a set of search nodes. Here is an example node:

<search>
	<terms>xbox</terms>
	<rss>http://www.engadget.com/rss.xml</rss>
	<rss>http://www.gizmodo.com/index.xml</rss>
</search>

Each node contains terms to search for and a set of RSS feeds to check. Once the code reads the file and parses it into XML, it runs an XPath search on it using //search. This returns an array of all the search nodes in the XML packet. The code loops over these nodes and adds the data to the result array. As you can see, leaving this method until last let me put off deciding how the XML packet would look until the rest of the application worked.
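To see the whole file format in one place, the sketch below builds a complete packet inline, root node included, and runs the same //search lookup on it. The packet stands in for the contents of searches.xml. It reuses feeds mentioned in this article, and the variable names are illustrative.

```cfm
<!--- Hypothetical stand-in for the contents of searches.xml --->
<cfxml variable="packet">
<searches>
	<search>
		<terms>camden</terms>
		<rss>http://www.fullasagoog.com/xml/ColdFusionMX.xml</rss>
	</search>
	<search>
		<terms>xbox</terms>
		<rss>http://www.engadget.com/rss.xml</rss>
		<rss>http://www.gizmodo.com/index.xml</rss>
	</search>
</searches>
</cfxml>

<!--- Same XPath that getSearches uses --->
<cfset searches = xmlSearch(packet, "//search")>

<cfoutput>
<cfloop index="x" from="1" to="#arrayLen(searches)#">
	<!--- searches[x].rss addresses all rss children of this search node --->
	#searches[x].terms.xmlText#: #arrayLen(searches[x].rss)# feed(s)<br>
</cfloop>
</cfoutput>
```

Because a search node may repeat the rss element, getSearches treats searches[x].rss as an array and copies each entry in turn.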

Putting It All Together

So, now that you have created the RSSWatch tool, you can use it. Create a script that runs the tool and sends the results in an e-mail. Then, schedule the process to run hourly.

Listing 9 : runner.cfm

<cfset rssWatch = createObject("component","rssWatch")>
<cfset matches = rssWatch.processSearches()>

<cfif arrayLen(matches)>
	<cfmail to="ray@camdenfamily.com"
			from="rsswatch@127.0.0.1"
			subject="RSSWatch Matches Found!" type="html">
<style>
h2 {
	font-family: Arial;
	font-weight: bold
}

p {
	font-family: Arial;
}
</style>

<h2>RSSWatch Search Results</h2>

<p>

Here are the results from your RSSWatch process.<br>
This search was done at #timeFormat(now(),"h:mm tt")# on #dateFormat(now(),"m/d/yy")#<br>
There were <b>#arrayLen(matches)#</b> match(es)
</p>

<cfloop index="x" from="1" to="#arrayLen(matches)#">
<p>
#x#) Matched <b>#matches[x].terms#</b><br>
Feed: #matches[x].rss#<br>
Title: <a href="#matches[x].matchedItem.link#">#matches[x].matchedItem.title#</a><br>
Description: #matches[x].matchedItem.description#<br>
</p>
</cfloop>

	</cfmail>
</cfif>

The first two lines act just like the test template. They create an instance of the CFC and then run the processSearches method. If there are any results, the code uses the cfmail tag to create an HTML-based e-mail and then loops over the results. For each match, the e-mail shows the term that matched along with the RSS feed URL, item title, and description. Once you have scheduled the process in ColdFusion Administrator, sit back and let the application mail you when it finds a blog entry you're interested in.
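If you would rather create the schedule in code than in ColdFusion Administrator, the cfschedule tag can register the same task. Treat the following as a sketch: the task name and the URL are assumptions, so substitute the address where runner.cfm actually lives on your server.

```cfm
<!--- Run once to create (or update) the hourly task --->
<cfschedule action="update"
	task="RSSWatch"
	operation="HTTPRequest"
	url="http://127.0.0.1/rsswatch/runner.cfm"
	startDate="#dateFormat(now(), 'mm/dd/yyyy')#"
	startTime="12:00 AM"
	interval="3600">
```

The interval attribute is expressed in seconds, so 3600 requests runner.cfm once an hour.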

You can check out my live, multi-user version of the RSS Watcher application on my website. It's based on the code in this article. In this live version, you can register to receive e-mail notifications when the application finds matches in the RSS feeds you specify.

About the author

Raymond Camden is the owner of Camden Media, Inc., a web development and training company. A longtime ColdFusion user, Raymond has worked on numerous ColdFusion books, including the ColdFusion Web Application Construction Kit, and has contributed to the Fusion Authority Quarterly Update and the ColdFusion Developers Journal. He also presents at numerous conferences and contributes to online webzines. He founded many community web sites, including CFLib.org, ColdFusionPortal.org, and ColdFusionCookbook.org, and is the author of open source applications, including the popular BlogCFC blogging application.

Raymond can be reached at his blog or via email at ray@camdenfamily.com. He is the happily married proud father of three kids and is somewhat of a Star Wars nut.