RDF, XSLT, & the Monkey Make 3

If you've been hanging around the YDN recently, you've probably heard a thing or two about SearchMonkey.

And why not? SearchMonkey is pretty darn cool.

It lets you enhance the appearance of search results for your favorite sites. So the next time you need to look up, say, restaurants, a SearchMonkey app can distill all of the important information, like location, price range, and rating, into one place, right there in your search results.

A ton of people have been tinkering with SearchMonkey since it launched in May. One of the main reasons for this (aside from how cool it is...and never mind the $10,000 contest they held recently) is how easy it is to pick up and start developing with.

In this article, I'll go over XSLT and RDF--two of the fundamental concepts that power SearchMonkey. If you're looking to build your first app, or you've built a few and want to get more out of them, you'll definitely want to read on.

RDF

RDF stands for the Resource Description Framework. It provides a standard way to organize information into semantic units. Structuring information with RDF lets authors preserve relational and meta information. This is a very good thing.

For instance, here's a snippet of a newspaper article in unformatted plaintext:

Link By Link
This Is Funny Only if You Know Unix
By Noam Cohen
Published: May 26, 2008
For a certain subset of Internet users, “Sudo make me a sandwich” may as well be “Take my wife ... please.”

Sure, it's easy for us to identify all the different parts just by context. You could pick out the author's name, the title, and the date it was published without breaking a sweat. For computers, on the other hand, figuring all this out is tedious at best and pretty much impossible at worst. With a trillion+ webpages out there, being able to programmatically index meta-information is more important than ever.

So let's try this again, this time with RDF:


<rdf:Description
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://www.nytimes.com/2008/05/26/business/media/26link.html?_r=1&amp;oref=slogin</dc:identifier>
  <dc:relation>Link By Link</dc:relation>
  <dc:title>This Is Funny Only if You Know Unix</dc:title>
  <dc:creator>Noam Cohen</dc:creator>
  <dc:date>Mon May 26 00:00:00 -0700 2008</dc:date>
  <dc:rights>Copyright 2008 The New York Times Company</dc:rights>
  <dc:description>For a certain subset of Internet users, “Sudo make me a sandwich” may as well be “Take my wife ... please.”</dc:description>
</rdf:Description>

Now that's what I'm talking about! It may look weird with its tags showing, but those tags are exactly what a program needs. By adding a little context, programs like SearchMonkey can extract this information and organize it in meaningful ways.

One caveat that I'll get to later in this article: SearchMonkey uses a standard similar to RDF called DataRSS. The same general principles apply, so it helps to understand RDF before diving in.

When you go to the SearchMonkey application builder, you can choose to build either a Custom Data Service, which translates a web page or API call into DataRSS, or a Presentation Application, which builds on data services to produce what you actually see on a search result page. Presentation Applications are as simple as filling in the blanks in a template, so most of the work in developing with SearchMonkey is parsing data sources into the right format. That's where XSLT comes in.

XSLT

Now let's say we have this kind of information available to us. How do we actually get at it?

Well, lucky for you, SearchMonkey is built on a standard called XSLT, which was designed for just such a task.

XSLT, or eXtensible Stylesheet Language Transformations, is a W3C specification [1] for how to manipulate XML.

[1] XSLT and RDF are both specifications drafted by the World Wide Web Consortium, or W3C. Once drafted, a specification is handed down to vendors, who are _supposed_ to write software that follows the spec. XSLT is actually a pretty good example of this in practice, as opposed to CSS, which, as any web developer will tell you, isn't exactly standard across browsers.

XSLT is itself XML, which makes it familiar, if a bit verbose. Using XPath, a query language that XSLT builds on [2], you pick out nodes from the XML tree using selectors, and process them according to different templates that you specify.

[2] There's a long story behind this. Basically, XPath was motivated by XSLT and something called XPointer. XSLT itself was developed alongside another similarly-featured technology called XQuery. Wikipedia tells me there's something else called XLink, which is summed up eloquently by the title of the 2002 O'Reilly article: "XLink: Who Cares?"
It boggles the mind that a topic with so many Xs could be so unbelievably boring.

XSLT in the Sandbox

Let's say you're compiling a list of your favorite cities. Because it seemed like a good idea at the time, you decided to format it in XML, along with additional information, like each city's country, population and area. You want to put this online, but don't want to do any more work as the list grows over time.

What's that about a database?
PHP?
Bah!

Might I suggest XSLT for this heavily rhetorical situation?

Anyway, here's the list of cities:

<?xml version="1.0" encoding="UTF-8"?>
<cities>
  <city name="Kyoto">
    <country>Japan</country>
    <population>1464990</population>
    <area units="km²">1779</area>
  </city>
  <city name="San Francisco">
    <country>United States</country>
    <population>764976</population>
    <area units="km²">600</area>
  </city>
  <city name="Portland">
    <country>United States</country>
    <population>568380</population>
    <area units="km²">376</area>
  </city>
  <city name="Bremen">
    <country>Germany</country>
    <population>548477</population>
    <area units="km²">1679</area>
  </city>
  <city name="Doha">
    <country>Qatar</country>
    <population>339847</population>
    <area units="km²">2574</area>
  </city>
</cities>

Looks good. Now how do we turn this into HTML?

Even if you're not too familiar with XML, you'll notice that it bears a striking similarity to what you see when you view a webpage's source code. That's because HTML (or rather XHTML, to be precise) is XML. Since XSLT can transform one XML document into another, that means any XML document can be turned into XHTML and vice-versa.

First Iteration

Let's make a simple XSLT document to list the names of the cities in the list:


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <head></head>
      <body>
        <h1>Cities</h1>
        <ul>
          <xsl:for-each select="//city">
            <li><xsl:value-of select="@name"/></li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

This should give you a good idea of the basic way XSLT works. First, as with any XML document, we start with the XML declaration. Once we've done that, we open up the main part of the document, the <xsl:stylesheet> element. The xmlns:xsl="http://www.w3.org/1999/XSL/Transform" attribute binds the xsl: prefix to the XSLT namespace, which is how a processor knows these elements are transformation instructions.

Within the stylesheet, we can define any number of <xsl:template> elements, which trigger when the specified pattern is matched. This <xsl:template> matches on "/", the document root, meaning that it triggers at the start of a document. Processing always begins at the root, so this template triggers first and is a good place to start constructing our XHTML.

xsl:for-each allows us to iterate through all elements that match a particular query. As I mentioned before, these query strings are XPath selectors. The two slashes in //city look for every instance of a <city> element, regardless of where it sits in the hierarchy. We could also have gotten the same result, in this case, if we had used /cities/city. Attributes of an element can be accessed with the prefix @. This is how we get the name of each city for our list.

You can get a more comprehensive guide to XPath at W3 Schools.
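
To make the selector syntax a bit more concrete, here's a small sketch that runs against the cities list above. It uses the absolute path /cities/city instead of //city, and adds a predicate in square brackets to keep only some of the cities. The predicate is my own addition for illustration; the first iteration doesn't need one.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <ul>
      <!-- Absolute path: only <city> elements directly under the root <cities> -->
      <!-- The [country='United States'] predicate filters on a child element's text -->
      <xsl:for-each select="/cities/city[country='United States']">
        <li>
          <!-- @name reads the attribute; country reads the child element -->
          <xsl:value-of select="@name"/>
          <xsl:text>, </xsl:text>
          <xsl:value-of select="country"/>
        </li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
```

With the list above, only San Francisco and Portland should make it into the output.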

Second Iteration

An unordered list of cities is alright, but what if we want to display all of our information?

Here's how we could output our information into a table:


<table>
  <xsl:for-each select="//city">
    <tr>
      <td><strong><xsl:value-of select="./@name"/></strong></td>
      <td><xsl:value-of select="./population"/></td>
      <td><xsl:value-of select="./area"/> km<sup>2</sup></td>
      <td><xsl:value-of select="./country"/></td>
    </tr>
  </xsl:for-each>
</table>

Just as before, we iterate through all of the cities using <xsl:for-each>, this time wrapping everything in a tr tag. Notice that the population, area, and country are not attributes, but rather the actual contents of their respective elements. ./ is just a more explicit convention that I prefer, meaning "this element" (just like a filepath in Unix).
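
Since a stylesheet can hold more than one template, the same table could also be built by giving each city its own template instead of looping. This is just a sketch of that alternative, not something SearchMonkey requires. The <title> text and reading the units from the area element's units attribute are my own choices for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Root template: builds the page skeleton, then hands off each city -->
  <xsl:template match="/">
    <html>
      <head><title>Cities</title></head>
      <body>
        <table>
          <xsl:apply-templates select="/cities/city"/>
        </table>
      </body>
    </html>
  </xsl:template>

  <!-- Fires once per <city>; paths in here are relative to that city -->
  <xsl:template match="city">
    <tr>
      <td><strong><xsl:value-of select="@name"/></strong></td>
      <td><xsl:value-of select="population"/></td>
      <td>
        <xsl:value-of select="area"/>
        <xsl:text> </xsl:text>
        <!-- Units come from the data rather than being hard-coded -->
        <xsl:value-of select="area/@units"/>
      </td>
      <td><xsl:value-of select="country"/></td>
    </tr>
  </xsl:template>
</xsl:stylesheet>
```

For a list this small, the for-each version is arguably easier to read. Separate templates tend to pay off once different kinds of elements need different treatment.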

Third Iteration

As a really quick example to throw out there, this final iteration shows off some advanced features that you might find useful.


<h1>My <xsl:value-of select="count(//city)"/> Favorite Cities</h1>
<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Population</th>
      <th>Area (sq miles)</th>
      <th>Country</th>
    </tr>
  </thead>
  <tbody>
    <xsl:for-each select="//city">
      <xsl:sort select="./@name"/>
      <tr>
        <td><strong><xsl:value-of select="./@name"/></strong></td>
        <td><xsl:value-of select="format-number(./population, '#,###')"/></td>
        <td><xsl:value-of select="format-number(./area * 0.386102159, '#,###')"/></td>
        <td><xsl:value-of select="./country"/></td>
      </tr>
    </xsl:for-each>
  </tbody>
</table>

In our <h1>, we count the number of elements in a query result by using the function count. There are a lot of useful built-in functions like this. Again, check out the W3 Schools reference for a good listing.
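
As one more example of a built-in function, here's a rough sketch that pairs count() with sum() to total up the populations in the cities list. The sentence it prints is made up for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <p>
      <!-- count() returns how many nodes the query matched -->
      <xsl:value-of select="count(//city)"/>
      <xsl:text> cities, </xsl:text>
      <!-- sum() adds up the numeric values of the matched nodes -->
      <xsl:value-of select="format-number(sum(//city/population), '#,###')"/>
      <xsl:text> people in total</xsl:text>
    </p>
  </xsl:template>
</xsl:stylesheet>
```

Against the list above, the paragraph should read "5 cities, 3,686,670 people in total".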

Another thing you can do is sort the results of an <xsl:for-each> by using the <xsl:sort> element. Its select attribute points to the value you want to order by for each element. In our case, we're ordering our list alphabetically by the name of the city.

Finally, you might want to format or operate on a value using XSLT. In this example, we're doing both: we convert the area from km² into square miles by multiplying by the conversion ratio, and then format the result with comma separators. format-number(), like count(), is built in: count() comes from XPath, and format-number() from XSLT itself.
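
Sorting isn't limited to alphabetical order. As a sketch, here's the list ranked by population, largest first, using the data-type and order attributes of <xsl:sort>. The ordered-list markup is just my choice for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <ol>
      <xsl:for-each select="//city">
        <!-- data-type="number" compares values numerically instead of as text -->
        <xsl:sort select="population" data-type="number" order="descending"/>
        <li>
          <xsl:value-of select="@name"/>
          <xsl:text>: </xsl:text>
          <xsl:value-of select="format-number(population, '#,###')"/>
        </li>
      </xsl:for-each>
    </ol>
  </xsl:template>
</xsl:stylesheet>
```

The cities list happens to be in descending population order already, so this should come out as Kyoto, San Francisco, Portland, Bremen, Doha. Without data-type="number", the populations would be compared as strings.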

XSLT in Practice

Now that we're using XSLT with a fair degree of confidence, we're ready to go back to our NY Times article. Our example showed how the article might be marked up using RDF, but since we want to use it with SearchMonkey, we'll want to translate that into DataRSS.

Unfortunately, in the real world, markup is wildly inconsistent across different websites. You can never tell how easy it will be to get to the information you want without looking under the hood at the page source code. If you're lucky, your website of interest gives special classes or ids to things you want to get at. If you're not so lucky, there may actually be no consistent solution for how to access information across the site.

In our case, nytimes.com doesn't just have good markup; it even has some custom tags, like <NYT_BYLINE>, that we can take advantage of. Here's what I came up with:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <adjunctcontainer>
      <adjunct id="smid:{$smid}" version="1.0">
        <item rel="rel:Posting">
          <meta property="dc:identifier"><xsl:value-of select="//meta[@name='articleid']/@content"/></meta>
          <meta property="dc:type">News Story</meta>
          <meta property="dc:title"><xsl:value-of select="//NYT_HEADLINE"/></meta>
          <meta property="dc:creator"><xsl:value-of select="//NYT_BYLINE"/></meta>
          <meta property="dc:date"><xsl:value-of select="//meta[@name='pdate']/@content"/></meta>
          <meta property="dc:summary"><xsl:value-of select="//NYT_TEXT/p"/></meta>

          <item rel="rel:Photo" resource="{//div[@class='image']//img/@src}">
            <meta property="media:width"><xsl:value-of select="//div[@class='image']//img/@width"/></meta>
            <meta property="media:height"><xsl:value-of select="//div[@class='image']//img/@height"/></meta>
          </item>
        </item>
      </adjunct>
    </adjunctcontainer>
  </xsl:template>
</xsl:stylesheet>

This should give you a good idea of what to expect with DataRSS. For the most part, you'll be worrying about items and their contents. Items have a rel property that specifies exactly what's represented, whether it's a person, article, photo, etc. Here are the docs for rel properties.

Items have meta elements, with property attributes that work kind of like regular RDF tags. For example, dc:title is a property now, whereas in RDF it was its own tag. Check out the docs to find all of the available properties in DataRSS.

Once you've built a data service in SearchMonkey, you can start writing a presentation layer to pull from it. Since you'll be referencing values pulled from DataRSS, there's a big incentive for diligence in how you name elements. Don't obsess over it, but use the available rel values and properties as best you can.
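
Because real-world markup is so inconsistent, it can help to guard against missing pieces. The sketch below is a trimmed-down variation on the stylesheet above. It only emits the photo item when the page has a matching image, and it tidies the headline with normalize-space(). The <xsl:param> declaration and its 'example' default are assumptions on my part, there only so the sketch runs outside of SearchMonkey, which normally supplies $smid itself.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Assumption: declared here only so the sketch runs outside SearchMonkey -->
  <xsl:param name="smid" select="'example'"/>

  <xsl:template match="/">
    <adjunctcontainer>
      <adjunct id="smid:{$smid}" version="1.0">
        <item rel="rel:Posting">
          <!-- normalize-space() trims and collapses stray whitespace -->
          <meta property="dc:title"><xsl:value-of select="normalize-space(//NYT_HEADLINE)"/></meta>

          <!-- Only emit the photo item when the page actually has an image -->
          <xsl:if test="//div[@class='image']//img">
            <item rel="rel:Photo" resource="{//div[@class='image']//img/@src}">
              <meta property="media:width"><xsl:value-of select="//div[@class='image']//img/@width"/></meta>
              <meta property="media:height"><xsl:value-of select="//div[@class='image']//img/@height"/></meta>
            </item>
          </xsl:if>
        </item>
      </adjunct>
    </adjunctcontainer>
  </xsl:template>
</xsl:stylesheet>
```

An article without a photo would then produce a Posting item with no empty Photo item hanging off of it.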

Now Start Building!

XSLT and RDF are pretty simple on their own. Scraping sites, however, is pretty tedious. Although we didn't build a SearchMonkey Application in this article, we covered all of the basics of how you might go about it.

Once you've gotten the hang of it, it starts to be pretty fun, so get out there and start hacking with SearchMonkey!

Mattt Thompson
YDN Tech Evangelist

Editor's note: Props and fond farewells to Mattt Thompson, who's heading back to college. Mattt, you'll be missed here at the YDN. So long and thanks for all the fishsticks.