
The third step of creating a data service is to specify your page extraction rules. Given a particular page structure, you must specify an XSLT stylesheet to extract the desired data and represent it as DataRSS. This step is the heart of your data service.
Note: If there are any problems with the extraction code, the Preview Pane displays a bulleted list of warnings and errors.
Page Extraction Rules — Specifies an XSLT stylesheet for extracting data from target pages.
Important: The XSLT transformation acts on a cleaned-up version of the page's markup, whether that markup is XHTML or HTML. SearchMonkey attempts to close open tags, quote attribute values, and otherwise represent the page as a well-formed XML source document that XSLT can operate upon.
SearchMonkey sets the URL of the target page as a parameter,
{$CURRURL}. Any namespaces that appear in your XPath
expressions are automatically stripped out.
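As a rough sketch of how the `CURRURL` parameter might be used, the stylesheet below records the page's own URL alongside each photo on a page that wraps its photos in `div` elements of class `photo`. The `xsl:param` declarations are an assumption, added so the stylesheet can also run in a standalone XSLT 1.0 processor; SearchMonkey sets these values itself and may not need explicit declarations. The `dc:source` property is only an illustrative choice.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!-- Assumption: declared for local testing only; SearchMonkey sets these itself. -->
  <xsl:param name="CURRURL"/>
  <xsl:param name="smid"/>

  <xsl:template match="/">
    <adjunctcontainer>
      <adjunct id="smid:{$smid}" version="1.0">
        <!-- Element names are written without a namespace prefix. -->
        <xsl:for-each select="//div[@class='photo']">
          <item rel="media:image" resource="{.//a/@href}">
            <!-- Illustrative property: the URL of the page the photo came from. -->
            <meta property="dc:source">
              <xsl:value-of select="$CURRURL"/>
            </meta>
          </item>
        </xsl:for-each>
      </adjunct>
    </adjunctcontainer>
  </xsl:template>
</xsl:stylesheet>
```

To try a stylesheet like this outside SearchMonkey, a command-line processor such as `xsltproc` can pass the values in, for example `xsltproc --html --stringparam CURRURL http://foo.com/photos.html --stringparam smid EXAMPLE extract.xsl page.html`. The file names and parameter values here are placeholders.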
Tip: For assistance with writing XPath expressions, use the Firefox extension XPather. Once you install XPather, you can select text on a web page, right-click, and select the XPather option from the context menu. XPather then generates an XPath expression designed to extract the selected text, which you can then use in your data service. Although you can use the provided textarea to develop your XSLT code, you are also welcome to use your own editor or IDE.
— Links to the short example stylesheet Example 2.2, “Extracting a Photo from a Page with XSLT”, annotated with comments. For an alternative XSLT example, refer to “Creating a (Page) Custom Data Service”.
Example 2.2. Extracting a Photo from a Page with XSLT
Consider a snippet of HTML resembling:
<html>
...
<div class="photo">
<h2 class="title">Vienna in Spring</h2>
<p class="date">2008-04-13</p>
<div class="thumb">
<a href="http://foo.com/images/vienna-large.jpg">
<img src="http://foo.com/images/vienna-thumb.jpg" alt="Vienna ..."/>
</a>
</div>
</div>
...
Although the page lacks microformat or RDF markup, there is still some useful information here that we would like to extract.
The example stylesheet extracts information from the target page and transforms it into a short snippet of valid DataRSS. The stylesheet is designed to act upon an HTML page with the basic structure above.
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <adjunctcontainer>
      <meta property="assert:nonzero">
        <xsl:value-of select="count(//div[@class='photo'])"/>
      </meta>
      <adjunct id="smid:{$smid}" version="1.0">
        <xsl:for-each select="//div[@class='photo']">
          <item rel="media:image" resource="{.//a/@href}">
            <meta property="dc:title">
              <xsl:value-of select=".//h2[@class='title']"/>
            </meta>
            <meta property="dc:date">
              <xsl:value-of select=".//p[@class='date']"/>
            </meta>
            <item rel="media:thumbnail" resource="{.//div[@class='thumb']//img/@src}"/>
          </item>
        </xsl:for-each>
      </adjunct>
    </adjunctcontainer>
  </xsl:template>
</xsl:stylesheet>
The example stylesheet has the following components:

- Boilerplate code: "Start matching templates at the root node."
- Boilerplate code: Specifies the root element for extracted data, <adjunctcontainer>.
- Specifies an
- Declares a for-each loop over every <div class="photo"> element on the page.
- Provides a container for a set of interesting related data; in this case, metadata about a photo.
- Describes some data on the page.
- A
- Describes another container for related data. You can nest <item> elements as deep as necessary to represent the structure of your data properly. In this case, the child <item> describes the photo's thumbnail.
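For the HTML snippet above, the stylesheet's output would look roughly like the following. This assumes the elided parts of the page contain no other `div` with class `photo`, so the count is 1. The `smid` value `EXAMPLE` is a placeholder, since the real identifier is supplied by SearchMonkey. The output is indented here for readability; the stylesheet itself does not request indentation.

```xml
<adjunctcontainer>
  <meta property="assert:nonzero">1</meta>
  <adjunct id="smid:EXAMPLE" version="1.0">
    <item rel="media:image" resource="http://foo.com/images/vienna-large.jpg">
      <meta property="dc:title">Vienna in Spring</meta>
      <meta property="dc:date">2008-04-13</meta>
      <item rel="media:thumbnail" resource="http://foo.com/images/vienna-thumb.jpg"/>
    </item>
  </adjunct>
</adjunctcontainer>
```

Each `resource` attribute is filled in by an attribute value template (the expression in curly braces), which takes the string value of the first node the XPath expression selects.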
— Saves your changes and continues to “Step 4: Confirmation”.
— Saves your changes and returns to “Step 2: URLs”.
— Returns to the Application Dashboard.
At the bottom of the screen is the Preview Pane, which displays the results of your data service for one of your test URLs.
If all is well, the Preview Pane should
display an HTML bulleted list of name/value pairs representing the
DataRSS structure for that URL. <item>
rel attributes and <meta>
property attributes appear as regular text, while literal
<meta> values and the values of resource
attributes appear in bold. If there are any problems with the extraction
code, the Preview Pane displays a bulleted list of
warnings and errors.
The Preview Pane contains several controls for cycling through your test URLs and determining your data inputs and outputs:
— Saves any changes to
your XSLT and calls your data service again, displaying the results
in the Preview Pane. When you click this control for the first time,
SearchMonkey fetches content for all of your test URLs, which can
cause a delay and a "Loading, Please Wait"
message.
— Displays a link to the URL being tested. The link opens the test URL in a new window. Use this option to verify that your data service is retrieving the correct data from the correct page.
— Cycle the Preview Pane through your test URLs.
— Indicates which URL M of the total N is being tested. You can jump to any test URL by clicking on the corresponding dot.
— Displays the XHTML content of the URL being tested in a new window. If the page is HTML or invalid XHTML, SearchMonkey displays the cleaned-up version that your XSLT transformation is actually acting against. Use this option to verify that the test URL has the structure that you think it does.
— Displays the DataRSS XML output of the URL being tested in a new window. Use this option to verify that your data service is producing the correct DataRSS.