SearchMonkey Guide

Creating Custom Data Services

Custom data services let you extract interesting data from websites and web services, even when no semantic web data or DataRSS feed is available. A custom data service enables you to create a SearchMonkey application for your site rapidly, so you can try out the SearchMonkey framework without committing to creating a feed or redesigning your site templates.

Assuming your prototype proves successful, you can subsequently expose your site's data in a more robust way, by marking up your pages with embedded RDF or microformats, or by exposing a DataRSS feed. Custom data services are intrinsically slower than these other data service types, since the other types are all automatically cached by Yahoo! servers. If a custom data service is particularly slow, presentation applications that use it might automatically be displayed in the Infobar template, in order to provide users with faster access to basic search results.

If you do not own the site you are building an application for, custom data services can be very useful, because they allow you to extract data that is otherwise hard to use programmatically. However, if the site already exposes the data you need in the form of semantic web data or a DataRSS feed, you should use this faster, pre-cached data instead.

Note

When a data service accesses a page, it provides a SearchMonkey User Agent:

User-Agent: Mozilla/5.0 (compatible; Yahoo! SearchMonkey 1.0; http://developer.yahoo.com/searchmonkey/useragent)

Site owners can choose to block SearchMonkey using robots.txt or other means.
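
As one illustration of the "other means" mentioned above, a site could inspect the User-Agent header on incoming requests. The following is a minimal sketch, not a documented Yahoo! mechanism: the function name `is_searchmonkey` and the choice to match on the substring "Yahoo! SearchMonkey" are assumptions based only on the User-Agent string shown in this note.

```python
# Hypothetical helper: detect SearchMonkey requests by their User-Agent header.
# The token is taken from the User-Agent string quoted in this guide; matching
# on a substring (rather than the full string) is an assumption so that later
# version numbers would still be recognized.
SEARCHMONKEY_TOKEN = "Yahoo! SearchMonkey"


def is_searchmonkey(user_agent):
    """Return True if the User-Agent header value looks like SearchMonkey."""
    return SEARCHMONKEY_TOKEN.lower() in (user_agent or "").lower()


if __name__ == "__main__":
    ua = ("Mozilla/5.0 (compatible; Yahoo! SearchMonkey 1.0; "
          "http://developer.yahoo.com/searchmonkey/useragent)")
    print(is_searchmonkey(ua))
```

A server-side handler could call this check and return an error status for matching requests. User-Agent values are easy to spoof, so this is suitable for policy decisions rather than security.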

Converting OpenSearch to DataRSS

In “Step 4: Endpoint”, you create a stylesheet to transform XML from a remote web service into SearchMonkey's native DataRSS. However, if the web service returns OpenSearch XML response elements, SearchMonkey can automatically transform those results to DataRSS for you.

OpenSearch is a collection of simple formats for sharing search results, while DataRSS is a general catalog feed format. Even though these formats are different in nature, SearchMonkey does have a basic mapping for transforming one into the other. As an example, for the following OpenSearch result:

<?xml version="1.0" encoding="UTF-8"?>
 <feed xmlns="http://www.w3.org/2005/Atom" 
       xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
   <title>Example.com Search: New York history</title> 
   <link href="http://example.com/New+York+history"/>
   <updated>2003-12-13T18:30:02Z</updated>
   <author> <name>Example.com, Inc.</name> </author> 
   <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
   <opensearch:totalResults>4230000</opensearch:totalResults>
   <opensearch:startIndex>21</opensearch:startIndex>
   <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
   <opensearch:Query role="request" searchTerms="New York History" startPage="1" />
   <entry>
     <title>New York History</title>
     <link href="http://www.columbia.edu/cu/lweb/eguids/amerihist/nyc.html"/>
     <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
     <updated>2003-12-13T18:30:02Z</updated>
     <summary>
       ... Harlem.NYC - A virtual tour and information on 
       businesses ...  with historic photos of Columbia's own New York 
       neighborhood ... Internet Resources for the City's History. ...
     </summary>
   </entry>
 </feed>

SearchMonkey would automatically transform the feed into the following DataRSS:

<adjunctcontainer>
   <adjunct version="1.0" id="smid:assigned">
      <meta property="dc:identifier">
         http://www.columbia.edu/cu/lweb/eguids/amerihist/nyc.html
      </meta>
      <item rel="dc:subject">
         <type typeof="media:Article">
            <meta property="dc:type">searchresult</meta>
            <meta property="dc:title">New York History</meta>
            <meta property="dc:date">2003-12-13T18:30:02Z</meta>
            <meta property="dc:description">
               ... Harlem.NYC - A virtual tour and information on
               businesses ...  with historic photos of Columbia's own New York
               neighborhood ... Internet Resources for the City's History. ...
            </meta>
         </type>
      </item>
   </adjunct>
</adjunctcontainer>

To take advantage of this automated transformation, select the OpenSearch radio button in the Endpoint screen. No custom XSLT is required.

Note

Only the first <entry> element in the OpenSearch feed is added to the DataRSS <adjunct>. All other <entry> elements are ignored.
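
To make the mapping concrete, here is a minimal Python sketch that reproduces the example transformation using only the standard library. It is an illustration inferred from the sample input and output in this section, not SearchMonkey's actual implementation. The function name `opensearch_to_datarss`, the whitespace normalization, and the sample feed's second entry are all assumptions.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def opensearch_to_datarss(feed_xml):
    """Build a DataRSS-style <adjunctcontainer> from the first Atom <entry>."""
    feed = ET.fromstring(feed_xml)
    entry = feed.find(f"{ATOM}entry")  # find() returns only the first match
    if entry is None:
        raise ValueError("feed contains no <entry> element")

    def text(tag):
        # Collapse runs of whitespace so multi-line summaries become one line.
        return " ".join((entry.findtext(f"{ATOM}{tag}") or "").split())

    container = ET.Element("adjunctcontainer")
    adjunct = ET.SubElement(container, "adjunct",
                            version="1.0", id="smid:assigned")

    # <link href="..."> becomes dc:identifier.
    link = entry.find(f"{ATOM}link")
    ident = ET.SubElement(adjunct, "meta", property="dc:identifier")
    ident.text = link.get("href", "") if link is not None else ""

    item = ET.SubElement(adjunct, "item", rel="dc:subject")
    typ = ET.SubElement(item, "type", typeof="media:Article")
    mapping = [
        ("dc:type", "searchresult"),        # constant in the example output
        ("dc:title", text("title")),
        ("dc:date", text("updated")),
        ("dc:description", text("summary")),
    ]
    for prop, value in mapping:
        meta = ET.SubElement(typ, "meta", property=prop)
        meta.text = value
    return container


# Hypothetical two-entry feed; the second entry should be ignored.
SAMPLE_RESULTS = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>New York History</title>
    <link href="http://www.columbia.edu/cu/lweb/eguids/amerihist/nyc.html"/>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>... Internet Resources for the City's History. ...</summary>
  </entry>
  <entry>
    <title>Ignored second result</title>
    <link href="http://example.com/second"/>
  </entry>
</feed>"""

if __name__ == "__main__":
    result = opensearch_to_datarss(SAMPLE_RESULTS)
    print(ET.tostring(result, encoding="unicode"))
```

Using `find` rather than `findall` mirrors the rule in the note above: only the first `<entry>` contributes to the `<adjunct>`.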

Web services can use the OpenSearch protocol to provide SearchMonkey with additional structured data by extending their feeds in the following manner, using the name "opensearch" in the <adjunct>:

<feed xmlns="http://www.w3.org/2005/Atom"
       xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
       xmlns:y="http://search.yahoo.com/datarss/">
   ...
   <entry>
     <title>New York History</title>
     <link href="http://www.columbia.edu/cu/lweb/eguids/amerihist/nyc.html"/>
     ...

     <y:adjunct name="opensearch" version="1.0">
        <y:item rel="action:discuss" resource="http://columbia.edu/cu/lweb/wiki"/>
     </y:adjunct>

   </entry>
 </feed>

If the service embeds the adjunct in this manner, SearchMonkey can retrieve the contents directly. Alternatively, sites can publish a full-fledged DataRSS feed. For more information, refer to Chapter 3, Site Owner Guide.
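
For feed publishers who want to check that an embedded adjunct is discoverable, the following Python sketch reads the `name="opensearch"` adjunct from the first entry of a feed. It is only an illustration of the structure described above. The function name `embedded_items` and the `example.com` URLs in the sample are hypothetical, and nothing here describes how SearchMonkey itself parses the feed.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared in the extended feed example.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "y": "http://search.yahoo.com/datarss/",
}


def embedded_items(feed_xml):
    """Return (rel, resource) pairs from the first entry's "opensearch" adjunct."""
    feed = ET.fromstring(feed_xml)
    entry = feed.find("atom:entry", NS)
    if entry is None:
        return []
    pairs = []
    for adjunct in entry.findall("y:adjunct", NS):
        if adjunct.get("name") != "opensearch":
            continue  # skip adjuncts published under other names
        for item in adjunct.findall("y:item", NS):
            pairs.append((item.get("rel", ""), item.get("resource", "")))
    return pairs


# Hypothetical feed used only to exercise the function.
SAMPLE_EXTENDED_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:y="http://search.yahoo.com/datarss/">
  <entry>
    <title>New York History</title>
    <link href="http://example.com/nyc.html"/>
    <y:adjunct name="opensearch" version="1.0">
      <y:item rel="action:discuss" resource="http://example.com/wiki"/>
    </y:adjunct>
  </entry>
</feed>"""

if __name__ == "__main__":
    for rel, resource in embedded_items(SAMPLE_EXTENDED_FEED):
        print(rel, resource)
```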