SearchMonkey Guide

Creating a (Web Service) Custom Data Service

The following tutorial explains how to create an example "Web Service" style custom data service. In this example, we create a custom data service that, given a search result's URL, queries the Yahoo! Site Explorer API and fetches two links that are closely related to that URL. Before trying this tutorial, you should be familiar with basic SearchMonkey concepts and SearchMonkey's screens for creating custom data services.

Note

In order to keep the entire example simple and self-contained, this tutorial uses input data from the Yahoo! Index. Most real custom data services of this type will rely on input parameters from outside the Yahoo! Index.

  1. From the main SearchMonkey Applications screen, click Create a new data service. SearchMonkey displays “Step 1: Basic Info”.

  2. Enter a Name: "Test Site Explorer Data Service"

  3. For Type, click Web Service.

  4. Enter a Description: "A test data service for Yahoo! Search's Site Explorer. Passes in the search result's URL and retrieves related URLs from Site Explorer's 'pageData' web service."

    Even if you don't plan to share your data service, the description is still useful for private development. This is particularly true if you end up creating several applications that have similar functionality. The description should indicate not only which web service the data service calls, but also what kinds of data it provides.

  5. Read the Terms of Service if you have not done so already. Select the Terms of Service checkbox.

  6. Click Next Step. SearchMonkey saves your changes and displays “Step 2: Inputs”.

  7. Leave the Parent Adjunct as "yahoo:index" and the Parent Item (rel) as "<root>". Specify a single Meta (property) of "dc:identifier".

    These settings specify the root-level dc:identifier in the Yahoo! Index. This dc:identifier is the URL of the search result, such as http://www.hp.com. Since Site Explorer provides a web service that returns information about URLs, this dc:identifier should serve as a fine input parameter for our example data service.

  8. Click Next Step. SearchMonkey saves your changes and displays “Step 3: Test Data”.

  9. Specify three test parameters. These values are hardcoded URLs for testing the Yahoo! Site Explorer web service.

    • http://www.ibm.com

    • http://www.apple.com

    • http://www.hp.com

  10. Click Next Step. SearchMonkey saves your changes and displays “Step 4: Endpoint”.

    Note

    If there are any problems with the extraction code, the Preview Pane displays a bulleted list of warnings and errors.

  11. Enter a Webservice Endpoint of

    http://search.yahooapis.com/SiteExplorerService/V1/pageData?
    appid=oDjsEKHV34Gs9ihGt5Yqg8OFeB9f9czKd4xAGvRzUaN54Nw109mOzoa5SNATM.ocxoUN3X_MgQ--
    &query={dc:identifier}&results=4

    Paste the three lines above as a single line, with no spaces or newlines between them. Site Explorer's pageData web service requires three input parameters:

    • appid, representing a Yahoo! developer appID. An appID is a long encrypted string that publicly identifies you as a registered Yahoo! developer. Ordinarily, you would have to acquire a Yahoo! appID before making any Site Explorer web service call. However, for the purposes of this tutorial, we have provided a generic "demo" appID for you. If you already have an appID, you can use it instead.

    • query, representing the URL that you want more information about. Setting this to {dc:identifier} parameterizes the web service call. For each search result or set of test data, SearchMonkey substitutes in the appropriate dc:identifier value when calling the web service.

    • results, representing the number of related URLs to return.

    Note

    All input parameters are URL-encoded.
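
    The parameter substitution described above can be sketched outside SearchMonkey. The following Python is only an illustration of the idea, not SearchMonkey's actual implementation: the function name expand_endpoint and the DEMO_APPID placeholder are hypothetical, and the sketch assumes that each {name} placeholder is replaced by the URL-encoded input value.

```python
from urllib.parse import quote

# Endpoint from step 11 joined into one line. DEMO_APPID is a placeholder
# (an assumption for this sketch); substitute the appID you actually use.
ENDPOINT_TEMPLATE = (
    "http://search.yahooapis.com/SiteExplorerService/V1/pageData?"
    "appid=DEMO_APPID&query={dc:identifier}&results=4"
)


def expand_endpoint(template: str, inputs: dict) -> str:
    """Replace each {name} placeholder with its URL-encoded input value."""
    url = template
    for name, value in inputs.items():
        # safe="" also encodes ':' and '/', so the whole input URL
        # travels as a single query parameter value.
        url = url.replace("{" + name + "}", quote(value, safe=""))
    return url


# The three test parameters from step 9.
for test_url in ["http://www.ibm.com", "http://www.apple.com", "http://www.hp.com"]:
    print(expand_endpoint(ENDPOINT_TEMPLATE, {"dc:identifier": test_url}))
```

    Plain string replacement is used instead of str.format because the placeholder name contains a colon, which str.format would interpret as a format specifier.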

  12. Select an Output Format of Other. Remove all the contents of the textarea and replace them with the following XSLT stylesheet:

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
                    xmlns:h="http://www.w3.org/1999/xhtml"
                    xmlns:y="urn:yahoo:srch" 
                    xsi:schemaLocation="urn:yahoo:srch http://api.search.yahoo.com/SiteExplorerService/V1/PageDataResponse.xsd">
    <xsl:template match="/">  
    <adjunctcontainer xmlns:my="http://example.com/ns/1.0">   
      <adjunct id="smid:{$smid}" version="1.0">   
        <meta property="my:link1"><xsl:value-of select="//y:Result[1]/y:Url"/></meta>
        <meta property="my:result1"><xsl:value-of select="//y:Result[1]/y:Title"/></meta>
        <meta property="my:link2"><xsl:value-of select="//y:Result[2]/y:Url"/></meta>
        <meta property="my:result2"><xsl:value-of select="//y:Result[2]/y:Title"/></meta>
      </adjunct>
    </adjunctcontainer>
    </xsl:template>
    </xsl:stylesheet> 

    The <xsl:template match="/"> element is boilerplate code that means "Start matching templates at the root node."

    The <adjunctcontainer> element is also boilerplate code. It specifies the root element for extracted data.

    The <adjunctcontainer> declares a custom namespace, my, which we will use to define the <meta> properties below. This namespace allows us to use values outside the default SearchMonkey vocabularies.

    The <adjunct> element encases your extracted data. An adjunct may contain zero or more <item> and <meta> elements. You should always set the id attribute to the value "smid:{$smid}", as in the stylesheet above, which causes SearchMonkey to supply a globally unique ID for you.

    The first <meta> element describes some data on the page. A <meta> contains a literal value (actual data returned from the web service). This particular example sets the property attribute to my:link1, indicating that the <meta> represents the first link in the series. As defined in the <adjunctcontainer>, my:link1 belongs to a custom namespace. This example is designed to demonstrate that you can use custom namespaces, but in general you should try to use the default SearchMonkey vocabularies, since these vocabularies have better-known semantics.

    Note that rather than having a parent <item>, these <meta> elements appear directly underneath the <adjunct>. This is a perfectly valid way to provide metadata about the entire page, rather than about a particular item on the page. In this case, we are retrieving "URLs that are related to the page", so it is entirely appropriate to place the <meta> elements as direct children of the <adjunct>.

    The second <meta> element specifies the first link's title. After this, the stylesheet retrieves a second URL/title pair and ends the transform.
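
    To make the four selections in the stylesheet concrete, here is a rough Python equivalent of the extraction logic. It is a sketch under stated assumptions, not part of SearchMonkey: the SAMPLE_RESPONSE document is invented for illustration (only the element names y:Result, y:Url, and y:Title and the urn:yahoo:srch namespace come from the stylesheet), and extract_meta is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by the stylesheet's XPath expressions.
NS = {"y": "urn:yahoo:srch"}

# Hypothetical response: the element names mirror the stylesheet's XPath
# expressions, but the structure and values are invented for this sketch.
SAMPLE_RESPONSE = """<ResultSet xmlns="urn:yahoo:srch">
  <Result><Title>Example One</Title><Url>http://example.com/one</Url></Result>
  <Result><Title>Example Two</Title><Url>http://example.com/two</Url></Result>
  <Result><Title>Example Three</Title><Url>http://example.com/three</Url></Result>
</ResultSet>"""


def extract_meta(response_xml: str) -> dict:
    """Collect the same four properties that the stylesheet emits."""
    root = ET.fromstring(response_xml)
    results = root.findall(".//y:Result", NS)
    meta = {}
    # Only the first two results are used, like Result[1] and Result[2].
    for position, result in enumerate(results[:2], start=1):
        meta[f"my:link{position}"] = result.findtext("y:Url", default="", namespaces=NS)
        meta[f"my:result{position}"] = result.findtext("y:Title", default="", namespaces=NS)
    return meta


for name, value in extract_meta(SAMPLE_RESPONSE).items():
    print(name, "=", value)
```

    Although the endpoint requests four results, only the first two are extracted, which matches the stylesheet's use of Result[1] and Result[2].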

  13. Click Save and Refresh. SearchMonkey refreshes the Preview Pane, displaying the effects of your data service on the first test URL.

    The data appears to be acceptable for both links. Both my:links are URLs, and both my:results appear to be page titles. Finally, both results are URLs that are related to the input URL, http://www.ibm.com. In fact, the "best match" URL is the input URL itself, which ought to be encouraging.

    Click Input and Output to view the module's input and output XML. These links are handy for debugging your data service.

    Step through the other test results and verify that the Preview Pane is displaying the expected output.
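
    The manual checks above (both links are URLs, both titles are present) could also be scripted against the XML shown by the Output link. The sketch below is an assumption-laden illustration: SAMPLE_OUTPUT is an invented document shaped like the stylesheet's output, the id value is made up, and check_adjunct is a hypothetical helper, not a SearchMonkey feature.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

EXPECTED = ("my:link1", "my:result1", "my:link2", "my:result2")

# Hypothetical output in the shape produced by the stylesheet in step 12;
# the id and all values are invented for this sketch.
SAMPLE_OUTPUT = """<adjunctcontainer xmlns:my="http://example.com/ns/1.0">
  <adjunct id="smid:example" version="1.0">
    <meta property="my:link1">http://example.com/one</meta>
    <meta property="my:result1">Example One</meta>
    <meta property="my:link2">http://example.com/two</meta>
    <meta property="my:result2">Example Two</meta>
  </adjunct>
</adjunctcontainer>"""


def check_adjunct(output_xml: str) -> list:
    """Return a list of problems found in the data service output."""
    root = ET.fromstring(output_xml)
    values = {m.get("property"): (m.text or "").strip() for m in root.iter("meta")}
    problems = []
    # Every expected property must be present and non-empty.
    for name in EXPECTED:
        if not values.get(name):
            problems.append(f"{name} is missing or empty")
    # Link properties must additionally look like absolute HTTP(S) URLs.
    for name in ("my:link1", "my:link2"):
        value = values.get(name, "")
        parsed = urlparse(value)
        if value and not (parsed.scheme in ("http", "https") and parsed.netloc):
            problems.append(f"{name} is not an absolute URL")
    return problems


print(check_adjunct(SAMPLE_OUTPUT) or "output looks acceptable")
```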

  14. Click Next Step. SearchMonkey saves your changes and displays “Step 5: Confirmation”.

  15. Congratulations, you are done with the tutorial! You may now click Create a new Presentation Application and start building a presentation application based on this data service. If you have not already done so, take a look at “Presentation Application Screens” or try out the presentation application tutorial. Otherwise, return to the Application Dashboard.