SearchMonkey Guide

Creating a (Page) Custom Data Service

The following tutorial explains how to create an example "Page" style custom data service. In this example, we create a custom data service that extracts hResume microformat data from LinkedIn user profile pages. Before trying this tutorial, you should be familiar with basic SearchMonkey concepts and SearchMonkey's screens for creating custom data services.

As mentioned in “Data Service Types”, the Yahoo! Search Crawler already extracts microformat data... so why go to the effort of creating a custom data service for extracting hResume? Three main reasons:

Once the custom data service is complete, you can continue to “Creating a Presentation Application”, which uses your newly-created data service to enhance search results.

  1. From the main SearchMonkey Applications screen, click Create a new Data Service. SearchMonkey displays “Step 1: Basic Info”.

  2. Enter a Name: "Test LinkedIn Data Service"

  3. Select a Type of Page. For a tutorial that explains how to create a custom data service that makes web service API calls, refer to “Creating a (Web Service) Custom Data Service”.

  4. Enter a Description: "A test data service for LinkedIn. Extracts hResume data directly from profile pages."

    Even if you don't plan to share your data service, the description is still useful for private development. This is particularly true if you end up creating several data services that have similar functions or that trigger on the same URLs. The description should indicate not only which sites the data service triggers on, but also what kinds of data it extracts.

  5. Read the Terms of Service if you have not done so already. Click the Terms of Service checkbox.

  6. Click Next Step. SearchMonkey saves your changes and displays “Step 2: URLs”.

  7. Specify a Trigger URL Pattern of: "*.linkedin.com/in/*,*.linkedin.com/pub/*"

    This pattern matches all results from LinkedIn that fall under the /in and /pub directories. We happen to know that these pages represent individual LinkedIn user profiles — and that they have hResume data for us to extract.
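
    As a rough illustration of how a comma-separated wildcard pattern like this one behaves, the following Python sketch tests URLs against each glob in turn. It is not SearchMonkey's actual matcher: the helper name matches_trigger is hypothetical, and the sketch assumes that the pattern is compared against the host and path of the URL (without the scheme) and that * matches any run of characters.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Trigger URL pattern from step 7: two globs separated by a comma.
TRIGGER_PATTERN = "*.linkedin.com/in/*,*.linkedin.com/pub/*"


def matches_trigger(url, pattern=TRIGGER_PATTERN):
    """Return True if the URL matches any comma-separated glob in pattern.

    Assumption (not from the SearchMonkey docs): the glob is compared
    against host + path, with the scheme ("http://") stripped.
    """
    parts = urlsplit(url)
    target = parts.netloc + parts.path
    return any(fnmatchcase(target, glob.strip()) for glob in pattern.split(","))


# The first URL is a test URL from this tutorial; the second is a
# made-up non-profile URL used only for contrast.
for url in ("http://www.linkedin.com/in/amitkumar",
            "http://www.linkedin.com/company/example"):
    print(url, matches_trigger(url))
```

    Under these assumptions, the profile URL matches the first glob, while the made-up /company/ URL matches neither.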

  8. Specify three test URLs:

    • http://www.linkedin.com/in/amitkumar

    • http://www.linkedin.com/in/kevinhaas

    • http://www.linkedin.com/in/mdubinko

    Alternatively, you can click AUTOFIND URLs to retrieve ten valid, random test URLs. However, by using these specific values, you can confirm that your results match the results in this tutorial.

  9. Click Next Step. SearchMonkey saves your changes and displays “Step 3: Data Extraction”.

  10. Enter the following XSLT as the extraction code:

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="/">  
    <adjunctcontainer>  
    <adjunct id="smid:{$smid}" version="1.0">  
      <item rel="dc:subject"  
        resource="{//div[@class='hresume']//div[@class='image']/img/@src}">
        <type typeof="media:Photo"/>  
      </item>
      <item rel="dc:subject">  
        <type typeof="vcard:VCard">  
            <meta property="vcard:fn">
              <xsl:value-of select="//div[@class='hresume']//span[contains(@class,'fn')]"/>
            </meta>
            <meta property="vcard:title">
              <xsl:value-of select="//div[@class='hresume']//ul[@class='current']/li"/>
            </meta>
        </type>
      </item>
    </adjunct>
    </adjunctcontainer>
    </xsl:template>
    </xsl:stylesheet>

    Boilerplate code — "Start matching templates at the root node."

    Boilerplate code — Specifies the root element for extracted data, <adjunctcontainer>.

    Specifies an <adjunct> element to encase your extracted data. An adjunct may contain zero or more <item> and <meta> elements. You should always set the id attribute to the value "smid:{$smid}", which causes SearchMonkey to supply a globally unique ID for you.

    The optional resource attribute specifies the URI of the resource that represents the item. In this case, the XPath expression sets the resource attribute to the photo's URL. The XPath expression matches the src attribute of an <img> element within a <div class="image"> within a <div class="hresume">.

    Provides a container for the person's "business card" data. An <item> may contain zero or more <item> and <meta> elements.

    Describes some data on the page. A <meta> contains a literal value (actual data extracted from the page). This particular example sets the property attribute to vcard:fn, indicating that the <meta> represents the person's full (formatted) name. For a list of acceptable values for the property attribute, consult the vocabulary specification.

    An <xsl:value-of> element that extracts the data specified by the given XPath expression. In this case, the XPath expression matches a <span> whose class attribute contains fn and that is inside a <div class="hresume">.
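
    To see what the three XPath expressions select, it can help to run equivalent lookups outside SearchMonkey. The following Python sketch does so against a small, hypothetical HTML fragment (not copied from LinkedIn) that is shaped the way the expressions expect. The function name extract_profile and the sample values are assumptions made for illustration. Python's ElementTree supports only a subset of XPath, so contains(@class,'fn') is emulated with a substring test.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the markup that the
# stylesheet's XPath expressions expect. Not real LinkedIn markup.
SAMPLE_HTML = """
<html><body>
  <div class="hresume">
    <div class="image"><img src="http://example.com/photo.jpg" alt=""/></div>
    <span class="fn n">Jane Example</span>
    <ul class="current"><li>Editor at Example Corp</li></ul>
  </div>
</body></html>
"""


def extract_profile(html_text):
    """Mimic the stylesheet's three lookups; return None if no hresume div."""
    root = ET.fromstring(html_text.strip())
    resume = root.find(".//div[@class='hresume']")
    if resume is None:
        return None

    # //div[@class='hresume']//div[@class='image']/img/@src
    img = resume.find(".//div[@class='image']/img")

    # //div[@class='hresume']//span[contains(@class,'fn')]
    # ElementTree has no contains(), so use a substring test instead.
    fn = next((s for s in resume.iter("span") if "fn" in s.get("class", "")), None)

    # //div[@class='hresume']//ul[@class='current']/li (first match only,
    # as xsl:value-of uses the first node of a node-set)
    title = resume.find(".//ul[@class='current']/li")

    def text_of(node):
        # Unlike xsl:value-of, this also trims surrounding whitespace.
        return "".join(node.itertext()).strip() if node is not None else ""

    return {
        "photo": img.get("src", "") if img is not None else "",
        "vcard:fn": text_of(fn),
        "vcard:title": text_of(title),
    }


profile = extract_profile(SAMPLE_HTML)
print(profile)
```

    As in the stylesheet, a lookup that matches nothing yields an empty value rather than an error, so an empty field usually means the expression did not match the page's markup.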

  11. Click Save and Refresh. SearchMonkey refreshes the Preview Pane, displaying the effects of your data service on the first test URL.

    The data appears to be acceptable. The media:Photo resource is a link to an image, and the vcard:fn and vcard:title properties are set to plausible values: a person's full name and a job title, respectively.

    Click Input and Output to view the module's input HTML and output XML. These links are handy for debugging your data service.

    Step through the other test results and verify that the Preview Pane is displaying the expected output.

    Note

    If there are any problems with the extraction code, the Preview Pane displays a bulleted list of warnings and errors.
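
    When checking the XML shown by the Output link in step 11, one thing to look for is a <meta> element with no text, which is what <xsl:value-of> produces when its XPath expression matches nothing. The following Python sketch flags such elements. The output document is hypothetical and only modeled on the structure the stylesheet in step 10 emits; the id value smid:EXAMPLE and the function name empty_properties are placeholders, not values SearchMonkey generates.

```python
import xml.etree.ElementTree as ET

# Hypothetical output XML modeled on the stylesheet's structure.
# "smid:EXAMPLE" is a placeholder, not a real SearchMonkey ID.
SAMPLE_OUTPUT = """
<adjunctcontainer>
  <adjunct id="smid:EXAMPLE" version="1.0">
    <item rel="dc:subject" resource="http://example.com/photo.jpg">
      <type typeof="media:Photo"/>
    </item>
    <item rel="dc:subject">
      <type typeof="vcard:VCard">
        <meta property="vcard:fn">Jane Example</meta>
        <meta property="vcard:title"></meta>
      </type>
    </item>
  </adjunct>
</adjunctcontainer>
"""


def empty_properties(output_xml):
    """Return the property names of <meta> elements that contain no text."""
    root = ET.fromstring(output_xml.strip())
    return [
        meta.get("property")
        for meta in root.iter("meta")
        if not (meta.text or "").strip()
    ]


# Lists the properties whose XPath expression probably matched nothing.
print(empty_properties(SAMPLE_OUTPUT))
```

    In this made-up output, vcard:title is empty, which would suggest that the ul[@class='current'] expression did not match the input HTML for that test URL.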

  12. Click Next Step. SearchMonkey saves your changes and displays “Step 4: Confirmation”.

  13. Congratulations, you are done with the tutorial! You may now click Create a new Presentation Application and continue to “Creating a Presentation Application” in order to build a presentation application based on this data service. Otherwise, return to the Application Dashboard.