SearchMonkey Guide

Leveraging the Data Web

The public Internet contains vast quantities of information, but that information is poorly structured and difficult for machines to reliably extract. Although it is possible to extract interesting structured data from HTML — and SearchMonkey provides you with tools to do this — HTML by itself has a limited vocabulary, and a page's HTML structure mostly depends on the whims of the page designer.

Over the years, various parties have proposed different ways to help authors express semantic relationships on the web, including microformats and RDF.

Yahoo! has adapted the latest version of the Yahoo! Web Crawler to extract embedded RDF and microformat data. SearchMonkey then makes this data freely available to all third-party developers. For example, if a developer's application triggers on a URL that contains hCalendar data, the developer can include this data in the presentation application with the click of a button. Once the data is in the feed, the application can display the event's date and start time just as easily as any other item of information on that page.

Figure 3.2. Mapping the Data Service to the Application

It may take several days for Yahoo! to crawl a page, and there is no guaranteed time frame.

The rest of this section provides some very minimal background about microformats and RDF. As a site owner, understanding the basics should give you an idea of the kinds of applications that can be built using your embedded data.

Microformats

Microformats are a set of specifications for embedding semantic data in (X)HTML using standard markup. HTML already has a limited ability to express semantic meaning; for example, the <em> element indicates emphasis, while a <cite> element indicates a citation. Microformats attempt to increase HTML's expressive power by building new data formats on top of standard HTML elements and attributes. Each microformat defines combinations of <abbr>, class, rel, and other standard HTML markup to specify structured information about people, events, and other items of interest.
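As a rough illustration of how a microformat layers data on standard markup, the sketch below reads an hCalendar start time from an <abbr> element, where the human-readable text sits in the element body and the machine-readable value sits in the title attribute. The event markup, the date, and the DtstartParser class are all invented for this example; this is not how the Yahoo! Web Crawler itself is implemented.

```python
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical event markup, invented for illustration only.
EVENT_HTML = """
<div class="vevent">
 <span class="summary">Developer Meetup</span>:
 <abbr class="dtstart" title="2008-05-15T19:00:00">May 15, 7pm</abbr>
</div>
"""


class DtstartParser(HTMLParser):
    """Collect the machine-readable start time from <abbr class="dtstart">."""

    def __init__(self):
        super().__init__()
        self.dtstart = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # The title attribute carries the ISO 8601 value; the element
        # text ("May 15, 7pm") is only meant for human readers.
        if tag == "abbr" and "dtstart" in classes and attrs.get("title"):
            self.dtstart = datetime.fromisoformat(attrs["title"])


parser = DtstartParser()
parser.feed(EVENT_HTML)
start = parser.dtstart  # a datetime object, or None if no dtstart was found
```

Because the value comes from an attribute rather than from the visible text, the page designer can word the date however they like without affecting what software extracts.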

Example 3.2. Contact Information in hCard Format

Many web pages attempt to provide addresses, albeit in an unstructured format. HTML does offer the <address> element:

<address>
Joe Smith
123 Murphy Avenue, Sunnyvale, California 94086
(408) 555-1234
</address>

but this element only indicates that the content is an address, and says nothing about its structure. A software program might be able to "guess" the structure using regular expressions or other techniques. However, the problem grows more and more challenging as you consider more types of contact information, international address formats, and other complications.
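To make the guessing approach concrete, here is a minimal sketch of a regular expression that happens to fit the address above. The pattern and the guess_address helper are hypothetical, and they assume a US-style "street, city, state ZIP" layout, which is exactly the kind of assumption that breaks on other formats.

```python
import re

# Hypothetical pattern: assumes "street, city, state ZIP" in US style.
US_ADDRESS = re.compile(
    r"(?P<street>\d+ [^,]+), "
    r"(?P<city>[^,]+), "
    r"(?P<region>[A-Za-z ]+) "
    r"(?P<postal>\d{5})"
)


def guess_address(text):
    """Return a dict of address parts, or None if the guess fails."""
    match = US_ADDRESS.search(text)
    return match.groupdict() if match else None


# The sample address from the <address> element fits the pattern.
us = guess_address("123 Murphy Avenue, Sunnyvale, California 94086")

# A UK-style address does not: no state, and the postcode is not five digits.
uk = guess_address("10 Downing Street, London SW1A 2AA")
```

Here us holds the four named parts, while uk is None. Covering more formats means adding more patterns, each with its own failure modes.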

By contrast, an address marked up with hCard is much less ambiguous:

<div id="hcard-Joe-Smith" class="vcard">
 <span class="fn">Joe Smith</span>
 <div class="adr">
  <div class="street-address">123 Murphy Avenue</div>
  <span class="locality">Sunnyvale</span>, 
  <span class="region">California</span> 
  <span class="postal-code">94086</span>
 </div>
 <div class="tel">(408) 555-1234</div>
</div>

This hCard representation of the address is not only easy for users to read in a browser, but also easy for software programs to extract and reuse.
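As a rough sketch of what "easy to extract" can look like, the following uses only the Python standard library to pull the labeled fields out of the hCard above. The HCardParser class and the HCARD_FIELDS set are hypothetical names for this example; the sketch handles only the class names used in this sample and assumes no field is nested inside another field. A full hCard parser has to cover many more properties and rules.

```python
from html.parser import HTMLParser

# Only the hCard class names used in the sample above.
HCARD_FIELDS = {"fn", "street-address", "locality", "region", "postal-code", "tel"}

# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

HCARD_HTML = """
<div id="hcard-Joe-Smith" class="vcard">
 <span class="fn">Joe Smith</span>
 <div class="adr">
  <div class="street-address">123 Murphy Avenue</div>
  <span class="locality">Sunnyvale</span>,
  <span class="region">California</span>
  <span class="postal-code">94086</span>
 </div>
 <div class="tel">(408) 555-1234</div>
</div>
"""


class HCardParser(HTMLParser):
    """Map each known hCard class name to the text inside its element."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._stack = []  # one entry per open element: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((c for c in classes if c in HCARD_FIELDS), None)
        self._stack.append(field)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        field = self._stack[-1] if self._stack else None
        if field and data.strip():
            self.fields[field] = self.fields.get(field, "") + data.strip()


parser = HCardParser()
parser.feed(HCARD_HTML)
card = parser.fields
```

Unlike a guessed pattern, this code never inspects the shape of the address text. It relies only on the class names, so the same approach keeps working when the address format changes.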

What does all this mean for SearchMonkey developers and site owners? If the Yahoo! Web Crawler encounters any page with any of the following embedded microformats:

  • hAtom — represents a subset of the Atom syndication format

  • hCalendar — represents calendar dates and events, using a representation of the iCalendar standard

  • hCard — represents people, companies, organizations, and places, using a representation of the vCard standard

  • hReview — represents reviews of products, services, businesses, and events

  • XFN — represents human relationships using hyperlinks

then the Yahoo! Web Crawler extracts the data, indexes it, and provides that information to you. This opens the door to all sorts of interesting applications, particularly if you scope your application to sites that you know supply particular microformats. For example, if a reviews website is designed so that each review embeds hReview data, you can easily write a SearchMonkey application to directly display ratings and review descriptions.

For more information about microformats, refer to the Microformats website.