Leveraging the Data Web

The public Internet contains vast quantities of information, but that information is poorly structured and difficult for machines to reliably extract. Although it is possible to extract interesting structured data from HTML — and SearchMonkey provides you with tools to do this — HTML by itself has a limited vocabulary, and a page's HTML structure mostly depends on the whims of the page designer.

Over the years, various parties have proposed different ways to help authors express semantic relationships on the web, including:

  • Microformats — an approach for using existing HTML syntax to embed certain kinds of structured data
  • RDF — a language for representing metadata about web resources

Yahoo! has adapted the latest version of the Yahoo! Web Crawler to extract embedded RDF and microformat data. SearchMonkey then makes this data freely available to all third-party developers. For example, if a developer's application triggers on a URL that contains hCalendar data, the developer can include this data in the presentation application with the click of a button. Once the data is in the feed, the application can display the event's date and start time just as easily as any other item of information on that page.

Figure 3.2. Mapping the Data Service to the Application


It may take several days for Yahoo! to crawl a page, and there is no guaranteed time frame.

The rest of this section provides some very minimal background about microformats and RDF. As a site owner, understanding the basics should give you an idea of the kinds of applications that can be built using your embedded data.

Microformats

Microformats are a set of specifications for embedding semantic data in (X)HTML using standard markup. HTML already has a limited ability to express semantic meaning; for example, the <em> element indicates emphasis, while a <cite> element indicates a citation. Microformats attempt to increase HTML's expressive power by building new data formats on top of standard HTML elements and attributes. Each microformat defines combinations of <abbr>, class, rel, and other standard HTML markup to specify structured information about people, events, and other items of interest.

Example 3.2. Contact Information in hCard Format

Many web pages attempt to provide addresses, albeit in an unstructured format. HTML does offer the <address> element, but this element only indicates that the content is an address and says nothing about its structure. A software program might be able to "guess" the structure using regular expressions or other techniques. However, the problem grows more challenging as you consider more types of contact information, international address formats, and other complications.

By contrast, an address marked up with hCard is much less ambiguous:

This hCard representation of the address is not only easy for users to read in a browser, but also easy for software programs to extract and reuse.
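To make the extraction step concrete, here is a minimal Python sketch of how a program might pull hCard properties out of a page. It is an illustration only, not the Yahoo! Web Crawler's implementation. The markup in HCARD_HTML, including the name, organization, and phone number, is hypothetical, and the parser handles just a small subset of hCard properties in well-formed markup where every element is closed.

```python
from html.parser import HTMLParser

# Hypothetical hCard markup; the contact details are invented for illustration.
HCARD_HTML = """
<div class="vcard">
  <span class="fn">Joe Smith</span>
  <span class="org">Example Widgets</span>
  <span class="tel">+1-555-0100</span>
  <a class="url" href="http://joesmith.org">home page</a>
</div>
"""

# Small subset of hCard property names that this sketch looks for.
HCARD_PROPERTIES = {"fn", "org", "tel", "email", "title", "url"}


class HCardParser(HTMLParser):
    """Collects hCard properties from class attributes (simplified)."""

    def __init__(self):
        super().__init__()
        self.card = {}
        self._open = []  # one entry per open element: a property name or None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        matched = sorted(classes & HCARD_PROPERTIES)
        prop = matched[0] if matched else None
        # For links, the property value is the href, not the link text.
        if prop == "url" and attrs.get("href"):
            self.card["url"] = attrs["href"]
            prop = None
        self._open.append(prop)

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        # Record text only when the innermost open element carries a property.
        if text and self._open and self._open[-1]:
            self.card.setdefault(self._open[-1], text)


def extract_hcard(html):
    """Return a dict of the hCard properties found in the given HTML."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.card


card = extract_hcard(HCARD_HTML)
print(card)
```

Because each value is labeled by its class name, the program never has to guess which piece of text is the name and which is the phone number. A production parser would also need to handle nested properties, unclosed elements such as <img>, and the other rules in the hCard specification.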


What does all this mean for SearchMonkey developers and site owners? If the Yahoo! Web Crawler encounters any page with any of the following embedded microformats:

  • hAtom — represents a subset of the Atom syndication format
  • hCalendar — represents calendar dates and events, using a representation of the iCalendar standard
  • hCard — represents people, companies, organizations, and places, using a representation of the vCard standard
  • hReview — represents reviews of products, services, businesses, and events
  • XFN — represents human relationships using hyperlinks

then the Yahoo! Web Crawler extracts the data, indexes it, and provides that information to you. This opens the door to all sorts of interesting applications, particularly if you scope your application to sites that you know supply particular microformats. For example, if a reviews website is designed so that each review embeds hReview data, you can easily write a SearchMonkey application to directly display ratings and review descriptions.

For more information about microformats, refer to the Microformats website.

Supported Microformats

The following microformats are supported by SearchMonkey.

  • hcard: fn, url, title, tel, email, org, photo
  • hevent: dtstart, dtend, duration, location, summary, url
  • hreview: best, dtreviewed, fn, rating, summary, url, worst
  • hatom: author, bookmark, entry-title, hentry, hfeed, updated
  • xfn 1.1: acquaintance, child, colleague, contact, co-resident, co-worker, crush, date, friend, kin, me, met, muse, neighbor, parent, sibling, spouse, sweetheart
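
As an illustration of how the XFN values listed above appear in practice, the following Python sketch collects them from the rel attributes of hyperlinks. It is a rough sketch under stated assumptions, not SearchMonkey's own extraction logic: the BLOGROLL_HTML markup and its example.com URLs are hypothetical, and rel values outside the XFN 1.1 list (such as nofollow) are simply ignored.

```python
from html.parser import HTMLParser

# The XFN 1.1 values listed in this section.
XFN_VALUES = {
    "acquaintance", "child", "colleague", "contact", "co-resident",
    "co-worker", "crush", "date", "friend", "kin", "me", "met", "muse",
    "neighbor", "parent", "sibling", "spouse", "sweetheart",
}

# Hypothetical blogroll markup; the URLs and names are invented.
BLOGROLL_HTML = """
<ul>
  <li><a href="http://example.com/alice" rel="friend met colleague">Alice</a></li>
  <li><a href="http://example.com/bob" rel="nofollow">Bob</a></li>
  <li><a href="http://example.org/joe-elsewhere" rel="me">My other site</a></li>
</ul>
"""


class XFNParser(HTMLParser):
    """Collects (href, XFN values) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.relationships = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rels = set((attrs.get("rel") or "").lower().split())
        xfn = sorted(rels & XFN_VALUES)  # keep only recognized XFN values
        if xfn and attrs.get("href"):
            self.relationships.append((attrs["href"], xfn))


parser = XFNParser()
parser.feed(BLOGROLL_HTML)
parser.close()
relationships = parser.relationships

for href, values in relationships:
    print(href, values)
```

The link to Bob carries no XFN value, so it produces no relationship, while the other two links yield one entry each.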

RDF

The Resource Description Framework (RDF) is a standard language for storing metadata about web resources. For example, a web page about the Southern Laughing Tree Frog provides data (information about the frog) along with metadata (information about the page, such as the author's name or a copyright statement). Individual elements on the page can also have metadata. For example, a photo of the frog could provide metadata about the photo's image format or the timestamp when the photo was taken.

The goal of RDF is to provide a common framework for specifying this metadata. Fundamentally, you can decompose any RDF data into one or more triples:

  • The subject, the thing that the metadata is describing. Subjects are nodes, depicted graphically as ovals. Subjects may be represented by a URI. If no such URI exists, the subject is a blank node.
  • The object, the value of the metadata. An object can be a node (represented by a URI), a blank node (represented by no URI), or a literal, depicted graphically as a rectangle.

    If the object is a node or blank node, it can in turn serve as the subject for other items of metadata. In other words, RDF permits you to have metadata about metadata. If the object is a literal, it represents a literal value such as "Joe Smith" or 1/137.035. Literal values are endpoints; you can't chain more RDF triples off of them.

  • The predicate, the relationship between the subject and the object. Predicates are always URIs, and are depicted graphically as arcs. Different predicate URIs serve as a common vocabulary for describing relationships. Predicates are also known as properties.

    As an example, the FOAF specification is a collection of useful URIs for describing people and their connections. Within the FOAF specification, the foaf:depiction URI indicates that the object (an image) is a depiction of the subject. Just as you can use <h1> to inform web browsers, "This is a top-level heading," you can use foaf:depiction to inform RDF-aware crawlers, "This subject has a depiction, which is this image object over here."

How would you represent RDF information in general? Figure 3.3, “RDF Triples for Joe's Home Page” illustrates an RDF graph for a simple home page.

Figure 3.3. RDF Triples for Joe's Home Page


This graph decomposes into the following individual RDF triples:

  1. The website http://joesmith.org has a foaf:maker (an author). The author doesn't have a unique URI, so the object is a blank node.
  2. The website http://joesmith.org also has a dc:title (a title). The title is a literal value, "Joe's Home Page".
  3. The author of the website http://joesmith.org has a foaf:name (a name). The name is a literal value, "Joe Smith".
  4. The author of the website http://joesmith.org also has a foaf:depiction (an image depiction). The image has a unique URI resource, http://joesmith.org/images/jsmith.png.
  5. The depiction of the author of the website http://joesmith.org has dc:rights (copyright or licensing information). The copyright is a literal value, "Creative Commons Attribution 3.0 Unported".
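
The five triples above can be sketched as plain data. The following Python example is a minimal illustration, not an RDF library: the Triple and Literal types, the "_:author" label for the blank node, and the follow helper are all hypothetical names introduced here, and the prefixed names (foaf:maker, dc:title, and so on) are kept as short strings rather than expanded to full predicate URIs.

```python
from typing import NamedTuple


class Literal(str):
    """A literal value. Literals are endpoints and never act as subjects."""


class Triple(NamedTuple):
    subject: str    # a URI or a blank node label
    predicate: str  # a predicate, written here as a prefixed name
    obj: str        # a URI, a blank node label, or a Literal


HOME = "http://joesmith.org"
IMAGE = "http://joesmith.org/images/jsmith.png"
AUTHOR = "_:author"  # hypothetical local label; a blank node has no URI

TRIPLES = [
    Triple(HOME, "foaf:maker", AUTHOR),
    Triple(HOME, "dc:title", Literal("Joe's Home Page")),
    Triple(AUTHOR, "foaf:name", Literal("Joe Smith")),
    Triple(AUTHOR, "foaf:depiction", IMAGE),
    Triple(IMAGE, "dc:rights",
           Literal("Creative Commons Attribution 3.0 Unported")),
]


def objects(triples, subject, predicate):
    """Return every object linked to subject by predicate."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]


def follow(triples, start, *predicates):
    """Follow a chain of predicates from start; literals end the chain."""
    nodes = [start]
    for predicate in predicates:
        nodes = [obj
                 for node in nodes
                 if not isinstance(node, Literal)
                 for obj in objects(triples, node, predicate)]
    return nodes


# Which image depicts the maker of the home page?
depictions = follow(TRIPLES, HOME, "foaf:maker", "foaf:depiction")
print(depictions)
```

Chaining foaf:maker and then foaf:depiction works because the object of the first triple (the blank node for the author) serves as the subject of the fourth. Trying to chain a predicate off the dc:title literal returns nothing, which mirrors the rule that literals are endpoints.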

These sorts of relationships are vital for any software attempting to extract semantic information. A web browser or search engine crawler can easily extract objects and subjects from web pages, but without the predicates to indicate semantic relationships, the meaning is lost. Is the resource http://joesmith.org/images/jsmith.png a picture of Joe Smith? Joe Smith's sister-in-law? Joe Smith's classic space Legos collection? A Southern Laughing Tree Frog?

Of course, you can try to use complicated algorithms to "guess" the nature of the photo. However, RDF is a way to provide that kind of information in the first place. The RDF in Figure 3.3, “RDF Triples for Joe's Home Page” asserts that the image is in fact a depiction (foaf:depiction) of the page author (foaf:maker). If you are designing software to, say, display photographs and other information about people, this kind of information is invaluable.

As with microformats, if the Yahoo! Web Crawler encounters any page with embedded RDF, the crawler extracts the data, indexes it, and provides that information to you, the SearchMonkey developer. But how do users actually embed RDF in their pages? Yahoo! supports two approaches: RDFa and eRDF.

RDFa relies on using attributes from the <link> and <meta> elements to embed RDF data in XHTML. Using these attributes in this manner is not strictly valid in vanilla XHTML 1.1, but XHTML's modular nature enables you to extend the XHTML DTD to include RDFa (and other) semantics. As a SearchMonkey developer, it is not strictly necessary to understand how to produce RDFa, but Example 3.3, “Joe's Home Page with RDFa Markup” illustrates how this might be done in practice:

Example 3.3. Joe's Home Page with RDFa Markup


For more information about RDFa, refer to the RDFa specification or the RDFa Primer.

eRDF is an alternative approach for embedding RDF information. Unlike RDFa, it is possible to embed eRDF in XHTML or HTML. However, eRDF supports a more limited subset of RDF than RDFa. Example 3.4, “Joe's Home Page with eRDF Markup” illustrates how Joe might use eRDF to describe his homepage:

Example 3.4. Joe's Home Page with eRDF Markup


For more information about eRDF, refer to the eRDF specification or the eRDF wiki.
