The public Internet contains vast quantities of information, but that information is poorly structured and difficult for machines to reliably extract. Although it is possible to extract interesting structured data from HTML — and SearchMonkey provides you with tools to do this — HTML by itself has a limited vocabulary, and a page's HTML structure mostly depends on the whims of the page designer.
Over the years, various parties have proposed different ways to help authors express semantic relationships on the web, including:
Yahoo! has adapted the latest version of the Yahoo! Web Crawler to extract embedded RDF and microformat data. SearchMonkey then makes this data freely available to all third-party developers. For example, if the developer's application triggers on a URL that contains hCalendar data, they can include this data in their presentation application with the click of a button. Once the data is in their feed, the application can display the event's date and start time just as easily as any other item of information on that page.
It may take several days for Yahoo! to crawl a page, and there is no guaranteed time frame.
The rest of this section provides some very minimal background about microformats and RDF. As a site owner, understanding the basics should give you an idea of the kinds of applications that can be built using your embedded data.
Microformats are a set of specifications for embedding semantic
data in (X)HTML using standard markup. HTML already has a limited
ability to express semantic meaning; for example, the
<em> element indicates emphasis, while a
<cite> element indicates a citation. Microformats
attempt to increase HTML's expressive power by building new data formats
on top of standard HTML elements and attributes. Each microformat
defines combinations of <abbr>, class,
rel, and other standard HTML markup to specify structured
information about people, events, and other items of interest.
Example 3.2. Contact Information in hCard Format
Many web pages attempt to provide addresses, albeit in an
unstructured format. HTML does offer the <address>
element:
but this element only indicates that the content is an address, and says nothing about its structure. A software program might be able to "guess" the structure using regular expressions or other techniques. However, the problem grows more and more challenging as you consider more types of contact information, international address formats, and other complications.
By contrast, an address marked up with hCard is much less ambiguous:
This hCard representation of the address is not only easy for users to read in a browser, but also easy for software programs to extract and reuse.
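To make the "easy for software programs to extract" point concrete, here is a minimal Python sketch. The markup in `SAMPLE` is a hypothetical hCard invented for illustration (it is not the example from this guide), and `HCardParser` and `extract_hcard` are illustrative names, not part of SearchMonkey. The sketch only handles flat properties whose value is plain text; a real hCard parser must also deal with nested elements, `<abbr>` titles, and other rules from the hCard specification.

```python
from html.parser import HTMLParser

# Hypothetical hCard markup, invented for this illustration only.
SAMPLE = """
<div class="vcard">
  <span class="fn">Joe Smith</span>
  <span class="org">Example Corp</span>
  <span class="tel">+1-555-0100</span>
</div>
"""

# A small subset of hCard property names (see the hCard specification).
HCARD_PROPERTIES = {"fn", "org", "tel", "title"}


class HCardParser(HTMLParser):
    """Collects text from elements whose class names are hCard properties.

    Simplification: assumes each property element contains only text,
    with no nested child elements.
    """

    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # property whose text is being collected

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in HCARD_PROPERTIES:
                self._current = name
                self.card[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.card[self._current] += data

    def handle_endtag(self, tag):
        # Any closing tag ends the current property (flat markup only).
        self._current = None


def extract_hcard(html):
    """Return a dict mapping hCard property names to their text values."""
    parser = HCardParser()
    parser.feed(html)
    return {name: value.strip() for name, value in parser.card.items()}


print(extract_hcard(SAMPLE))
```

No regular expressions or address-format guessing are needed: the class names state which piece of text is the name, the organization, and the telephone number.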
What does all this mean for SearchMonkey developers and site owners? If the Yahoo! Web Crawler encounters any page with any of the following embedded microformats:
then the Yahoo! Web Crawler extracts the data, indexes it, and provides that information to you. This opens the door to all sorts of interesting applications, particularly if you scope your application to sites that you know supply particular microformats. For example, if a reviews website is designed so that each review embeds hReview data, you can easily write a SearchMonkey application to directly display ratings and review descriptions.
For more information about microformats, refer to the Microformats website.
The following microformats are supported by SearchMonkey.
hCard
| fn | url | title |
| tel | org | |
| photo | | |
hCalendar
| dtstart | dtend | duration |
| location | summary | url |
hReview
| best | dtreviewed | fn |
| rating | summary | url |
| worst | | |
hAtom
| author | bookmark | entry-title |
| hentry | hfeed | updated |
XFN
| acquaintance | child | colleague |
| contact | co-resident | co-worker |
| crush | date | friend |
| kin | me | met |
| muse | neighbor | parent |
| sibling | spouse | sweetheart |
The Resource Description Framework (RDF) is a standard language for storing metadata about web resources. For example, a web page about the Southern Laughing Tree Frog provides data (information about the frog) along with metadata (information about the page, such as the author's name or a copyright statement). Individual elements on the page can also have metadata. For example, a photo of the frog could provide metadata about the photo's image format or the timestamp when the photo was taken.
The goal of RDF is to provide a common framework for specifying this metadata. Fundamentally, you can decompose any RDF data into one or more triples, each consisting of a subject, a predicate, and an object.
If the object is a node or blank node, it can in turn serve as
the subject for other items of metadata. In
other words, RDF permits you to have metadata about metadata. If the
object is a literal, it represents a literal value such as
"Joe Smith" or 1/137.035. Literal values
are endpoints; you can't chain more RDF triples off of them.
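The distinction between nodes, blank nodes, and literals can be sketched as a small data model. This is an illustrative Python sketch only; the class names (`Node`, `BlankNode`, `Literal`, `Triple`) and the `can_chain` helper are assumptions made for this example and do not correspond to any SearchMonkey or RDF library API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Node:
    uri: str  # a resource identified by a URI


@dataclass(frozen=True)
class BlankNode:
    label: str  # a resource with no URI of its own; the label is local


@dataclass(frozen=True)
class Literal:
    value: str  # a plain value such as "Joe Smith"


Subject = Union[Node, BlankNode]
Object = Union[Node, BlankNode, Literal]


@dataclass(frozen=True)
class Triple:
    subject: Subject
    predicate: str
    obj: Object

    def __post_init__(self):
        # Literals are endpoints: they may only appear as objects.
        if isinstance(self.subject, Literal):
            raise TypeError("a literal cannot be the subject of a triple")


def can_chain(triple):
    """True if the triple's object may be the subject of further triples."""
    return not isinstance(triple.obj, Literal)


# Usage with hypothetical data: metadata about metadata.
page = Node("http://example.org/")
author = BlankNode("author")
t1 = Triple(page, "foaf:maker", author)                 # object is a blank node
t2 = Triple(author, "foaf:name", Literal("Joe Smith"))  # object is a literal
print(can_chain(t1), can_chain(t2))  # prints: True False
```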
As an example, the FOAF specification is a
collection of useful URIs for describing people and their
connections. Within the FOAF specification, the foaf:depiction
URI indicates that the object (an image) is a depiction of the
subject. Just as you can use <h1> to inform web
browsers, "This is a top-level heading," you can use
foaf:depiction to inform RDF-aware crawlers, "This
subject has a depiction, which is this image object over
here."
How would you represent RDF information in general? Figure 3.3, “RDF Triples for Joe's Home Page” illustrates an RDF graph for a simple home page.
To decompose this graph into individual RDF triples:
http://joesmith.org has a foaf:maker (an author). The author doesn't have a unique URI, so the object is a blank node.
http://joesmith.org also has a dc:title (a title). The title is a literal value, "Joe's Home Page".
The author (the blank node) has a foaf:name (a name). The name is a literal value, "Joe Smith".
The author also has a foaf:depiction (an image depiction). The image has a unique URI resource, http://joesmith.org/images/jsmith.png.
http://joesmith.org has dc:rights (copyright or licensing information). The copyright is a literal value, "Creative Commons Attribution 3.0 Unported".
These sorts of relationships are vital for any software attempting
to extract semantic information. A web browser or search engine crawler
can easily extract objects and subjects from web pages, but without the
predicates to indicate semantic relationships, the meaning is lost. Is
the resource http://joesmith.org/images/jsmith.png a picture
of Joe Smith? Joe Smith's sister-in-law? Joe Smith's classic space Legos
collection? A Southern Laughing Tree Frog?
Of course, you can try to use complicated algorithms to "guess"
the nature of the photo. However, RDF is a way to provide that kind of
information in the first place. The RDF in Figure 3.3, “RDF Triples for Joe's Home Page” asserts that the image is in fact a
depiction (foaf:depiction) of the page author
(foaf:maker). If you are designing software to, say,
display photographs and other information about people, this kind of
information is invaluable.
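As a rough illustration of how such software might use these predicates, the following Python sketch stores the triples for Joe's home page as plain tuples and follows foaf:maker to find the author's name and depiction. It assumes that the name and the depiction are attached to the author's blank node, which is what "a depiction of the page author" implies. The `_:author` label and the function names are hypothetical, chosen only for this example.

```python
PAGE = "http://joesmith.org"
AUTHOR = "_:author"  # blank node: an arbitrary local label, not a real URI

# (subject, predicate, object) triples for Joe's home page.
TRIPLES = [
    (PAGE, "foaf:maker", AUTHOR),
    (PAGE, "dc:title", "Joe's Home Page"),
    (AUTHOR, "foaf:name", "Joe Smith"),
    (AUTHOR, "foaf:depiction", "http://joesmith.org/images/jsmith.png"),
    (PAGE, "dc:rights", "Creative Commons Attribution 3.0 Unported"),
]


def objects(triples, subject, predicate):
    """Return every object linked to subject by predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def maker_depictions(triples, page):
    """Return (name, image URI) pairs for each maker of the given page."""
    results = []
    for maker in objects(triples, page, "foaf:maker"):
        # Fall back to None when the maker has no foaf:name.
        names = objects(triples, maker, "foaf:name") or [None]
        for name in names:
            for image in objects(triples, maker, "foaf:depiction"):
                results.append((name, image))
    return results


print(maker_depictions(TRIPLES, PAGE))
# prints: [('Joe Smith', 'http://joesmith.org/images/jsmith.png')]
```

Because the predicate is explicit, the program never has to guess whether the image shows Joe, a relative, or a tree frog.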
As with microformats, if the Yahoo! Web Crawler encounters any page with embedded RDF, the crawler extracts the data, indexes it, and provides that information to you, the SearchMonkey developer. But how do users actually embed RDF in their pages? Yahoo! supports two approaches: RDFa and eRDF.
RDFa relies on using attributes from the <link>
and <meta> elements to embed RDF data in XHTML. Using
these attributes in this manner is not strictly valid in vanilla XHTML
1.1, but XHTML's modular nature enables you to extend the XHTML DTD to
include RDFa (and other) semantics. As a SearchMonkey developer, it is
not strictly necessary to understand how to produce RDFa, but Example 3.3, “Joe's Home Page with RDFa Markup” illustrates how this might be done in
practice:
For more information about RDFa, refer to the RDFa specification or the RDFa Primer.
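To give a feel for how attribute-based markup maps onto triples, here is a deliberately simplified Python sketch. It reads only the `about`, `property`, `rel`, and `href` attributes from a hypothetical fragment invented for this illustration; it is not a conforming RDFa processor and says nothing about how the Yahoo! Web Crawler works. Namespace resolution, subject scoping, blank nodes, and chaining are all omitted, and `RDFaSketch` and `extract_triples` are illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical RDFa-style fragment, invented for this illustration only.
SAMPLE = """
<div about="http://joesmith.org">
  <h1 property="dc:title">Joe's Home Page</h1>
  <p property="dc:rights">Creative Commons Attribution 3.0 Unported</p>
  <a rel="cc:license" href="http://creativecommons.org/licenses/by/3.0/">license</a>
</div>
"""


class RDFaSketch(HTMLParser):
    """Collects (subject, predicate, object) tuples from a few RDFa attributes.

    Simplifications: the most recent `about` value stays the subject for the
    rest of the document, and `property` elements are assumed to hold only text.
    """

    def __init__(self):
        super().__init__()
        self.triples = []
        self._subject = None
        self._pending = None  # predicate waiting for its text content
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "about" in a:
            self._subject = a["about"]
        if "rel" in a and "href" in a:
            # The object is another resource, named by its URI.
            self.triples.append((self._subject, a["rel"], a["href"]))
        if "property" in a:
            # The object is a literal: the element's text content.
            self._pending = a["property"]
            self._text = []

    def handle_data(self, data):
        if self._pending is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._pending is not None:
            literal = "".join(self._text).strip()
            self.triples.append((self._subject, self._pending, literal))
            self._pending = None


def extract_triples(html):
    parser = RDFaSketch()
    parser.feed(html)
    return parser.triples


for triple in extract_triples(SAMPLE):
    print(triple)
```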
eRDF is an alternative approach for embedding RDF information. Unlike RDFa, it is possible to embed eRDF in XHTML or HTML. However, eRDF supports a more limited subset of RDF than RDFa. Example 3.4, “Joe's Home Page with eRDF Markup” illustrates how Joe might use eRDF to describe his homepage:
For more information about eRDF, refer to the eRDF specification or the eRDF wiki.