Yahoo! Search is now extracting RDFa data across the World Wide Web and making this information available to the public via SearchMonkey. RDFa is an open standard for embedding structured data directly in HTML. Along with our previous support for eRDF and a number of popular microformats, SearchMonkey now supports a wide variety of popular semantic technologies.
What is structured data, and why is structured data good for search? Traditional search engines crawl the web and extract what metadata they can: the page title, an autogenerated summary, the file size, the MIME-type, the last-updated date, and so on. However, this sort of analysis pales in comparison to what a human being can do simply by glancing at the page. A human can look at the words "Joe's Home Page" and infer, "ah, this page probably belongs to Joe," or look at an image and infer, "ah, that's probably a picture of Joe, the owner of the page." That's easy enough for humans... but what if the search engine could pick out this info and display it directly in the search result?
RDFa embeds structured data in XHTML by means of attributes. These attributes are not valid in HTML 4, but the W3C has provided an XHTML DTD to validate against. The following example illustrates a simple home page marked up with RDFa; the RDFa additions are the xmlns:dc and xmlns:foaf namespace declarations and the property, rel, and resource attributes:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:foaf="http://xmlns.com/foaf/0.1/"
      lang="en" xml:lang="en-US">
  <head>
    <title>The Amazing Home Page of Joe Smith</title>
  </head>
  <body>
    <h1 property="dc:title">Joe's Home Page</h1>
    <div rel="foaf:maker">
      <h2 property="foaf:name">Joe Smith</h2>
      <div rel="foaf:depiction" resource="http://joesmith.org/images/jsmith.png">
        <img src="/images/jsmith.png" alt="Smiling headshot of Joe" />
        <p property="dc:rights">Creative Commons Attribution 3.0 Unported</p>
      </div>
    </div>
  </body>
</html>
In this page, the designer has explicitly stated that the image is a "depiction of the person who made the web page." Adding this information as RDFa can potentially benefit many applications. In the case of Yahoo!, we've designed our search index to extract and store this information.
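To make the idea of extraction concrete, here is a minimal sketch in Python that lists the RDFa annotations in the example page. It is an illustration only, not how the Yahoo! index works: the names `RDFaAttributeScanner`, `scan_rdfa`, and `SAMPLE` are hypothetical, and the scanner simply reports which elements carry `property`, `rel`, or `resource` attributes. It does not resolve namespace prefixes, track subjects, or produce RDF triples the way a conforming RDFa processor would.

```python
from html.parser import HTMLParser

# Attributes this sketch looks for; RDFa defines others (about, typeof, ...).
RDFA_ATTRS = ("property", "rel", "resource")


class RDFaAttributeScanner(HTMLParser):
    """Lists elements that carry RDFa attributes (hypothetical helper).

    Not a conforming RDFa parser: no prefix resolution, no subject
    tracking, no triples.
    """

    def __init__(self):
        super().__init__()
        self.records = []  # one dict per annotated element, in document order
        self._open = []    # stack of (tag, record-or-None) for open elements

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        record = None
        if any(name in attr_map for name in RDFA_ATTRS):
            record = {"tag": tag, "text": ""}
            for name in RDFA_ATTRS:
                if name in attr_map:
                    record[name] = attr_map[name]
            self.records.append(record)
        self._open.append((tag, record))

    def handle_endtag(self, tag):
        open_tags = [t for t, _ in self._open]
        if tag in open_tags:
            # Close the nearest matching element and any unclosed children.
            index = len(open_tags) - 1 - open_tags[::-1].index(tag)
            del self._open[index:]

    def handle_data(self, data):
        # Credit text to the nearest enclosing element that has a property.
        for _, record in reversed(self._open):
            if record is not None and "property" in record:
                record["text"] += data
                break


def scan_rdfa(markup):
    """Return the annotated elements found in markup, with tidied text."""
    scanner = RDFaAttributeScanner()
    scanner.feed(markup)
    scanner.close()
    for record in scanner.records:
        record["text"] = " ".join(record["text"].split())
    return scanner.records


# The body of the example page above.
SAMPLE = """
<h1 property="dc:title">Joe's Home Page</h1>
<div rel="foaf:maker">
  <h2 property="foaf:name">Joe Smith</h2>
  <div rel="foaf:depiction" resource="http://joesmith.org/images/jsmith.png">
    <img src="/images/jsmith.png" alt="Smiling headshot of Joe" />
    <p property="dc:rights">Creative Commons Attribution 3.0 Unported</p>
  </div>
</div>
"""

if __name__ == "__main__":
    for record in scan_rdfa(SAMPLE):
        print(record)
```

Running the script prints one record per annotated element: the `dc:title` heading, the `foaf:maker` container, the `foaf:name` heading, the `foaf:depiction` container with its `resource` URL, and the `dc:rights` paragraph. For real pages, a full RDFa processor such as the RDFa Distiller listed below is the better tool.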
RDFa support has already enabled some interesting new SearchMonkey applications. For instance, Creative Commons has recently started to deploy RDFa across the web in the form of copyright and licensing information. Every time a Creative Commons user selects a CC license, the generated HTML badge contains RDFa markup indicating the nature of the license. The Creative Commons Infobar uses this data to trigger selectively on pages that declare their license using structured markup.
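As a rough illustration of triggering on a declared license, the Python sketch below checks whether a page links to a Creative Commons license. Everything here is an assumption made for the example: the badge markup in `BADGE` is a simplified guess at what a generated badge contains (an anchor with `rel="license"`), and `find_license_urls` and `should_trigger` are hypothetical helpers, not the Infobar's actual logic.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LicenseLinkFinder(HTMLParser):
    """Collects href values of <a>/<link> elements whose rel includes 'license'."""

    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "license nofollow".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if tag in ("a", "link") and "license" in rel_tokens and href:
            self.licenses.append(href)


def find_license_urls(markup):
    """Return every license URL declared in markup, in document order."""
    finder = LicenseLinkFinder()
    finder.feed(markup)
    finder.close()
    return finder.licenses


def should_trigger(markup):
    """Hypothetical rule: trigger if any declared license lives on creativecommons.org."""
    cc_hosts = ("creativecommons.org", "www.creativecommons.org")
    return any(urlparse(url).hostname in cc_hosts for url in find_license_urls(markup))


# Simplified, assumed badge markup; a real generated badge may differ.
BADGE = """
<p>This work is licensed under a
  <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">
    Creative Commons Attribution 3.0 Unported License</a>.
</p>
"""

if __name__ == "__main__":
    print(find_license_urls(BADGE))
    print(should_trigger(BADGE))
```

A page with no `rel="license"` link yields an empty list, so the hypothetical rule would not fire for it.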
To get started with RDFa:
- Learn the basics with the W3C RDFa Primer
- Dive into the details with the full RDFa Specification
- Join the community at the RDFa homepage
- Test your structured markup skills with the RDFa Distiller
- Filter on RDFa in Yahoo! searches with the
- Start displaying RDFa to millions of users with the SearchMonkey developer tool
Yahoo! SearchMonkey Team