Ever wonder how SearchMonkey generates all of this structured data for use in projects like Enhanced Results, Object Facets, Site Facets, BOSS, and ranking? Have you wondered what Yahoo envisions as "a web of concepts", or how SearchMonkey is helping Yahoo! power its next generation of search experiences? Or perhaps you were searching for "rickroll videos" and reached this page by mistake. No matter — here's almost everything you wanted to know about how we get the structured data to power SearchMonkey.
A monkey standing on the shoulders of giants
First is the easy answer: we get the data from our site owners! Yahoo!'s opinion is that nobody understands the content of a website better than the site owner. Through page markup (both RDFa and microformats), structured data feeds, custom data services created with the SearchMonkey developer tool, and XSLT rules written by third-party tools, we'll take whatever data site owners tell us is important about the content on their site. Even if you're not the site owner, you can still submit custom data services to the SearchMonkey Gallery. If approved, your extraction techniques will be applied by Yahoo Search for others to build applications upon.
What does SearchMonkey feed data look like?
To create a feed, first create an Atom feed of the URLs you want to annotate. Then within each Atom <entry>, add the metadata for the page. For example, assume you have restaurant listings already shown in Yahoo! Search, and you want to show the average review you've collected from your users. One of the entries in your feed would look like:
<y:adjunct version="1.0" name="local" xmlns:y="https://search.yahoo.com/datarss/">
  <y:item rel="dc:subject">
    <y:type typeof="vcard:VCard commerce:Business">
      <y:item rel="vcard:url" resource="http://local.yahoo.com/info-21328305-yahoo-incorporated-sunnyvale"/>
      <y:item rel="review:hasReview">
        <y:type typeof="review:Review">
          <y:meta property="review:rating" datatype="xsd:decimal">4.5</y:meta>
          <y:meta property="review:totalRatings" datatype="xsd:integer">32</y:meta>
        </y:type>
      </y:item>
    </y:type>
  </y:item>
</y:adjunct>
Which results in:
It's that easy! Create a couple million of these annotations (hopefully not by hand!), and you are ready to submit them to Site Explorer. When Yahoo processes the feed, your results will be enhanced and appear with 4.5 stars and "32 reviews". Users will see your search result, realize that reviews are available on the page, and go to your page to read the reviews of this restaurant. That's not just theory: our lab monkeys have proved repeatedly that enhanced results actually increase traffic to sites.
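Since nobody should write these annotations by hand, here is a minimal sketch of generating one in Python with the standard library. It only mirrors the element and attribute names from the example entry above. The helper names (`y`, `build_adjunct`) and the `listings` data are made up for illustration, and wrapping each annotation in its Atom entry is left out.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the example entry above.
Y_NS = "https://search.yahoo.com/datarss/"
ET.register_namespace("y", Y_NS)


def y(tag):
    """Return a tag name qualified with the y: namespace (hypothetical helper)."""
    return "{%s}%s" % (Y_NS, tag)


def build_adjunct(url, rating, total_ratings):
    """Build one <y:adjunct> element shaped like the example entry."""
    adjunct = ET.Element(y("adjunct"), {"version": "1.0", "name": "local"})
    subject = ET.SubElement(adjunct, y("item"), {"rel": "dc:subject"})
    business = ET.SubElement(
        subject, y("type"), {"typeof": "vcard:VCard commerce:Business"}
    )
    ET.SubElement(business, y("item"), {"rel": "vcard:url", "resource": url})
    has_review = ET.SubElement(business, y("item"), {"rel": "review:hasReview"})
    review = ET.SubElement(has_review, y("type"), {"typeof": "review:Review"})
    rating_el = ET.SubElement(
        review, y("meta"),
        {"property": "review:rating", "datatype": "xsd:decimal"},
    )
    rating_el.text = str(rating)
    total_el = ET.SubElement(
        review, y("meta"),
        {"property": "review:totalRatings", "datatype": "xsd:integer"},
    )
    total_el.text = str(total_ratings)
    return adjunct


# Hypothetical source data: (page URL, average rating, number of ratings).
listings = [
    ("http://local.yahoo.com/info-21328305-yahoo-incorporated-sunnyvale", 4.5, 32),
]

adjuncts = [build_adjunct(*row) for row in listings]
xml_text = ET.tostring(adjuncts[0], encoding="unicode")
print(xml_text)
```

In practice the `listings` rows would come from wherever you store your reviews, and each generated element would go inside the matching Atom entry before you submit the feed.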
How do I do this with XSLT?
Not everybody likes building a feed. If you're XML-savvy, you can use XSLT to tell Yahoo! how to extract the components of your page that you want to appear in your enhanced results. As a quick example, we'll create a rule for Yahoo! Local. With your favorite XPath plugin for Firefox (mine is XPather), you can quickly identify the XPath expressions that Yahoo! should follow to extract data. For example, the following xpath extracts "32 reviews":
Insert that xpath into Yahoo-provided boilerplate XSLT, and we'll run that to extract the structured data from your site. If this approach seems pretty fragile because it assumes the HTML structure will not change... you're right. Some sites seldom change their HTML, and other sites are really good about naming the important nodes with unique ids. For those sites, XSLT rules are fairly resilient to changes in the page structure. For other sites that have high variation in their page structure or that are constantly making sweeping changes, XSLT extraction usually doesn't end very well. Fortunately, there's yet another approach.
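To make the fragility point concrete, here is a small Python sketch. It does not use the Yahoo-provided XSLT boilerplate. It uses the limited XPath subset in Python's `xml.etree.ElementTree`, and both page fragments, the `review-count` id, and the two paths are invented for illustration rather than taken from Yahoo! Local's real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; not Yahoo! Local's real markup.
PAGE_V1 = """<html><body>
<div><h1>Yahoo! Incorporated</h1>
<div><span id="review-count">32 reviews</span></div></div>
</body></html>"""

# The same page after a redesign that adds a banner above the listing.
PAGE_V2 = """<html><body>
<div>Sign up for our newsletter</div>
<div><h1>Yahoo! Incorporated</h1>
<div><span id="review-count">32 reviews</span></div></div>
</body></html>"""

# A path that depends on element positions, and one anchored on a unique id.
POSITIONAL = "./body/div[1]/div[1]/span[1]"
BY_ID = ".//span[@id='review-count']"


def extract(page, path):
    """Return the text of the first node matching path, or None if nothing matches."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None else None


for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(name, "positional:", extract(page, POSITIONAL))
    print(name, "by id:", extract(page, BY_ID))
```

The positional path finds the review count on the first version and nothing on the second, while the id-based path keeps working on both. That is the difference between sites where XSLT rules hold up and sites where they don't.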
But wait! There's more!
Next is the clever way to do things: it turns out that a lot of the web has already been structured by the semantic web. Yahoo! already has billions of documents that have been annotated with either RDF or microformats. This semantic information provides some solid hints about the structured data on the page, which we happily use for SearchMonkey.
Annotating for the semantic web essentially involves tweaking your page markup to provide additional meaning about the elements on that page. The example below uses RDFa, but you can also provide the same information to Yahoo Search using microformats.
<div typeof="vcard:VCard commerce:Business"
     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
     xmlns:vcard="http://www.w3.org/2006/vcard/ns#"
     xmlns:commerce="https://search.yahoo.com/SearchMonkey/commerce/"
     xmlns:review="http://purl.org/stuff/rev#">
  <h1><span property="vcard:fn">Yahoo! Incorporated</span></h1>
  <span>
    <span>User Rating: <span property="review:rating">4.5</span> out of <span property="review:maxRating">5</span> stars (<span property="review:totalRatings">32</span> reviews). </span>
  </span>
</div>
When Yahoo sees the additional RDFa markup in the HTML, we extract the structured data complete with semantics. Your page is no longer a bag of unstructured words — Yahoo! now has information that helps us understand your page better.
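For a rough feel of what "extracting the structured data" means, here is a toy Python sketch that collects the text of every element carrying a `property` attribute. It is not a real RDFa processor and it is not how Yahoo's pipeline works: it ignores `typeof`, prefix resolution, and subject nesting. The class name `PropertyCollector` is invented for this example, and the markup is a trimmed copy of the RDFa sample above.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect the text inside every element that has a `property` attribute."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._stack = []  # one (tag, property-or-None) pair per open element

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, dict(attrs).get("property")))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Credit text to the innermost open element only (a simplification).
        if self._stack and self._stack[-1][1]:
            prop = self._stack[-1][1]
            self.values[prop] = self.values.get(prop, "") + data


# Trimmed copy of the RDFa sample (namespace declarations omitted).
MARKUP = """<div typeof="vcard:VCard commerce:Business">
<h1><span property="vcard:fn">Yahoo! Incorporated</span></h1>
<span>User Rating: <span property="review:rating">4.5</span> out of
<span property="review:maxRating">5</span> stars
(<span property="review:totalRatings">32</span> reviews).</span>
</div>"""

collector = PropertyCollector()
collector.feed(MARKUP)
collector.close()
for prop, value in collector.values.items():
    print(prop, "=", value)
```

Even this crude pass turns the page into named values (a business name, a rating, a maximum rating, a count) instead of a run of words, which is the point of adding the markup.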
But what about the rest of the web?
Finally, the answer from the big brains: magic. A lot of research has gone into "web data mining", both inside Yahoo and at academic and corporate research institutions worldwide. Deep inside Yahoo! Research, tribes of monkeys are busy creating new technologies for Yahoo to extract objects out of web pages. I wish I could provide a few examples of this magic, but there is no way that I can summarize decades of named entity recognition research in five lines or less. For a good place to get started, see publications on TREC.
Any other methods that Yahoo! uses to extract structured data from web content?
Pizza — lots of pizza.
Are you a site owner interested in how you can build your site to make it easier for us to extract the structured data? We would love your participation.
Interested in being a monkey? We're also hiring!
Senior Engineering Manager, Yahoo! SearchMonkey