
A SearchMonkey application is built from one or more data services, which provide structured information to display in Yahoo! search results, and a presentation application, which defines how Yahoo! Search should display the data.
Figure 1.2, “Structure of a SearchMonkey Application”, illustrates the different types of data services that a presentation application can use.
The common XML language for providing data from any source to SearchMonkey applications is called DataRSS, a specification for embedding URL metadata in standard Atom feeds. Once data is available in DataRSS, SearchMonkey developers use PHP to map that data into a presentation application, which reconfigures and enhances individual search results.
SearchMonkey supports the following types of data services:
Yahoo! Index — Core search data gathered by the Yahoo! Search Crawler, also known as Yahoo! Slurp. This includes the page's title, summary, file size, MIME type, and other kinds of information that search engine spiders have gathered for years. Standard Yahoo! search results are constructed using the Yahoo! Index.
In SearchMonkey, the Yahoo! Index is specified by the ID
yahoo:index. While other data services might provide more
customized data, the Yahoo! Index provides core technical information
about each of the billions of web pages that Yahoo! crawls. For more
information, refer to “Yahoo! Index Data”.
Semantic Web Data — Microformats and RDF
data gathered by the Yahoo! Search Crawler. This includes
extracted eRDF, RDFa, and microformats such as hCard and
XFN. If a page embeds semantic web data, the Yahoo!
Search Crawler can automatically extract this information and provide
it to SearchMonkey developers. Yahoo! caches this information on the
server side, so retrieving this information is relatively fast. As
with the Yahoo! Index, extracted semantic web data refreshes whenever
the page gets crawled, so there is always a delay before Yahoo! can
pick up any changes.
In SearchMonkey, semantic web data services are specified by IDs
such as com.yahoo.uf.hcard for hCard data and
com.yahoo.rdf.erdf for eRDF data. Microformats and RDF
enable site owners to convey the meaning of their page content to
software. For example, you can use hCard markup to indicate
that a particular string is actually a street address, something that is easy
for human beings to recognize but more difficult for software. For
more information, refer to “Leveraging the Data Web” and Chapter 3, Site Owner Guide.
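To make the street address example concrete, here is a minimal sketch of how software can pick out hCard properties once a page is marked up. It uses only Python's standard library and is not how the Yahoo! Search Crawler is implemented. The sample markup, the HCardParser class, and the extract_hcard helper are hypothetical names introduced for illustration. The class names (vcard, fn, adr, street-address, locality) are standard hCard properties.

```python
from html.parser import HTMLParser

# Hypothetical page fragment marked up with hCard class names.
SAMPLE_HTML = """
<div class="vcard">
  <span class="fn org">Example Cafe</span>
  <div class="adr">
    <span class="street-address">123 Main Street</span>,
    <span class="locality">Springfield</span>
  </div>
</div>
"""


class HCardParser(HTMLParser):
    """Collects the text of elements whose class names an hCard property.

    Simplification: assumes well-formed markup without void elements
    such as <br>, which would desynchronize the element stack.
    """

    PROPERTIES = {"fn", "street-address", "locality"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        # One entry per open element: the matched property name, or None.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matched = next((c for c in classes if c in self.PROPERTIES), None)
        self._stack.append(matched)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute text only to the innermost open element.
        if self._stack and self._stack[-1]:
            name = self._stack[-1]
            self.fields[name] = self.fields.get(name, "") + data.strip()


def extract_hcard(html):
    """Return a dict of the hCard properties found in the given HTML."""
    parser = HCardParser()
    parser.feed(html)
    return parser.fields


if __name__ == "__main__":
    print(extract_hcard(SAMPLE_HTML))
```

Without the class="street-address" marker, the same program would have to guess which string is an address. With it, a lookup by property name is enough.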
Data Feed — A feed of native DataRSS provided by a third-party site, such as Amazon or LinkedIn. This includes any supplementary information about a URL that a site owner chooses to provide. Once the site owner creates the data feed, he or she submits it to Yahoo! Site Explorer so that Yahoo! can index the feed and provide it to SearchMonkey developers. As with semantic web data, Yahoo! caches this information on the server side to improve SearchMonkey application performance.
Note: Feed acceptance by Yahoo! Search is subject to quality and capacity constraints.
In SearchMonkey, data feeds are specified by IDs such as
sm031-LinkedIn. Data feeds are an excellent way to
provide rich information about your site, particularly if you can't
currently afford to redesign your site to include embedded
microformats or RDF. For an overview of DataRSS, refer to Chapter 3, Site Owner Guide, and for full, schema-level documentation, refer
to Appendix A, DataRSS Specification. See also the ysop-siteowners
Yahoo! Group.
Custom Data Service — Any data extracted from an (X)HTML page or web service and represented within SearchMonkey as DataRSS. There are two types of custom data services:
Page — for any data that happens to be trapped in a blob of HTML. If you understand the structure of the underlying web page, you can create a custom data service to extract this data and provide it in DataRSS format to your presentation application.
Web Service — for any data returned by a web service. You can create a custom data service to call a web service API and transform the results into DataRSS. If the web service returns OpenSearch XML, SearchMonkey can automatically transform those results to DataRSS for you.
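As a rough illustration of the web service case, the sketch below reads a small OpenSearch-style RSS response with Python's standard library and pulls out the result count and the per-result titles and links. It does not reproduce SearchMonkey's own OpenSearch-to-DataRSS transformation. It only shows the kind of structured data such a response carries. The sample response, the example.com URLs, and the parse_opensearch_rss function are hypothetical. The namespace URI is the one defined by OpenSearch 1.1.

```python
import xml.etree.ElementTree as ET

# Namespace used by OpenSearch 1.1 response elements.
OPENSEARCH_NS = {"os": "http://a9.com/-/spec/opensearch/1.1/"}

# Hypothetical response from a web service that returns OpenSearch RSS.
SAMPLE_RESPONSE = """<rss version="2.0"
     xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <title>Example results</title>
    <opensearch:totalResults>2</opensearch:totalResults>
    <item>
      <title>First result</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second result</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""


def parse_opensearch_rss(xml_text):
    """Return (total_results, items) from an OpenSearch RSS response."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    total = channel.findtext(
        "os:totalResults", default="0", namespaces=OPENSEARCH_NS
    )
    items = [
        {"title": item.findtext("title"), "url": item.findtext("link")}
        for item in channel.findall("item")
    ]
    return int(total), items


if __name__ == "__main__":
    total, items = parse_opensearch_rss(SAMPLE_RESPONSE)
    print(total, items)
```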
You create custom data services within SearchMonkey itself. Custom data services enable you to extract information from nearly any page on the web, limited only by your imagination and your ability to write extraction code. The disadvantage of custom data services is that they are slow: their results remain uncached until a user runs a search query that triggers the data service.
In SearchMonkey, custom data services are specified by
autogenerated IDs such as smid:aaWFb. Although custom
data services can be slow, you can use them in a two-stage approach.
First, you can create a custom data service for your pages. If that
proves successful, you can mark up your pages with microformats or
RDF, or create a full-fledged data feed.
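The example IDs in this section can be summarized in a small lookup helper. The sketch below is an assumption-heavy illustration: the matching rules are inferred only from the sample IDs shown above (yahoo:index, com.yahoo.uf.hcard, com.yahoo.rdf.erdf, sm031-LinkedIn, smid:aaWFb) and are not a documented naming scheme. The classify_service_id function is hypothetical and not part of SearchMonkey.

```python
import re


def classify_service_id(service_id):
    """Guess the data service type from an ID.

    The rules are inferred from the example IDs in this section only;
    real SearchMonkey IDs may follow other patterns.
    """
    if service_id == "yahoo:index":
        return "Yahoo! Index"
    if service_id.startswith(("com.yahoo.uf.", "com.yahoo.rdf.")):
        return "Semantic Web Data"
    # Check "smid:" before the data feed pattern, since both begin with "sm".
    if service_id.startswith("smid:"):
        return "Custom Data Service"
    # Assumed pattern: "sm", digits, a hyphen, then a name (as in sm031-LinkedIn).
    if re.match(r"sm\d+-", service_id):
        return "Data Feed"
    return "unknown"


if __name__ == "__main__":
    for sid in ("yahoo:index", "com.yahoo.uf.hcard", "com.yahoo.rdf.erdf",
                "sm031-LinkedIn", "smid:aaWFb"):
        print(sid, "->", classify_service_id(sid))
```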