SearchMonkey Guide

Understanding DataRSS

DataRSS is a specification for conveying structured data for URLs, in a SearchMonkey data service or using a conventional feed format. Each URL in a DataRSS feed has one or more adjuncts. The adjunct is the fundamental unit of organization in DataRSS-consuming SearchMonkey applications. Meaning "something alongside", the adjunct element represents the metadata that goes alongside an actual resource, such as a product listing, or a product review. The adjunct groups the related metadata from a particular source, such as hcard data about the page's owner, or technical data about a photo on the page.

Each adjunct contains one or more <item>s and <meta>s. An item represents some object or concept in the real world, while a meta represents a property of a particular item. The Web page that the adjunct accompanies can also have meta properties of its own, with no intervening item elements, as shown in the upcoming example. Items may contain <meta>s and other items, while <meta>s contain only literal values.

For example, here is a section from an Atom feed:

<atom:entry>
  <atom:id>http://the.url/in/question</atom:id>
  ...
  <y:adjunct version="1.0" name="com.yahoo.test">
    <y:meta property="tagspace:tags">tag 1 tag2 tag3 tag4</y:meta>
    <y:item rel="dc:subject" resource="http://photosite.com/img.jpg">
      <y:type typeof="media:Photo">
        <y:meta property="dc:creator">The Nameless One</y:meta>
      </y:type>
    </y:item>
  </y:adjunct>
</atom:entry>

Figure 3.1. Triples Diagram


DataRSS is an XML format designed to deliver a wide array of structured data. As the common data layer for all SearchMonkey applications, DataRSS enables you to distribute your structured data to millions of people. The trick is to represent that data as DataRSS in the first place. Yahoo! already provides a great deal of information gathered by the Yahoo! Search Crawler as DataRSS, as do a number of third parties who have already set up DataRSS feeds. For other data sources, you can use SearchMonkey to construct custom data services that transform the source data into valid DataRSS.

SearchMonkey adds, removes, and updates metadata in terms of entire adjuncts, and the system will never break them apart or join them together. Because each adjunct has a unique identifier assigned by the system, there is always a way to refer to a particular adjunct. Therefore, adjuncts can be updated or replaced as a unit as the underlying pages change.

In addition, since each adjunct serves as a container for the metadata and item definitions within, everything that is "said" in the metadata of a particular adjunct is attributable to a particular source. This enables different people and groups to say their own things about any resource.

The ability of site owners to separate metadata into adjuncts gives them flexibility in how their metadata is assembled and represented. Developers making use of the data will be able to "subscribe" to different adjuncts containing the data needed by their application.

For complete details on DataRSS, see the Yahoo! DataRSS Specification.

DataRSS Elements and Attributes

SearchMonkey applications can use data that site owners submit to the Yahoo! index. Data is submitted as DataRSS feeds through the Site Explorer "Submit a Feed" process.

[Note] Note

When using the Site Explorer submit process, the user must be authenticated on a site in order to submit a SearchMonkey feed. Yahoo! Site Explorer validates SearchMonkey feeds automatically when they are submitted.

DataRSS has four elements that are relevant to SearchMonkey site owners:

<adjunct>

A container of metadata associated with a URL. Within the SearchMonkey developer tool, each data service provides a single <adjunct> with a unique ID.

Generally, the adjunct name takes the form com.yahoo.source.type.value, where source and type are defined by the data architect and value is left to the adjunct owner. This convention enforces consistency across all feed contributors.

Yahoo! currently uses three adjunct types. They are:

  • com.yahoo.page.uf - this adjunct supports microformats which Yahoo! extracts when crawling your site. An example of this type of adjunct is com.yahoo.page.uf.hcard

  • com.yahoo.page.rdf - this adjunct supports RDF which Yahoo! extracts when crawling your site. An example of this type of adjunct is com.yahoo.page.rdf.erdf

  • com.yahoo.feeds.searchmonkey - this adjunct supports SearchMonkey publisher feeds submitted through Site Explorer.

Attributes:

  • name — The adjunct's name. Choose a name that describes your metadata, such as "com.website.products."

  • updated — (Optional) The last updated timestamp of this adjunct as an ISO 8601 date-time stamp. If individual entries don't have a last-updated timestamp, the overall feed must have one, and all entries will be given the same timestamp.

  • version — A numeric version string ("1.0") that indicates the version in use for this particular adjunct. If the adjunct format changes substantially, you should increment this number.
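As a rough illustration of the three attributes above, the following Python sketch builds an <adjunct> element with the standard library. The namespace URI bound to the y prefix and the make_adjunct helper are assumptions made for this example, not part of the guide; check the DataRSS specification for the authoritative namespace.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Assumed namespace URI for the "y" prefix; verify against the DataRSS spec.
Y_NS = "http://search.yahoo.com/datarss/"
ET.register_namespace("y", Y_NS)


def make_adjunct(name, version="1.0", updated=None):
    """Hypothetical helper: return a <y:adjunct> with the attributes above."""
    adjunct = ET.Element(f"{{{Y_NS}}}adjunct", {"name": name, "version": version})
    if updated is not None:
        # "updated" is optional and is written as an ISO 8601 date-time stamp.
        adjunct.set("updated", updated.isoformat())
    return adjunct


adjunct = make_adjunct(
    "com.website.products",
    updated=datetime(2008, 5, 15, 12, 0, tzinfo=timezone.utc),
)
print(ET.tostring(adjunct, encoding="unicode"))
```

Passing a timezone-aware datetime keeps the offset in the serialized stamp, which avoids ambiguity about which clock the timestamp refers to.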

<meta>

A specific metadata assertion for the parent <adjunct> or <item>. The <meta> element contains a literal value specifying the value of the assertion. For example, a listing for a camera has properties that include the title and the list price, taken from the dc and product vocabularies, respectively.

<y:adjunct name="com.website.products" version="1.0">
  <y:item rel="dc:subject">
    <y:type typeof="product:Product">
        <y:meta property="dc:title">Canon PowerShot SD800 IS Digital ELPH Digital Camera</y:meta>
        <y:meta property="product:listPrice" datatype="currency:USD">260.00</y:meta>
        ...
    </y:type>
  </y:item>
</y:adjunct>

In RDF parlance, a <meta> element and its property attribute represent a literal object and a predicate. The <meta> element's value is always a literal. A <meta> element's value should never be the URL for a resource; use <item rel="rel" resource="resourceURI"> instead.

Attributes:

  • property — A CURIE (or list of CURIEs) specifying the properties which take the value inside the element. For example, property="vcard:bday" indicates that the metadata is a bday (birthday) of the contact, as defined by the vcard vocabulary. In RDFa, this corresponds to an element's property attribute. For some of the standard properties, see the SearchMonkey vocabularies.

  • datatype — (Optional) A CURIE specifying the datatype of a metadata value. Most properties are strings, but you can use datatype to specify a more restrictive type, such as currency:USD. For a list of possible datatype values, refer to the SearchMonkey Site Owner Guide. In RDFa, this corresponds to an element's datatype attribute.
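Because property and datatype values are CURIEs, code that consumes DataRSS often needs to separate the vocabulary prefix from the term. The following Python sketch shows one way to do that. The split_curies helper is hypothetical, and it assumes that a list of CURIEs is separated by whitespace, as the rel attribute is.

```python
def split_curies(value):
    """Hypothetical helper: split an attribute value into (prefix, reference) pairs."""
    pairs = []
    for curie in value.split():
        # A CURIE has the form prefix:reference, e.g. product:listPrice.
        prefix, sep, reference = curie.partition(":")
        if not sep:
            raise ValueError(f"not a CURIE: {curie!r}")
        pairs.append((prefix, reference))
    return pairs


print(split_curies("product:listPrice"))
print(split_curies("dc:creator vcard:fn"))
```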

<item>

A physical item, concrete concept, or task described by the feed, with a rel attribute describing the relationship of this object to the current resource and an optional resource attribute pointing to the URL that represents this item. For example, an image on a page is an <item>, with a rel indicating that the item is a photo and a resource pointing to the image file. Within the <item> can be more <item> elements or metadata assertions.

<y:adjunct id="smid:{$smid}" version="1.0">
  <y:item rel="dc:subject" resource="http://photosite.com/img.jpg">
    <y:type typeof="media:Photo">
      <y:item rel="review:hasReview">
        <y:type typeof="review:Review">
          <y:meta property="dc:creator">Joe Smith</y:meta>
        </y:type>
      </y:item>
    </y:type>
  </y:item>
</y:adjunct>

In RDF parlance, each <item> element establishes a triple between the parent item (or adjunct) and another object, possibly a "blank node", and sets the new object as the current resource.
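To make the triple model concrete, the following Python sketch walks an adjunct and prints one (subject, predicate, object) tuple per assertion. It is an illustration only: it uses the un-namespaced element names, the id value is a placeholder, the triples function is hypothetical, and mapping <type> to rdf:type follows RDFa practice rather than anything stated in this guide.

```python
import itertools
import xml.etree.ElementTree as ET

# Placeholder adjunct in un-namespaced form; the id value is made up.
SAMPLE = """
<adjunct id="smid:example" version="1.0">
  <meta property="tagspace:tags">tag1 tag2</meta>
  <item rel="dc:subject" resource="http://photosite.com/img.jpg">
    <type typeof="media:Photo">
      <meta property="dc:creator">The Nameless One</meta>
      <item rel="review:hasReview">
        <type typeof="review:Review">
          <meta property="dc:creator">Joe Smith</meta>
        </type>
      </item>
    </type>
  </item>
</adjunct>
"""


def triples(adjunct, page_url):
    """Yield (subject, predicate, object) tuples for one adjunct."""
    blank_ids = itertools.count(1)

    def walk(element, subject):
        for child in element:
            if child.tag == "meta":
                # <meta>: predicate from property, object is always a literal.
                for prop in child.get("property", "").split():
                    yield (subject, prop, (child.text or "").strip())
            elif child.tag == "item":
                # <item>: a new object; a blank node when resource is absent.
                obj = child.get("resource") or f"_:b{next(blank_ids)}"
                for rel in child.get("rel", "").split():
                    yield (subject, rel, obj)
                # The new object becomes the current resource for its children.
                yield from walk(child, obj)
            elif child.tag == "type":
                # <type>: types the enclosing element (assumed to be rdf:type).
                for cls in child.get("typeof", "").split():
                    yield (subject, "rdf:type", cls)
                yield from walk(child, subject)

    yield from walk(adjunct, page_url)


for triple in triples(ET.fromstring(SAMPLE), "http://the.url/in/question"):
    print(triple)
```

The review item has no resource attribute, so the sketch gives it a generated blank-node label, and the nested dc:creator assertion attaches to that label rather than to the photo.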

[Note] Note

The <item> element is completely optional. If you have only simple assertions to make about the entire page (and not items within the page), you can embed <meta> elements directly inside the <adjunct>.

Attributes:

  • rel — A CURIE (or space-separated list of CURIEs) specifying the relationship of this object to the current resource, using one or more properties from a vocabulary. For example, rel="rel:hasReview" indicates that the item is a review of the photo (the enclosing item). In RDFa, this corresponds to an element's rel attribute.

  • resource — (Optional) A URL specifying the web resource that represents this item. For example, an item that is a video should have a resource pointing to the actual video file location. If the item does not have a corresponding web resource, you can omit resource. In RDFa, this corresponds to an element's resource attribute.

[Important] Important

<item> elements have rel attributes and <meta> elements have property attributes. When dealing with DataRSS, take care not to confuse the two.
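A small check can catch this mix-up before a feed is submitted. The Python sketch below is one possible approach, not a tool provided by SearchMonkey: the attribute_problems function is hypothetical, it works on un-namespaced element names, and it treats rel as required on <item>, which is an assumption drawn from the description above.

```python
import xml.etree.ElementTree as ET


def attribute_problems(adjunct):
    """Hypothetical lint: report rel/property mix-ups in one adjunct."""
    problems = []
    for element in adjunct.iter():
        if element.tag == "item":
            if "rel" not in element.attrib:
                problems.append("<item> is missing rel")
            if "property" in element.attrib:
                problems.append("<item> uses property; did you mean rel?")
        elif element.tag == "meta":
            if "property" not in element.attrib:
                problems.append("<meta> is missing property")
            if "rel" in element.attrib:
                problems.append("<meta> uses rel; did you mean property?")
    return problems


# The two attributes are swapped here on purpose.
BAD = '<adjunct><item property="dc:subject"><meta rel="dc:title">x</meta></item></adjunct>'
GOOD = '<adjunct><item rel="dc:subject"><meta property="dc:title">x</meta></item></adjunct>'

for problem in attribute_problems(ET.fromstring(BAD)):
    print(problem)
print(attribute_problems(ET.fromstring(GOOD)))
```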

<type>

The type element provides the type(s) of the enclosing element.

Attributes:

  • typeof — A CURIE (or a space-separated list of CURIEs) describing the type(s) of the enclosing element. Types should be classes chosen from the vocabularies.

Example 3.1, “Example DataRSS” illustrates a short example DataRSS feed.

Example 3.1. Example DataRSS

<atom:entry>
  <atom:id>http://the.url/in/question</atom:id>
  <y:adjunct name="com.website.products" version="1.0">
    <y:item rel="rel:Product">
      <y:meta property="product:listPrice" datatype="currency:USD">12.99</y:meta>
      <y:meta property="product:shippingCost" datatype="currency:USD">0</y:meta>
      <y:meta property="product:shippingWeight" datatype="units:g">500</y:meta>
      <y:item rel="rel:Review"
              resource="http://www.onlinestore.com/reviews/12345/browse"/>
    </y:item>
  </y:adjunct>
</atom:entry>

DataRSS in SearchMonkey Data Services

The full DataRSS specification is included as an appendix to this guide. If you read the specification, you will notice some differences between DataRSS as used in feeds and DataRSS as produced by XSLT in the SearchMonkey Developer Tool. In DataRSS output from a SearchMonkey data service:

  • DataRSS output does not contain namespaced elements and attributes. Within the context of a data service, this extra level of disambiguation is not needed.

  • DataRSS output uses a wrapper element called <adjunctcontainer> instead of being a payload inside Atom entry elements.

  • <adjunct> elements have an id attribute which is always populated by the system. The name attribute is not allowed.

These differences result from design decisions made to simplify coding for SearchMonkey developers. Site owners find it useful that DataRSS piggybacks off of Atom, since they can leverage tools and knowledge they have about Atom to craft valid DataRSS feeds. However, the SearchMonkey developer tool is designed specifically around manipulating <adjunct>s, <item>s, and <meta>s. Therefore, site owners don't need to inform the SearchMonkey developer tool about the namespacing for SearchMonkey-specific elements — this is already understood.
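The three differences listed above can be sketched in Python by rewriting a feed-style adjunct into the data-service form. This is only an illustration of the shape of the output: in practice the system assigns the id, so the value passed here is a placeholder, and both the namespace URI and the to_data_service_form function are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "y" prefix; verify against the DataRSS spec.
Y_NS = "http://search.yahoo.com/datarss/"

FEED_ADJUNCT = f"""
<y:adjunct xmlns:y="{Y_NS}" name="com.website.products" version="1.0">
  <y:item rel="rel:Product">
    <y:meta property="product:listPrice" datatype="currency:USD">12.99</y:meta>
  </y:item>
</y:adjunct>
"""


def to_data_service_form(adjunct, adjunct_id):
    """Hypothetical helper: reshape a feed adjunct into data-service form."""
    for element in adjunct.iter():
        # 1. No namespaces: drop the "{uri}" that ElementTree prepends to tags.
        element.tag = element.tag.rpartition("}")[2]
    # 3. The id attribute replaces name, which is not allowed here.
    adjunct.attrib.pop("name", None)
    adjunct.set("id", adjunct_id)
    # 2. The wrapper is <adjunctcontainer> rather than an Atom entry.
    container = ET.Element("adjunctcontainer")
    container.append(adjunct)
    return container


container = to_data_service_form(ET.fromstring(FEED_ADJUNCT), "smid:example")
print(ET.tostring(container, encoding="unicode"))
```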