Extracting HTML

A key feature of YQL is the ability to access data from structured data feeds such as RSS and ATOM. However, if no such feed is available, you can use the html table to get HTML from a page. The html table can parse HTML 4.01 or HTML5 and be used with XPath to extract portions of the HTML page.

Parsing HTML 4.01 Versus HTML5

By default, the html table parses a page according to the HTML 4.01 specification, but it can be configured to use the HTML5 specification instead. To parse using the HTML5 specification, assign the compat key the value "html5", as in the first example below. The response returned when parsing as HTML5 may differ slightly from the HTML 4.01 result.

For example, to get HTML from Yahoo Groups that is parsed according to the HTML5 specification, you would use the following YQL statement containing compat="html5":

select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5"

The same statement without compat="html5" returns HTML parsed according to the HTML 4.01 specification:

select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance"
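
Because the two statements differ only in the compat key, an application can assemble both from one helper. The following Python sketch shows one way to do that. The function name build_html_statement and its parameters are hypothetical and not part of YQL, and the sketch only builds the statement text without sending it to any service.

```python
def build_html_statement(url, compat=None):
    """Assemble a YQL statement for the html table (illustrative helper)."""
    # The page URL is always required and is wrapped in double quotes.
    clauses = ['url="{}"'.format(url)]
    # Omitting compat leaves the default HTML 4.01 parsing in place.
    if compat is not None:
        clauses.append('compat="{}"'.format(compat))
    return "select * from html where " + " and ".join(clauses)


GROUPS_URL = "http://groups.yahoo.com/search?query=surfing&sort=relevance"

html5_statement = build_html_statement(GROUPS_URL, compat="html5")
html4_statement = build_html_statement(GROUPS_URL)
print(html5_statement)
print(html4_statement)
```

Keeping compat optional mirrors the behavior described above: leaving the key out is what selects HTML 4.01 parsing.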

Using XPath

The html table without XPath returns all of the page's HTML, which may not be useful in an application. By adding an XPath expression to the statement, you can retrieve specific portions of the HTML page.

The XPath expression in the following statement traverses the nodes of the HTML page to isolate the individual search results. In this case, the expression matches every li element whose class attribute contains the string hbox groupsSearch-result-entry.

select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]'

The following statement also gets information about Yahoo Groups, but traverses the nodes to get links to the different groups:

select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]/h4/a'
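
To see how the two expressions narrow the result, the same idea can be tried locally. The following Python sketch applies comparable paths to a small, made-up XHTML fragment using the standard library's xml.etree.ElementTree module. The sample markup and the groups.example.com URLs are assumptions for illustration, not real Yahoo Groups output. ElementTree supports only a subset of XPath and has no contains() function, so the sketch matches the class attribute exactly rather than by substring.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a search results page.
SAMPLE_PAGE = """
<html>
  <body>
    <ul>
      <li class="hbox groupsSearch-result-entry">
        <h4><a href="http://groups.example.com/surfing-one">Surfing One</a></h4>
        <p>A group about surfing.</p>
      </li>
      <li class="hbox groupsSearch-result-entry">
        <h4><a href="http://groups.example.com/surfing-two">Surfing Two</a></h4>
        <p>Another group about surfing.</p>
      </li>
      <li class="footer-link"><a href="http://groups.example.com/help">Help</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE_PAGE)

# Exact-match predicate on the class attribute (ElementTree lacks contains()).
ENTRY_PATH = ".//li[@class='hbox groupsSearch-result-entry']"

entries = root.findall(ENTRY_PATH)          # the whole result entries
links = root.findall(ENTRY_PATH + "/h4/a")  # only the anchors inside them

print(len(entries))
print([a.get("href") for a in links])
```

Appending /h4/a to the path drops the descriptive paragraphs and the unrelated footer link, leaving only the anchor elements inside each result entry.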

Selecting Content

Instead of the wildcard asterisk (*), you can specify a particular element to process. To get just the content from an HTML page, specify the content keyword after the word select. A statement with the content keyword processes the HTML in the following order:

  1. It looks for any element named "content" within the elements found.
  2. If an element named "content" is not found, the statement looks for an attribute named "content".
  3. If neither an element nor attribute named "content" is found, the statement returns the element's textContent.
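
The lookup order above can be sketched in a few lines of Python. This is only an illustration of the three steps, not how YQL is implemented. The function name resolve_content is hypothetical, the sample elements are made up, and the sketch assumes that "within the elements found" means any descendant element.

```python
import xml.etree.ElementTree as ET


def resolve_content(element):
    """Mimic the three-step lookup described for the content keyword."""
    # Step 1: look for a descendant element named "content".
    child = element.find(".//content")
    if child is not None:
        return "".join(child.itertext())
    # Step 2: look for an attribute named "content".
    if "content" in element.attrib:
        return element.attrib["content"]
    # Step 3: fall back to the element's own text content.
    return "".join(element.itertext())


# Made-up elements, one for each step of the lookup.
entry = ET.fromstring("<entry><title>Title</title><content>Body text</content></entry>")
meta = ET.fromstring('<meta content="surfing groups"/>')
anchor = ET.fromstring('<a href="http://groups.example.com/surfing-one">Surfing <b>One</b></a>')

print(resolve_content(entry))
print(resolve_content(meta))
print(resolve_content(anchor))
```

The anchor case reaches the third step because it has neither a content element nor a content attribute, so the text of the element and its children is joined and returned.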

You can also select a single attribute by name. For example, the following statement extracts only the link targets (the href attribute values) for Yahoo Groups:

select href from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]/h4/a'

The following statement uses the content keyword to return the textContent of each anchor (a) element retrieved by the XPath expression:

select content from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]/h4/a'
