A key feature of YQL is the ability to access data from structured data feeds such as RSS
and Atom. However, if no such feed is available, you can use the html table to get HTML from a page. The html table can parse HTML 4.01 or HTML5 and can be used with
XPath to extract portions of the HTML page.
By default, the html table parses any HTML page according to the HTML 4.01 specification, but it can be configured to parse according to the HTML5 specification.
To parse according to the HTML5 specification, assign the value "html5" to the compat key, as seen in the first example below. When a page is parsed according to the HTML5 specification, the returned response may be slightly different.
For example, to get HTML from Yahoo! Groups that is parsed according to the HTML5 specification, you
would use the following YQL statement containing compat="html5":
select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5"
The same statement without compat="html5" returns HTML parsed according to the HTML 4.01 specification:
select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance"
The html table without XPath returns all of the page's HTML, which may not be useful in
an application. By adding an XPath expression to the statement, you can retrieve specific
portions of the HTML page.
The XPath expression in the following statement traverses the
nodes in the HTML page to isolate the individual search results. In this case, the XPath expression looks
for li tags whose class attribute contains hbox groupsSearch-result-entry.
select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]'
The following statement also gets information about Yahoo! Groups, but traverses the nodes to get links to the different groups:
select * from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]/h4/a'
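The statements shown so far share one shape: a url clause, an optional compat clause, and an optional xpath clause joined with "and". As a minimal sketch of that shape, the Python snippet below assembles the same statement strings. The helper name build_html_statement and its parameters are hypothetical and are not part of YQL; the sketch only builds text, does not escape quotes inside its arguments, and does not contact any service.

```python
def build_html_statement(url, compat=None, xpath=None, select="*"):
    """Assemble a YQL statement against the html table (hypothetical helper)."""
    clauses = [f'url="{url}"']
    if compat is not None:
        # compat="html5" asks for HTML5 parsing; omitting it leaves the default
        clauses.append(f'compat="{compat}"')
    if xpath is not None:
        # The XPath is wrapped in single quotes so it may contain double quotes
        clauses.append(f"xpath='{xpath}'")
    return f"select {select} from html where " + " and ".join(clauses)


GROUPS_URL = "http://groups.yahoo.com/search?query=surfing&sort=relevance"
RESULT_XPATH = '//li[contains(@class,"hbox groupsSearch-result-entry")]'

# Whole page, parsed according to the HTML5 specification
html5_statement = build_html_statement(GROUPS_URL, compat="html5")

# Only the anchor elements inside each search result
links_statement = build_html_statement(
    GROUPS_URL, compat="html5", xpath=RESULT_XPATH + "/h4/a"
)

print(html5_statement)
print(links_statement)
```

Keeping the XPath in single quotes mirrors the statements above, where the class name inside contains() is written with double quotes.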
Instead of the wildcard asterisk (*), you can specify a particular element to process. To get just the content from an
HTML page, you can specify the content
keyword after the word select. A statement with the
content keyword processes the HTML in the following order:
textContent.
For example, the following statement extracts only the HTML links
(href attributes) for Yahoo! Groups:
select href from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]/h4/a'
The following statement, for example, returns the textContent of each
anchor (a) tag retrieved by the XPath expression:
select content from html where url="http://groups.yahoo.com/search?query=surfing&sort=relevance" and compat="html5" and xpath='//li[contains(@class,"hbox groupsSearch-result-entry")]/h4/a'
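To make the difference between selecting href and selecting content concrete, the following sketch reproduces the idea locally with Python's standard library. It is an analogy only, not how YQL is implemented. The sample markup is invented for illustration, and because xml.etree.ElementTree supports only a subset of XPath (it has no contains() function), the sketch matches the class attribute exactly.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup standing in for a search results page
SAMPLE = """
<ul>
  <li class="hbox groupsSearch-result-entry">
    <h4><a href="/group/example-one">Example <b>One</b></a></h4>
  </li>
  <li class="other">
    <h4><a href="/skip">Skip</a></h4>
  </li>
</ul>
""".strip()

root = ET.fromstring(SAMPLE)

# Exact class match, since ElementTree's XPath subset lacks contains()
anchors = root.findall(".//li[@class='hbox groupsSearch-result-entry']/h4/a")

# Analogous to "select href": keep only the href attribute of each anchor
hrefs = [a.get("href") for a in anchors]

# Analogous to "select content": concatenate all text inside each anchor
texts = ["".join(a.itertext()) for a in anchors]

print(hrefs)
print(texts)
```

Only the first li matches, so each list holds a single entry, and the text of the nested b element is folded into the anchor's text content.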