Scraping HTML documents that require POST data with YQL

YQL is a great tool to scrape HTML from the web and turn it into data to reuse. This is not an illegal act as it can be very useful to reuse information maintained for example on a blog. My personal portfolio page gets most of its data from my blog hosted elsewhere.

Using the in-built YQL table for html allows you to scrape any HTML that allows the YQL server to access it (some sites modify robots.txt to prevent that which is something we comply with). For example, the homepage:

select * from html where url=""

Try this in the console.

The great thing about using this versus simply using cURL to load the data is that YQL runs the result through HTML Tidy to turn it into XML compliant data and removes badly encoded characters, which can be a big nuisance. The other great feature is that you can use XPATH to filter down the data to what you need. If we want all the links of the homepage we can use this:

select * from html where url="" and xpath="//a"

Try this in the console.

One thing that is not that known is that if you only want the text content of an element and still keep the element structure, you can select the content instead of the * wildcard:

select content from html where url="" and xpath="//a"

Try this in the console.

This is all cool and nice, but the problem is that when you need to send POST data to an HTML document before you scrape it you cannot use YQL - as you can't send POST data on the URL. The workaround is to write an open data table with an execute block that does this job for you.

You can use this new table like this:

select * from htmlpost where
and postdata="foo=foo&bar=bar" and xpath="//p"

Try it in the console.

There is a detailed write-up about the why and how of this table available, but here is the excerpt of the table source that is the most important:

var myRequest =;
var data = myRequest.accept('text/html').
var xdata = y.xpath(data,xpath);
response.object =

You define a new request and chain all the necessary data in a single line of JavaScript. As the YQL Execute code runs on the server you have a more powerful API than in the browser, so you can determine that you want html back, that you send the request as a form submission and simply add the POST data as a parameter of the post() method. You can run any xpath transformation over the returned data using the xpath() method. Every script in the execute block should return an object with the XML data and as execute allows for E4X you don't need to mess around with DOM generation of nodes.

This is just one example of the power of YQL Execute, please think up more cases that need solutions and have a go yourself. Submit your table to the GitHub repository or tell us about it on the forums.

Chris Heilmann
Yahoo Developer Network