YQL is a great tool for scraping HTML from the web and turning it into data you can reuse. There is nothing dubious about this in itself: it can be very useful to reuse information that is maintained elsewhere, for example on a blog. My personal portfolio page http://icant.co.uk gets most of its data from my blog, which is hosted elsewhere.
The built-in YQL html table lets you scrape any HTML document that the YQL server is allowed to access (some sites use robots.txt to prevent that, and we comply with it). For example, the cnn.com homepage:
select * from html where url="http://cnn.com"
The great thing about using this rather than simply loading the data with cURL is that YQL runs the result through HTML Tidy, which turns it into XML-compliant data and removes badly encoded characters that can be a big nuisance. The other great feature is that you can use XPath to filter the data down to what you need. If we want all the links on the cnn.com homepage, we can use this:
select * from html where url="http://cnn.com" and xpath="//a"
One thing that is not widely known is that if you only want the text content of an element while still keeping the element structure, you can select content instead of the * wildcard:
select content from html where url="http://cnn.com" and xpath="//a"
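To run a statement like the ones above from your own code, you send it URL-encoded to the YQL web service. Here is a minimal JavaScript sketch of that step. The endpoint address and the `q` and `format` parameters are assumptions based on the public YQL REST interface as it was documented, and `buildYqlUrl` is a hypothetical helper name, so check both against the service you are calling.

```javascript
// Assumption: the public YQL REST endpoint as historically documented.
const YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql";

// Hypothetical helper: turn a YQL statement into a request URL.
// encodeURIComponent escapes the spaces, quotes, colons and slashes
// in the statement so it survives as a single query parameter.
function buildYqlUrl(query, format = "json") {
  return (
    YQL_ENDPOINT +
    "?q=" + encodeURIComponent(query) +
    "&format=" + encodeURIComponent(format)
  );
}

const query =
  'select content from html where url="http://cnn.com" and xpath="//a"';
const requestUrl = buildYqlUrl(query);
console.log(requestUrl);
```

The resulting URL can be passed to any HTTP client. Asking for `xml` instead of `json` only changes the `format` parameter.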
This is all cool and nice, but there is a problem: when you need to send POST data to an HTML document before you scrape it, you cannot use the html table, as you can't send POST data in the URL. The workaround is to write an open data table with an execute block that does this job for you.
You can use this new table like this:
select * from htmlpost where url='http://isithackday.com/hacks/htmlpost/index.php' and postdata="foo=foo&bar=bar" and xpath="//p"
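The postdata value in the statement above is a plain form-encoded string, so you can build it from a set of form fields. The following JavaScript sketch shows one way to do that. `encodeFormData` and `buildHtmlPostQuery` are hypothetical helper names, and the sketch assumes the URL contains no single quotes and the XPath contains no double quotes.

```javascript
// Encode a plain object as a form-encoded string (key=value&key=value).
// Note: encodeURIComponent writes spaces as %20 rather than +, which
// servers generally accept for form data.
function encodeFormData(fields) {
  return Object.keys(fields)
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(fields[key]))
    .join("&");
}

// Hypothetical helper: assemble a statement for the htmlpost table.
// Assumes url has no single quotes and xpath has no double quotes.
// encodeURIComponent escapes double quotes, so the encoded postdata
// cannot break out of its double-quoted value.
function buildHtmlPostQuery(url, fields, xpath) {
  return (
    "select * from htmlpost where url='" + url + "'" +
    ' and postdata="' + encodeFormData(fields) + '"' +
    ' and xpath="' + xpath + '"'
  );
}

const statement = buildHtmlPostQuery(
  "http://isithackday.com/hacks/htmlpost/index.php",
  { foo: "foo", bar: "bar" },
  "//p"
);
console.log(statement);
```

With the fields foo and bar this reproduces the statement shown above. Values containing spaces or ampersands are escaped before they end up in postdata.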
A detailed write-up about the why and how of this table is available, but here is the most important excerpt of the table source:
var myRequest = y.rest(url);
var data = myRequest.accept('text/html').
  contentType("application/x-www-form-urlencoded").
  post(postdata).response;
var xdata = y.xpath(data,xpath);
response.object =
This is just one example of the power of YQL Execute. Please think up more cases that need solutions and have a go yourself. Submit your table to the GitHub repository or tell us about it on the forums.
Yahoo Developer Network