YQL is a great tool for scraping HTML from the web and turning it into reusable data. Scraping is not in itself an illegal act, and it can be very useful for reusing information you maintain elsewhere, for example on a blog. My personal portfolio page, http://icant.co.uk, gets most of its data from my blog, which is hosted elsewhere.
The built-in YQL html table lets you scrape any HTML document that the YQL server is allowed to access (some sites use robots.txt to block it, and we comply with that). For example, to get the cnn.com homepage:
select * from html where url="http://cnn.com"
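To run a query like this from a script, it has to be URL-encoded and sent to a YQL endpoint. Here is a minimal sketch in Python that only builds the request URL and makes no network call. The endpoint address, the `format` parameter value and the helper name `build_yql_url` are assumptions for illustration and are not taken from the text above.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: the public YQL endpoint; it is not stated in the article itself.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(query, fmt="xml"):
    """Return a GET URL that would submit `query` to the assumed endpoint.

    `build_yql_url` is a hypothetical helper, not part of any YQL library.
    """
    # urlencode escapes spaces, quotes and the nested URL inside the query.
    params = {"q": query, "format": fmt}
    return YQL_ENDPOINT + "?" + urlencode(params)


url = build_yql_url('select * from html where url="http://cnn.com"')
print(url)
```

Encoding matters here because the query itself contains quotes, spaces and another URL, all of which would otherwise break the request.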
The great thing about using YQL rather than simply loading the data with cURL is that YQL runs the result through HTML Tidy, which turns it into XML-compliant data and removes badly encoded characters that can be a big nuisance. The other great feature is that you can use XPath to filter the data down to what you need. If we want all the links on the cnn.com homepage, we can use this:
select * from html where
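To see why tidied, XML-compliant markup makes link extraction easy, the same idea can be tried locally. The sketch below uses Python's standard `xml.etree.ElementTree` on a small hand-written sample. The sample markup, the helper name `extract_links` and the path `.//a` (a generic "every anchor element" expression) are assumptions chosen for illustration; this is not the query used in the article.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed markup standing in for tidied HTML.
SAMPLE = """<html><body>
  <div><a href="/world">World</a></div>
  <p>No link here, but <a href="/tech">Tech</a> has one.</p>
</body></html>"""


def extract_links(xml_text):
    """Return (href, text) pairs for every <a> element, in document order."""
    root = ET.fromstring(xml_text)
    # ".//a" is ElementTree's limited-XPath form for "all a elements below root".
    return [(a.get("href"), a.text) for a in root.findall(".//a")]


links = extract_links(SAMPLE)
print(links)
```

This only works because the input is well-formed XML; raw HTML from the web often is not, which is why the tidying step matters.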