Opening the web and retrieving all the goodies

The internet is an interesting thing, as it is a bit like the Matrix. Whilst normal end users see something like this:

thedailypuppy.com

Developers have the more outside-the-matrix point of view as we tend to look at the data behind the facade:

thedailypuppy.com with source code

And if you are one of the true believers in Web 2.0/Web 3.0, where the web is the platform and the framework, then it turns into something like this:

wall of sweets

There is nothing better than yummy, yummy data that you can retrieve and mix with the right ingredients and spices to create something that is healthier, more nutritious or even caters for special diets. In essence, giving access to your data will make your product all the more successful, as other chefs can cater for you.

Getting to the yummy parts of one or several sources can be a bit of a problem though. Imagine a tin of good solid food you want to get to. The easiest and most versatile tool would be a Swiss Army knife with a can opener.

swiss army knife

The web equivalent of a pocket knife is cURL, a command-line tool and library that allows a developer to make scripts behave like a browser and get access to the source of any web site or web service. You can, for example, go to the command line and simply enter the following:

curl --url http://www.thedailypuppy.com

The result is the source code of the page, which you can run through other commands to get to the bits you want to retrieve.

The same works for RSS feeds or other types of data:

curl --url http://thedailypuppy.com/rss

cURL is amazingly powerful when you know how to use it - you can simulate other user agents, send and retrieve data, even spoof cookies. However, just like with the Swiss Army knife, you'll have to put a lot of work and effort into getting to the goodies. Regular expressions are most likely the most versatile way to do that, but even for developers they are not something that comes to mind easily.
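To give a taste of the regular expression route, here is a minimal sketch in Python. The feed snippet is made up for illustration (it is not the real Daily Puppy feed), and the pattern assumes each title is the first element inside its item, which real feeds do not guarantee.

```python
import re

# Invented sample that stands in for the source curl would return
sample_feed = """<rss version="2.0"><channel>
<title>The Daily Puppy</title>
<item><title>Puppy One</title></item>
<item><title>Puppy Two</title></item>
</channel></rss>"""

# Capture the text of a <title> that directly follows an opening <item>,
# which skips the channel-level title
item_titles = re.findall(r"<item>\s*<title>(.*?)</title>", sample_feed, re.DOTALL)

print(item_titles)
```

It works for this snippet, but a small change in the markup (attributes on the item, a different element order) would break the pattern, which is exactly the kind of effort described above.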

What the web needed was a very fast electric can opener, perhaps one coupled with a microwave to pre-heat your dish. The equivalent of that would be Yahoo Pipes.

Cool electrical can opener

Yahoo Pipes is amazingly powerful as it gives you a very handy and beautiful interface to remix the web:

Pipes interface

This pipe, for example, searches twitter.com for my name and filters out common false positives.
The outcome of your pipe laying is then available as a very simple URL that can take parameters and give you the output in a lot of different formats.

If that is too low-level for you and all you want to do is show a badge whose look and feel you can change, this is possible, too:

Pipes Badge options

And this is where it got tricky. Whenever you build an interface that is beautiful, intuitive and terribly powerful, you will get one request: can we have a command line interface to this? This is just how developers roll; there is not much we can do about it.

The other issue with Pipes is that it is high maintenance to some degree. Whilst you can provide parameters, it is still a very graphical interface that is impossible to use for somebody who, for example, cannot use a mouse or see the interface. This might not be a large group, but I myself find a keyboard tool like Quicksilver easier than dragging and dropping and using my mouse a lot. When you want to change the functionality of a pipe beyond its parameters, you'll need to go back to the editor, something that made several people unhappy, too. In other words, we needed a good, sturdy can opener that doesn't need batteries.

This is where the newest tool to open the web comes in: Yahoo Query Language, or YQL for short. With YQL you have a SQL-style syntax to get very detailed information from all the services Yahoo offers the world, and you can also access the web through it.

The easiest way to try out YQL is the interactive console at https://developer.yahoo.com/yql/console/. There you can select from a lot of demo queries and see the outcome live below your query.

The YQL console

The real power of YQL lies in using and mixing Yahoo services and - with authentication - the Yahoo Social graph. However, for now let's just look at another thing to do: remix the web. If you scroll down on the right-hand side you'll find "Available Data Tables", and there is a "data" sub-menu with the items atom, csv, feed, html, json, rss and xml.

These tables can be used to create YQL queries for anything on the web. Say, for example, you only want the names of the latest dailypuppy.com puppies. This can be done with the statement select title from feed where url='http://feeds.feedburner.com/TheDailyPuppy'. Wrapped in the correct REST call, it becomes:

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20feed%20where%20url%3D'http%3A%2F%2Ffeeds.feedburner.com%2FTheDailyPuppy'&format=xml

Notice that you need to add "public" before "yql" in the URL path to use the information without authentication!
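If you would rather not encode the query by hand, a few lines of Python can do it. This is only a sketch: the function name build_yql_url and the choice of which characters to leave unescaped are my assumptions, picked so that the output matches the URL above. It builds the URL and does not send any request.

```python
from urllib.parse import quote

# Public endpoint as it appears in the URL above
YQL_PUBLIC_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(query, fmt="xml"):
    """Return the REST URL for a YQL statement (hypothetical helper)."""
    # Keep ' and * unescaped so the result looks like the hand-written URL;
    # spaces, =, :, / and quotes are percent-encoded
    encoded_query = quote(query, safe="'*")
    return YQL_PUBLIC_ENDPOINT + "?q=" + encoded_query + "&format=" + quote(fmt)


puppy_query = "select title from feed where url='http://feeds.feedburner.com/TheDailyPuppy'"
print(build_yql_url(puppy_query))
```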

If you want the data in JSON and wrapped in a function called myPuppies, set the format parameter to json and add a callback parameter:

http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20feed%20where%20url%3D'http%3A%2F%2Ffeeds.feedburner.com%2FTheDailyPuppy'&format=json&callback=myPuppies
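The callback wrapper is meant for script tags in the browser, but if you ever need to read such a response in a script, you can strip the wrapper first. The following Python sketch shows one way to do that. The response body is invented for illustration, and its layout (query, results, item) is an assumption, not captured output; unwrap_jsonp is a hypothetical helper name.

```python
import json
import re


def unwrap_jsonp(payload, callback):
    """Strip the callback(...) wrapper and parse the JSON inside it."""
    pattern = r"^\s*" + re.escape(callback) + r"\s*\((.*)\)\s*;?\s*$"
    match = re.match(pattern, payload, re.DOTALL)
    if match is None:
        raise ValueError("payload is not wrapped in " + callback + "()")
    return json.loads(match.group(1))


# Invented response body; the field names are assumed for this example
sample_response = (
    'myPuppies({"query": {"count": 2, "results": '
    '{"item": [{"title": "Puppy One"}, {"title": "Puppy Two"}]}}});'
)

data = unwrap_jsonp(sample_response, "myPuppies")
titles = [item["title"] for item in data["query"]["results"]["item"]]
print(titles)
```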

Where it gets really interesting is the html option. Whilst Pipes has the option to retrieve an HTML document and get it as a string, YQL goes further and actually allows you to run XPath queries over the HTML document. Say you want to get all the latest images in my blog posts. You could use select * from html where url="http://www.wait-till-i.com" and xpath='//div[@id="content"]//img' for this:

http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url%3D%22http%3A%2F%2Fwww.wait-till-i.com%22%20and%20xpath%3D'%2F%2Fdiv%5B%40id%3D%22content%22%5D%2F%2Fimg'&format=xml
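To get a feel for what that XPath expression selects, you can try it locally on a small document. The sketch below uses Python's built-in xml.etree.ElementTree. The markup is made up and has to be well-formed XML for this parser, and ElementTree only supports a subset of XPath, so the expression starts with .// instead of //.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample page; real-world HTML is rarely this tidy
sample_page = """<html><body>
<div id="header"><img src="logo.png" alt="logo" /></div>
<div id="content">
  <p><img src="puppy1.jpg" alt="first" /></p>
  <div class="post"><img src="puppy2.jpg" alt="second" /></div>
</div>
</body></html>"""

root = ET.fromstring(sample_page)

# All img elements at any depth below the div with id "content"
images = root.findall(".//div[@id='content']//img")
sources = [img.get("src") for img in images]

print(sources)
```

The image in the header div is left out because it is not a descendant of the content div.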

The opportunities are endless, especially once you dive deeper into the YQL documentation and learn about joining queries.

Want more? Comment about your needs and wishes :)

Chris Heilmann
Yahoo Developer Network