
Overcoming YQL barriers for large projects

I was super enthusiastic about YQL and was considering it for my next project. But the more I used it, the more barriers I met. I've tried to find workarounds for them, but as they pile up, I'm starting to think that YQL is more of a toy and not suitable for any larger project.

I know that I'm not alone. So here is a list of the biggest problems I encountered while developing with YQL and possible solutions to them. Please share your experience using YQL for your projects.

It would be nice to get some thoughts on using YQL for large projects from people behind YQL itself.

  1. Sometimes YQL can't fetch data from websites, because they restrict it with robots.txt.
    Solution: use your own fetcher.
  2. Sometimes YQL's IP gets blocked by websites (as the Twitter API does now) and you can't get data from them.
    Solution: use your own IP.
  3. When fetching a web page, the HTML returned by YQL is often modified (cleaned up?), with some elements removed that often contain the data you're looking for.
    Solution: use your own HTML parser.
  4. All your queries and the source code of your Table / Execute scripts are public, so they can't include any secret business logic or private information.
    Solution: proxy your YQL queries through Pipes and use a Private String module to hide your YQL query and make it visible only to you. Or make your own proxy.
  5. Sometimes your table gets blocked and becomes inaccessible due to too many requests or too long an execution time.
    Solution: when your table gets blocked, add a random number to your table URL (say table.xml?123), which YQL loads as a new table. Set your scripts to add a random number from 1 to 10 to run 10 clones of your table, thus distributing the limits across 10 tables.
  6. Your YQL tables and Execute scripts are cached, so even if you change your table.xml or an Execute script within it, YQL will use the old version until the cache expires.
    Solution: add a random number to the URLs of your tables and of all scripts included within Execute.
  7. YQL's HTTP cache expires in 1 minute. In other words, YQL keeps your HTTP requests in a cache for 1 minute and tries to fetch new data only when this time expires. That's good if you always want fresh data, but it also increases the response time of your YQL query.
    Solution: if you want a new version of the data more frequently, just add a timestamp (like &1247132010) at the end of your URL. If you want a longer cache, set up your own cache.
  8. YQL does not cache its response and executes the query every time it's called. That's good if your query is simple and fast and should generate new data every time it's called (like returning data based on the current date). But if your query is very complex (say it does some complex filtering with JavaScript that takes 20s to execute), it's a complete waste of the user's time to generate the same response with every single request.
    Solution: proxy your request through Pipes, which will cache the YQL response for ~1-2 hours, so you get your data from the cache in 0.4 seconds (average Pipes and YQL response time) instead of waiting 20s for it to be generated. Again, if you need to control caching, you can add a timestamp to your URLs.
  9. Creating Open Tables is cumbersome. It's much easier and more flexible to use YQL as a server-side JavaScript execution service that takes JavaScript code, executes it and returns the results.
    Solution: create a single Open Table that takes a single "code" parameter, executes it with eval(code) and returns the results. Or create such a service on your own host.
  10. YQL may change anything they like anytime they like: API, terms, pricing, request/bandwidth limits; they could start caching YQL responses, extend the cache for HTTP requests, etc.
    Solution: create your own server-side javascript service.
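
The cache-busting workaround in items 6 and 7 comes down to appending a throwaway query parameter to a URL. Here is a minimal Python sketch of that idea. The parameter name `_` and the example.com table URL are assumptions made for illustration, and whether YQL really treats each variant as a fresh URL is only what the list above reports, not something this sketch verifies.

```python
import time
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def add_cache_buster(url, token=None, param="_"):
    """Return url with an extra query parameter so caches see a new URL.

    `param` and the default timestamp token are illustrative choices,
    not something YQL prescribes.
    """
    if token is None:
        token = int(time.time())  # seconds since the epoch, like 1247132010
    parts = urlsplit(url)
    # Keep any query parameters that are already on the URL.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, str(token)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical table location, used only for illustration.
print(add_cache_buster("http://example.com/table.xml", token=1247132010))
```

Passing a fixed token (say, a version number you bump on each deploy) instead of the current time would keep the cache useful between changes, which is probably closer to what item 6 needs.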


Currently I'm considering two solutions for these problems:
  1. Proxy service. Create a simple script on App Engine that receives a YQL query and a cache interval (like 1 hour), gets the data from YQL and caches it using memcache. It should be a lot faster than YQL too: App Engine's response time is 150-200ms, while YQL's is 400-450ms. It should solve problems 4, 7 and 8.
  2. Server-Side JavaScript service. Create a script on App Engine Java+Rhino that takes JavaScript code, executes it and returns its results. It's a lot more powerful and faster than YQL, without any YQL limits. You get all Google App Engine APIs, power and speed, and flexible pricing. You run on your own domain. You have complete control over how a script behaves.
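
The proxy service from option 1 could look roughly like the sketch below. It is not App Engine code: a plain dict stands in for memcache, and `fetch` is a hypothetical callable that would send the query to YQL and return the response body. The class and parameter names are made up for illustration.

```python
import time


class CachingProxy:
    """Cache query results for a caller-chosen interval.

    `fetch` is any callable that takes a query string and returns the
    response body; in the setup described above it would call YQL.
    The dict stands in for memcache.
    """

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch
        self._clock = clock
        self._store = {}  # query -> (expires_at, body)

    def get(self, query, ttl_seconds):
        now = self._clock()
        hit = self._store.get(query)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh, skip the slow upstream call
        body = self._fetch(query)
        self._store[query] = (now + ttl_seconds, body)
        return body


# Usage with a stand-in fetcher (no network access needed).
calls = []


def fake_fetch(query):
    calls.append(query)
    return "result for " + query


proxy = CachingProxy(fake_fetch)
proxy.get("select * from html where url='http://example.com'", 3600)
proxy.get("select * from html where url='http://example.com'", 3600)
print(len(calls))  # the second call is served from the cache
```

Because the caller supplies the interval, the same proxy covers both the "cache is too short" case (problem 7) and the "no response cache at all" case (problem 8).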


Please share your thoughts below.

6 Replies
  • QUOTE (Aurimas Rimsa @ Jul 9 2009, 02:52 AM)

    While I'm not really a fan of going through these one by one, let me try and address these.

    1. Yes, we respect robots.txt and the rights of the content provider. Ignoring the intent of the provider is abusive and will only lead to problems for any application. It's not hard to block IPs or user agents.

    2. Although we have agreements with certain larger web sites, like Google (and obviously Yahoo web services have no such issues), this is a concern we are aware of. Twitter is especially tricky since they say one thing but the behavior of their system is different. We do forward the "last known valid" IP address connecting to YQL in the x-forwarded-for HTTP header, which web API providers could use instead of the IPs of our proxies. Unfortunately this is something of an education problem rather than a technical one. We send traffic from only a small number of proxies, so providers would only need to add them to a list of IPs they trust to provide valid (unspoofed) x-forwarded-for header values, and use the IP specified there. We're also examining letting developers using YQL specify their own proxy servers, which I think would fit very well into your current model.

    3. Yep, we clean the HTML up for parsing. It's based on tidy, which most people would use too.

    4. "Secrets" can be separated in a number of ways - you could pass them as query parameters (which are substituted into the same parameter name @name when the YQL statement is parsed) when using the table over https for example. However, there are definitely situations where you don't want people to see the table definition and/or pass the secrets across, and it's something we're working on too.

    5. It gets blocked for a reason - it's going over the unit or time quota we have set. Tables don't get blocked otherwise. Too many table blocks will result in an IP block too.

    6. Yeah, it annoys us too. We've got a simple fix in the works.

    7. Caching by *default* is 5 minutes. You can fully configure this in your open data tables by specifying the cache-control headers yourself.

    8. Yep, that's not good for us either. We're working on it.

    9. Right. Open Data Tables are designed to provide logical sets (or rows) of data, not to be an app engine. Are you thinking of something like "stored procedures" for your needs?

    10. Sure, you can build it yourself.
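
The x-forwarded-for handling described in point 2 can be sketched from the provider's side as follows. This is only an illustration: the function name, the plain-dict `headers` argument and the rule of taking the last entry in the header are assumptions, and the addresses in the usage lines are documentation-range placeholders rather than real YQL proxy IPs.

```python
def effective_client_ip(remote_addr, headers, trusted_proxies):
    """Pick the IP address to apply rate limits or blocks to.

    X-Forwarded-For is honoured only when the direct peer is a trusted
    proxy; otherwise anyone could spoof the header.
    """
    if remote_addr not in trusted_proxies:
        return remote_addr
    forwarded = headers.get("X-Forwarded-For", "")
    entries = [e.strip() for e in forwarded.split(",") if e.strip()]
    # Assumption: the trusted proxy appends the address it saw last.
    return entries[-1] if entries else remote_addr


# Placeholder addresses for illustration only.
trusted = {"192.0.2.10"}
print(effective_client_ip("192.0.2.10", {"X-Forwarded-For": "203.0.113.7"}, trusted))
print(effective_client_ip("198.51.100.5", {"X-Forwarded-For": "203.0.113.7"}, trusted))
```

With a check like this, a provider would throttle the individual developer behind the proxy instead of the shared proxy address.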

    Remember YQL is an evolving product. We're not "done". We don't just push it out once a year, but make changes continuously. Sometimes the new capabilities we add require other great functionality to come later. If there is something we can add that makes sense we'll add it. Why not let us know what's up with the HTML tidy and we'll see what we can do.

    Thanks for your questions and concerns. I hope my answers help. It's good, from our perspective, to see that the things we have identified ourselves as needing to be addressed will be valuable.

    Jonathan
  • Wow, thank you for clearing things up. Seems I've gotten some things wrong about table blocking and HTTP request caching. Also, I was approaching Open Tables from the wrong angle. I was so excited about the Execute element that I missed that everything I was trying to implement with JavaScript can be accomplished much more easily with just YQL queries and the right tables.

    I think I will give it another shot. It's just too much fun to develop cool new things in just minutes using nothing but html+css+js and the whole web as a database. Still, the lack of response caching and of private tables remain the biggest drawbacks for developing anything more serious. It's great to know that you're working on these issues.
  • To address most of these issues, I'm taking the Server Side JavaScript approach. This way, you can run (and hide) whatever is necessary on the server side and serve plain HTML to the browser (and search engines). Here's a sample using Jaxer on Aptana's Cloud offering, including YQL and the PURE rendering engine.

    http://bit.ly/YQLive

    In the initial run, the entire page gets rendered on the server. Then every 10 seconds there's a refresh that performs a YQL query and renders the result on the server in a so-called Sandbox, which returns the rendered DOM fragment to the browser. The sources are available in separate links, because all server-side stuff is hidden (which is good!).

    -Ivo


    Building blocks:
    - http://www.aptana.com/cloud
    - http://beebole.com/pure
  • I was looking into Jaxer recently and was absolutely blown away by how easy it is. But the fact that you have to use Aptana Cloud held me back from sticking with it. It simply can't compare with Google App Engine + Rhino on either price or performance. GAE+Rhino needs some setting up at the beginning, but once you get through that, you have a pretty decent server-side JavaScript platform: a true cloud, Google scalability and performance, free to start, and very flexible pricing as you get bigger. If only someone could port Jaxer or the recently announced Joyent Smart Platform to Google App Engine, it would make an awesome platform for server-side JavaScript.
  • QUOTE (Aurimas Rimsa @ Jul 10 2009, 06:10 AM)
    But the fact that you have to use Aptana Cloud held me back from sticking with it.


    Jaxer is Open Source Software and you're free to run it in your own cloud. Aptana's Cloud offering - which includes Jaxer - is just a matter of convenience and a valid business model for Aptana. When you live in the EU, paying 20 USD/month for that convenience is about the price of a deluxe cup of coffee.

    So just consider it as a use case to circumvent most of the YQL issues that you mentioned. The main message here is that you can put EVERYTHING on the client (at least from a mental and programming perspective).

    -Ivo

    http://www.aptana.com/jaxer
  • Yes, I know that it's open source, and I was actually considering using it on EC2 (since you can't use it on App Engine, right?). But maybe I'm a little spoiled by App Engine, because EC2 seems too complex and too expensive for startups. Though I agree that $20 is really a bargain for all the possibilities and convenience that Aptana Jaxer + Cloud offers. What I wonder about is how it scales. I see that Jaxer looks pretty good in benchmarks, but I'm curious how it handles large spikes of traffic. There isn't any clear answer in their support forums, so it seems I'll have to find out by myself. After all, Jaxer got me very excited and I will surely continue playing and experimenting with it.
