- Sometimes YQL can't fetch data from websites because they restrict it with robots.txt.
Solution: use your own fetcher.
- Sometimes YQL's IP gets blocked by websites (as the Twitter API does now), and you can't get data from them.
Solution: use your own IP.
- When fetching a web page, the HTML returned by YQL is often modified (cleaned up?), with some elements removed that often contain the data you're looking for.
Solution: use your own HTML parser.
- All your queries and the source code of your Table / Execute scripts are public, so they can't include any secret business logic or private information.
Solution: proxy your YQL queries through Pipes and use a Private String module to hide your YQL query and make it visible only to you. Or make your own proxy.
- Sometimes your table gets blocked and becomes inaccessible due to too many requests or too long an execution time.
Solution: when your table gets blocked, add a random number to your table URL (say table.xml?123), which YQL loads as a new table. Set your scripts to add a random number from 1 to 10 to run 10 clones of your table, thus distributing the limits across 10 tables.
- Your YQL tables and Execute scripts are cached, so even if you change your table.xml or an Execute script within it, YQL will use the old version until the cache expires.
Solution: add a random number to the URLs of your tables and of all scripts included within Execute.
- YQL's HTTP cache expires in 1 minute. In other words, YQL will keep your HTTP requests in a cache for 1 minute and try to fetch new data only when this time expires. That's good if you want to always get fresh data, but it also increases the response time of your YQL query.
Solution: if you want to get a new version of the data more frequently, just add a timestamp (like &1247132010) at the end of your URL. If you want a longer cache, set up your own cache.
- YQL does not cache its response and executes the query every time it's called. That's good if your query is simple and fast and should generate new data every time it's called (like returning data based on the current date). But if your query is very complex (say it does some complex filtering with JavaScript that takes 20s to execute), it's a complete waste of the user's time to generate the same response with every single request.
Solution: proxy your request through Pipes, which will cache the YQL response for ~1-2 hours, so you get your data from the cache in 0.4 seconds (average Pipes and YQL response time) instead of waiting 20s for it to be generated. Again, if you need to control caching, you can add a timestamp to your URLs.
- Creating Open Tables is cumbersome. It's much easier and more flexible to use YQL as a server-side JavaScript execution service that takes JavaScript code to execute and returns the results.
Solution: create a single Open Table that takes a single "code" parameter, executes it with eval(code), and returns the results. Or create such a service on your own host.
- YQL may change anything they like anytime they like: API, terms, pricing, request/bandwidth limits, start caching YQL responses, extend the cache for HTTP requests, etc.
Solution: create your own server-side JavaScript service.
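The "add a random number" and "add a timestamp" workarounds above both come down to appending a throwaway token to a URL's query string. Here is a minimal Python sketch of that idea. The helper names (`add_cache_buster`, `clone_url`, `fresh_url`) and the example.com URLs are made up for illustration and are not part of YQL.

```python
import random
import time
from urllib.parse import urlsplit, urlunsplit


def add_cache_buster(url, token):
    """Append a bare token to the query string, e.g. table.xml -> table.xml?123."""
    parts = urlsplit(url)
    query = f"{parts.query}&{token}" if parts.query else str(token)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


def clone_url(url, clones=10, rng=random):
    """Pick one of `clones` URL variants at random, spreading requests across them."""
    return add_cache_buster(url, rng.randint(1, clones))


def fresh_url(url, now=None):
    """Append the current Unix timestamp so each second yields a new URL."""
    stamp = int(time.time() if now is None else now)
    return add_cache_buster(url, stamp)


if __name__ == "__main__":
    print(clone_url("http://example.com/table.xml"))
    print(fresh_url("http://example.com/data?format=json"))
```

`clone_url` corresponds to the "10 clones" trick for blocked tables, while `fresh_url` corresponds to the timestamp trick for bypassing a cache. Whether a given cache treats the extra token as a new URL depends on that cache.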
Currently I'm considering two solutions for these problems:
- Proxy service. Create a simple script on App Engine that receives a YQL query and a cache interval (like 1 hour), gets the data from YQL, and caches it using memcache. It should be a lot faster than YQL too: App Engine's response time is 150-200 ms, while YQL's is 400-450 ms. It should solve problems 4, 7, and 8.
- Server-side JavaScript service. Create a script on App Engine (Java + Rhino) that takes JavaScript code to execute and returns its results. It's much more powerful and faster than YQL, without any YQL limits. You get all the Google App Engine APIs, power, and speed, plus flexible pricing. You run on your own domain. You have complete control over how a script behaves.
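The core of the proxy idea is a cache keyed by the query, with a caller-chosen expiry. Here is a rough Python sketch of that logic only. It uses an in-memory dict as a stand-in for memcache, and the `CachingProxy` class and the injected `fetch` function are hypothetical names, not App Engine or YQL APIs.

```python
import time


class CachingProxy:
    """Cache responses per query for a caller-chosen interval.

    `fetch` is any function that takes a query string and returns the
    response (for example, one that calls YQL over HTTP). The dict below
    is only a stand-in for a shared cache such as memcache.
    """

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch
        self._clock = clock
        self._cache = {}  # query -> (expires_at, response)

    def get(self, query, ttl_seconds=3600):
        now = self._clock()
        hit = self._cache.get(query)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the slow upstream call
        response = self._fetch(query)
        self._cache[query] = (now + ttl_seconds, response)
        return response


if __name__ == "__main__":
    proxy = CachingProxy(fetch=lambda q: f"result for {q!r}")
    print(proxy.get("select * from html where url='http://example.com'"))
```

Because the upstream fetcher is passed in, the query itself stays on the proxy side (problem 4), and the cache interval is under your control (problems 7 and 8).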
Please share your thoughts below.
While I'm not really a fan of going through these one by one, let me try and address these.
1. Yes, we respect robots.txt and the rights of the content provider. Ignoring the intent of the provider is abusive and will only lead to problems for any application. It's not hard to block IPs or user agents.
2. Although we have agreements with certain larger web sites, like Google (and obviously Yahoo web services have no such issues), this is a concern we are aware of. Twitter is especially tricky since they say one thing but the behavior of their system is different. We do forward the "last known valid" IP address connecting to YQL in the x-forwarded-for HTTP header, which web API providers could use instead of the IPs of our proxies. Unfortunately this is something of an education problem rather than a technical one. We send traffic from only a small number of proxies, so providers would only need to add them to a list of IPs they trust to provide valid (unspoofed) x-forwarded-for header values, and use the IP specified there. We're also examining letting developers using YQL specify their own proxy servers, which I think would fit very well into your current model.
3. Yep, we clean the HTML up for parsing. It's based on tidy, which most people would use too.
4. "Secrets" can be separated in a number of ways - you could pass them as query parameters (which are substituted into the same parameter name @name when the YQL statement is parsed) when using the table over https, for example. However, there are definitely situations where you don't want people to see the table definition and/or pass the secrets across, and it's something we're working on too.
5. It gets blocked for a reason - it's going over the unit or time quota we have set. Tables don't get blocked otherwise. Too many table blocks will result in an IP block too.
6. Yeah, it annoys us too. We've got a simple fix in the works.
7. Caching by *default* is 5 minutes. You can fully configure this in your open data tables by specifying the cache-control headers yourself.
8. Yep, that's not good for us either. We're working on it.
9. Right. Open Data Tables are designed to provide logical sets (or rows) of data, not to be an app engine. Are you thinking of something like "stored procedures" for your needs?
10. Sure, you can build it yourself.
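As an illustration of point 2, a provider-side check could look roughly like the Python sketch below. It is an assumption-laden example, not YQL's or any provider's actual code: the function name `client_ip` is made up, the addresses are documentation placeholders, and it assumes the trusted proxy appends the connecting address as the last entry of the header.

```python
import ipaddress


def client_ip(peer_ip, headers, trusted_proxies):
    """Return the address to rate-limit or block on.

    If the connection comes from a proxy on the trusted list, use the
    address it reports in X-Forwarded-For; otherwise ignore the header,
    since anyone can spoof it.
    """
    if peer_ip not in trusted_proxies:
        return peer_ip
    lowered = {name.lower(): value for name, value in headers.items()}
    forwarded = lowered.get("x-forwarded-for", "")
    # Assumption: the last entry is the one the trusted proxy added.
    candidate = forwarded.split(",")[-1].strip()
    try:
        ipaddress.ip_address(candidate)
    except ValueError:
        return peer_ip  # missing or malformed header: fall back to the peer
    return candidate


if __name__ == "__main__":
    trusted = {"203.0.113.5"}  # placeholder for the proxy addresses
    print(client_ip("203.0.113.5", {"X-Forwarded-For": "198.51.100.7"}, trusted))
```

With a check like this, a block would land on the forwarded address rather than on the shared proxy.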
Remember YQL is an evolving product. We're not "done". We don't just push it out once a year, but make changes continuously. Sometimes the new capabilities we add require other great functionality to come later. If there is something we can add that makes sense we'll add it. Why not let us know what's up with the HTML tidy and we'll see what we can do.
Thanks for your questions and concerns. I hope my answers help. It's good, from our perspective, to see that the things we had already identified as needing work will be valuable.
Jonathan