500 'Server not Found' after 'select * from html'

The following

select * from html where url = "http://userscripts.org/"

invariably returns a 500 'Server not Found' error. The same for any
page under the userscripts.org domain. With any other url it continues to work fine.

The above used to work until a couple of weeks ago. I didn't change anything, it just suddenly stopped working. I contacted the server admin of userscripts.org, he didn't change anything either. I'm clueless.

by
  • x
  • Jul 26, 2009
7 Replies
    • x
    • Jul 27, 2009
    The problem exists in other domains, also. I thought the cause might be the crawler being banned by robots.txt, but when selecting from a path that is disallowed in robots.txt, the error message is different, as the following examples show.

    A) Domain: www.greasespot.net
    robots.txt:

    User-agent: *
    Disallow: /search


    1) select * from html where url="http://www.greasespot.net"
    Works fine.

    2) select * from html where url="http://www.greasespot.net/search"

    Returns "Error Retrieving Data from External Service"
    <forbidden>robots.txt for that domain disallows crawling for that url</forbidden>

    B) Domain: userscripts.org
    robots.txt:

    User-agent: *
    Disallow: /scripts/source/
    Disallow: /scripts/version/
    Disallow: /scripts/diff/
    Disallow: /users
    Disallow: /reviews/new
    Disallow: /posts/preview


    1) select * from html where url="http://userscripts.org/"

    <url error="Server returned HTTP response code: 500 for URL: http://userscripts.org/" execution-time="86" http-status-code="500" http-status-message="Internal Server Error"><![CDATA[http://userscripts.org/]]></url>


    2) select * from html where url="http://userscripts.org/scripts/version/"
    <forbidden>robots.txt for that domain disallows crawling for that url</forbidden>

    C) Domain: lang-8.com
    robots.txt:

    User-Agent: *
    Allow: /


    1) select * from html where url="http://lang-8.com/"

    <url error="Server returned HTTP response code: 500 for URL: http://lang-8.com/" execution-time="12" http-status-code="500" http-status-message="Internal Server Error"><![CDATA[http://lang-8.com/]]></url>
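The robots.txt distinction above can be checked offline. The following is a minimal sketch, assuming Python's standard urllib.robotparser applies the rules roughly the way the YQL crawler does (that equivalence is an assumption, and is_allowed is a hypothetical helper name). The rules are copied from the userscripts.org robots.txt quoted above.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Rules copied from the userscripts.org robots.txt quoted in the post.
USERSCRIPTS_ROBOTS = """\
User-agent: *
Disallow: /scripts/source/
Disallow: /scripts/version/
Disallow: /scripts/diff/
Disallow: /users
Disallow: /reviews/new
Disallow: /posts/preview
"""

for url in ("http://userscripts.org/", "http://userscripts.org/scripts/version/"):
    verdict = "allowed" if is_allowed(USERSCRIPTS_ROBOTS, url) else "disallowed"
    print(url, verdict)
```

Since the root URL is allowed by these rules, the 500 error for http://userscripts.org/ has to come from somewhere other than robots.txt.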
  • QUOTE (esquifit @ Jul 27 2009, 11:22 AM)
    The problem exists in other domains, also. [...]

    YQL servers send the Client-ip header along with other headers as part of the request, and it looks like the userscripts.org website doesn't like this header. A trivial curl request can reproduce the problem:

    CODE
    curl -H "Client-ip: 68.180.220.12" http://userscripts.org


    -- Nagesh
    • x
    • Jul 28, 2009
    QUOTE (Nagesh Susarla @ Jul 27 2009, 01:39 PM)
    YQL servers send the Client-ip header along with other headers as part of the request, and it looks like the userscripts.org website doesn't like this header. A trivial curl request can reproduce the problem:

    CODE
    curl -H "Client-ip: 68.180.220.12" http://userscripts.org


    -- Nagesh


    This seems to be the cause indeed. The same happens with lang-8.com. How did you know that?!
    Anyway, this behaviour must date back a couple of weeks because I'd never had this problem before. This effectively breaks what for me is the main point of using YQL at all. Is there some way to tell Yahoo not to send the problematic header?
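For checking several sites at once, the curl test can be repeated from a script. This is only a sketch: build_request, status_of and compare are hypothetical helper names, the IP address is the one from the curl example, and whether a given site still answers 500 is not something the code can promise.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def build_request(url: str, client_ip: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request, optionally carrying a Client-ip header."""
    request = urllib.request.Request(url)
    if client_ip is not None:
        request.add_header("Client-ip", client_ip)
    return request


def status_of(url: str, client_ip: Optional[str] = None, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, including error statuses such as 500."""
    try:
        with urllib.request.urlopen(build_request(url, client_ip), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def compare(url: str, client_ip: str = "68.180.220.12") -> Tuple[int, int]:
    """Return (status without the header, status with the header)."""
    return status_of(url), status_of(url, client_ip)
```

Calling compare("http://userscripts.org/") performs two real requests; a site affected by the problem would be expected to show a different status code in the second position.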
  • QUOTE (esquifit @ Jul 27 2009, 11:36 PM)
    This seems to be the cause indeed. The same happens with lang-8.com. How did you know that?!
    Anyway, this behaviour must date back a couple of weeks because I'd never had this problem before. This effectively breaks what for me is the main point of using YQL at all. Is there some way to tell Yahoo not to send the problematic header?


    This is a standard header we pass on all outgoing requests to let websites know the identity of the client. Currently, there is no way to tell the YQL servers not to send the header, but I believe you have two options:
    1) Work around the issue by proxying through an intermediary (the request goes to an intermediate site, which then forwards it to userscripts.org without the header).
    2) Bring this to the attention of the website and get it addressed. It looks like a bug if a header can cause it to return an internal error.


    -- Nagesh
    • x
    • Jul 28, 2009
    QUOTE (Nagesh Susarla @ Jul 28 2009, 08:17 AM)
    This is a standard header we pass on all outgoing requests to let websites know the identity of the client. Currently, there is no way to tell the YQL servers not to send the header, but I believe you have two options:
    1) Work around the issue by proxying through an intermediary (the request goes to an intermediate site, which then forwards it to userscripts.org without the header).
    2) Bring this to the attention of the website and get it addressed. It looks like a bug if a header can cause it to return an internal error.


    I had already thought of workaround 1, but this is somewhat beyond my knowledge and technical possibilities. The second option is not really an option, as "it is not scalable". I just cannot contact every server I want to read a page from and expect them to change their configuration just to please me, and that in a reasonable amount of time.

    Then I found that this could be related to a known problem in Rails:

    Client-IP Rails issue - Ruby Forum
    http://www.ruby-forum.com/topic/168531

    #322 Don't return 500 if Client-IP and X-Forwarded-For agree. - Ruby on Rails - rails
    https://rails.lighthouseapp.com/projects/8994/tickets/322

    In fact, the servers on which I observed the problem are:

    userscripts.org: nginx + mod_rails/mod_rack
    lang-8: Mongrel

    What puzzles me a bit is that everything suddenly stopped working a couple of weeks ago. Either Yahoo changed something in the way requests are sent, or everybody has updated to a version of Rails, or of a Rails-based app, that dislikes Client-ip (or some combination of Client-ip and other headers).
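Going only by the title of ticket #322, the check in question seems to compare Client-IP with X-Forwarded-For. The following is a rough model of such a check, not Rails' actual code: remote_ip and SpoofingSuspected are hypothetical names, the header keys are treated as exact dictionary keys, and the rule "raise only when the two headers disagree" is an assumption taken from the ticket title.

```python
from typing import Dict, Optional


class SpoofingSuspected(Exception):
    """Stand-in for the unhandled error that would surface as an HTTP 500."""


def remote_ip(headers: Dict[str, str]) -> Optional[str]:
    """Pick a client IP from the headers, refusing requests whose headers disagree.

    Assumption: a request is rejected only when both Client-Ip and
    X-Forwarded-For are present and Client-Ip is not among the forwarded IPs.
    """
    client_ip = headers.get("Client-Ip")
    forwarded = [
        ip.strip()
        for ip in headers.get("X-Forwarded-For", "").split(",")
        if ip.strip()
    ]
    if client_ip and forwarded and client_ip not in forwarded:
        raise SpoofingSuspected(
            f"Client-Ip={client_ip!r} not in X-Forwarded-For={forwarded!r}"
        )
    if client_ip:
        return client_ip
    return forwarded[0] if forwarded else None
```

Under this model, a Client-ip header set by YQL combined with an X-Forwarded-For header added by a front-end server (for example a reverse proxy in front of the application) would be enough to trigger the error, which would fit the observed behaviour.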
  • QUOTE (esquifit @ Jul 28 2009, 09:27 AM)
    Then I found that this could be related to a known problem in Rails [...]


    Thanks for finding this. We've been sending these headers for at least the past two releases, IIRC, to adhere to our standards; that is, approximately since the end of last month.
    The proxy approach (1) would be to host an intermediate script (PHP, for example) on a website; it takes the URL as a parameter and simply returns the curl output for that URL. The query would then refer to the URL of this hosted script and pass the target URL as its query parameter.
    Unfortunately, there's no other solution that comes to mind right now.
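As an illustration of the proxy approach, here is a minimal sketch in Python rather than PHP, using only the standard library. All names (target_from_path, build_upstream_request, ProxyHandler, serve), the url parameter name, and the port are assumptions made for the example. The key point is that the upstream request is built from scratch, so headers on the incoming request, such as Client-ip, are not passed on.

```python
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Optional
from urllib.parse import parse_qs, urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def target_from_path(path: str) -> Optional[str]:
    """Extract the ?url=... parameter; return None if missing or not http(s)."""
    values = parse_qs(urlsplit(path).query).get("url")
    if not values:
        return None
    target = values[0]
    if urlsplit(target).scheme not in ALLOWED_SCHEMES:
        return None
    return target


def build_upstream_request(target: str) -> urllib.request.Request:
    """Build a fresh request so that no incoming headers are forwarded."""
    return urllib.request.Request(target, headers={"User-Agent": "proxy-sketch"})


class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        target = target_from_path(self.path)
        if target is None:
            self.send_error(400, "expected ?url=<http or https URL>")
            return
        try:
            with urllib.request.urlopen(build_upstream_request(target), timeout=10) as resp:
                status = resp.status
                content_type = resp.headers.get("Content-Type", "text/html")
                body = resp.read()
        except urllib.error.HTTPError as err:
            # Pass upstream error pages through with their original status.
            status = err.code
            content_type = err.headers.get("Content-Type", "text/html")
            body = err.read()
        except OSError:
            # URLError, timeouts and other socket-level failures.
            self.send_error(502, "upstream fetch failed")
            return
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the proxy until interrupted."""
    ThreadingHTTPServer(("", port), ProxyHandler).serve_forever()
```

With this running at a hypothetical host such as proxy.example.com, the query would become select * from html where url="http://proxy.example.com/?url=http%3A%2F%2Fuserscripts.org%2F". A script like this will fetch any http(s) URL for anyone who can reach it, so a real deployment would probably want to restrict the allowed target hosts.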

    thanks,
    Nagesh
    • x
    • Jul 29, 2009
    QUOTE (Nagesh Susarla @ Jul 28 2009, 03:28 PM)
    We've been sending these headers for at least the past two releases, IIRC, to adhere to our standards; that is, approximately since the end of last month.

    Ah, that explains everything.

    QUOTE (Nagesh Susarla @ Jul 28 2009, 03:28 PM)
    The proxy approach (1) would be to have an intermediate php (for example) hosted on a website which takes the URL as the parameter and then just returns the curl output of that URL. In the query one would have to refer to this php hosted url and give it the query parameter to fetch the data.
    Unfortunately, there's no other solution that comes to mind right now.

    Thanks for the hints. The problem is that I don't own any web space where I can play with PHP or something similar. I used to use AppJet (server-side JavaScript) for this kind of thing, but they shut down the service on 1 July 2009. A paid hosting service would be too much for my modest needs. I'll find a way somehow. Thank you again for your help.