"robots.txt disallows request"

Attempting this query:
QUOTE
select * from html where url="http://www.justgiving.com/meningitisukswim" and xpath='//span[@class="by-time"][0]'


I get the response:
QUOTE
robots.txt for that domain disallows crawling for that url


Yet if I check robots.txt at the domain directly, I get:
QUOTE
User-agent: *
Disallow: /pfp/

Which to me shouldn't ban the query.
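The rules quoted above can be checked offline. The following is a minimal sketch, assuming Python's standard `urllib.robotparser` as a stand-in for whatever parser YQL actually uses (which this thread does not reveal); the `/pfp/anything` path is a made-up example of a blocked URL.

```python
import urllib.robotparser

# The robots.txt rules exactly as quoted in the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pfp/
"""


def is_allowed(url: str, agent: str = "*") -> bool:
    """Return True if the quoted rules let `agent` fetch `url`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# The page from the query is outside /pfp/, so the rules permit it.
print(is_allowed("http://www.justgiving.com/meningitisukswim"))  # True
# Hypothetical URL under /pfp/, shown only for contrast.
print(is_allowed("http://www.justgiving.com/pfp/anything"))      # False
```

Under these rules alone the query URL is permitted, which supports the reading that something other than the `Disallow` line is causing the error.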

I have tried this several times over a couple of days, so I don't think Y! is using a stale cached copy of robots.txt.
Does anyone have any ideas?
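A side note on the query itself: in standard XPath 1.0, positions count from 1, so a `[0]` predicate never matches anything in a strict implementation. The sketch below uses Python's `xml.etree.ElementTree`, which supports only a subset of XPath and is not what YQL runs; the sample markup is a simplified stand-in for the page, with ids omitted.

```python
import xml.etree.ElementTree as ET

# Simplified sample markup (an assumption, not the live page).
SAMPLE = """
<results>
  <span class="by-time">Donation by <strong>vicky smith</strong></span>
  <span class="by-time">Donation by <strong>Simon Thompson</strong></span>
</results>
"""


def donor_at(position: int) -> str:
    """Return the donor name in the span at the given 1-based position."""
    root = ET.fromstring(SAMPLE)
    span = root.find(f"span[{position}]")
    return span.find("strong").text


print(donor_at(1))  # vicky smith
print(donor_at(2))  # Simon Thompson
```

To select only the first donation, the predicate would therefore be `[1]` rather than `[0]`.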

6 Replies
  • It's probably because the site is slow and is returning errors. Sometimes the query gives proper results:

    CODE
    <results>
    <span class="by-time" id="ctl00_cphMain__donationTable__donationTableRepeater_ctl00__name">Donation by <strong>vicky smith</strong>
    </span>
    <span class="by-time" id="ctl00_cphMain__donationTable__donationTableRepeater_ctl01__name">Donation by <strong>Simon Thompson</strong>
    </span>
    </results>


    By the way, XPath positions start at 1, not 0, so the first match is [1]; in this case, though, it has no effect on the result payload.
  • QUOTE
    By the way, the first-child occurrence in xpath starts with 1 instead of 0


    Thanks a bunch! Access seems to be working again now.
  • QUOTE (alexanderwhowell @ Jun 22 2009, 09:35 AM) <{POST_SNAPBACK}>
    Thanks a bunch! Access seems to be working again now.


    That doesn't seem to be the end of the story. While the console query shows that publiclyCallable is set to true, the REST query has "robots.txt for that domain disallows crawling for that url".
  • Is this a 'problem' with the server (justgiving) or YQL?
    I notice that http://www.justgiving.com/robots.txt is a 403 to http://original.justgiving.com - perhaps the redirect is confusing things?
  • QUOTE (hapdaniel @ Jun 25 2009, 03:58 AM) <{POST_SNAPBACK}>
    That doesn't seem to be the end of the story. While the console query shows that publiclyCallable is set to true, the REST query has "robots.txt for that domain disallows crawling for that url".


    These are two different things: publiclyCallable just means that you can send this request to YQL's public (non-OAuth) endpoint. The robots.txt check happens AFTER YQL starts running the query.
  • QUOTE (Jonathan @ Jun 26 2009, 06:52 PM) <{POST_SNAPBACK}>
    These are two different things: publiclyCallable just means that you can send this request to YQL's public (non-OAuth) endpoint. The robots.txt check happens AFTER YQL starts running the query.


    Jonathan,
    Thanks for the explanation of publiclyCallable. I'm still not clear as to why the console query works but the REST query doesn't.
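
On the 403 observation earlier in the thread: some robots.txt clients treat a 401 or 403 on robots.txt itself as "everything is disallowed". Python's `urllib.robotparser` behaves this way, and the sketch below simulates it without touching the network. Whether YQL applies the same rule is an assumption, not something this thread confirms; the `example.invalid` URLs are placeholders.

```python
import io
import urllib.error
import urllib.robotparser
from unittest import mock

# Placeholder URLs for the simulation; no real request is made.
ROBOTS_URL = "http://example.invalid/robots.txt"
PAGE_URL = "http://example.invalid/meningitisukswim"


def allowed_when_robots_status_is(status: int) -> bool:
    """Report can_fetch() after robots.txt itself fails with `status`."""
    error = urllib.error.HTTPError(
        ROBOTS_URL, status, "simulated", None, io.BytesIO()
    )
    parser = urllib.robotparser.RobotFileParser(ROBOTS_URL)
    # Swap out urlopen so that read() sees the simulated HTTP error.
    with mock.patch("urllib.request.urlopen", side_effect=error):
        parser.read()
    return parser.can_fetch("*", PAGE_URL)


print(allowed_when_robots_status_is(403))
print(allowed_when_robots_status_is(404))
```

With this parser a 403 blocks every URL while a 404 allows every URL, so an intermittent 403 on robots.txt could produce the "disallows crawling" message even though the rules themselves permit the page.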