"select * from html where xpath=" fails without apparent reason

The following query

select * from html where url="http://wait-till-i.com" and xpath='//a[@rel="tag"]'

produces the expected result (about 30 links). Try it:
http://query.yahooapis.com/v1/public/yql?q=select * from html where url="http://wait-till-i.com" and xpath='//a[@rel="tag"]' &format=xml

However, a similar query for other domains, for example

select * from html where url="http://userscripts.org/" and xpath='//a[@rel="nofollow"]'

returns an empty result set, which is not what I would expect. Try it:
http://query.yahooapis.com/v1/public/yql?q=select * from html where url="http://userscripts.org/" and xpath='//a[@rel="nofollow"]' &format=xml
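
To reproduce both requests from a script instead of pasting the links, something like the following Python sketch should do. It assumes nothing beyond the public endpoint shown in the links above; the helper name build_yql_url is made up for illustration, and the only thing it adds is percent-encoding of the spaces and quotes in the query.

```python
from urllib.parse import urlencode

# Public YQL endpoint, taken from the links in this post.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(query, fmt="xml"):
    """Return a request URL with the YQL statement percent-encoded.

    build_yql_url is a hypothetical helper, not part of any Yahoo library.
    """
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


# The query that returns about 30 links.
working = build_yql_url(
    'select * from html where url="http://wait-till-i.com" '
    "and xpath='//a[@rel=\"tag\"]'"
)

# The query that comes back empty.
failing = build_yql_url(
    'select * from html where url="http://userscripts.org/" '
    "and xpath='//a[@rel=\"nofollow\"]'"
)

print(working)
print(failing)
```

Either URL can then be fetched with any HTTP client; the sketch stops at building the URL so that it runs without network access.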

I was experimenting with "select * from html" and different XPath expressions on several pages:

http://wiki.greasespot.net/Main_Page
http://www.greasespot.net/
http://userscripts.org/users/2308
http://userscripts.org/

In all cases I got empty result sets, although testing the same XPath expressions directly from JavaScript always returned the expected results.

Furthermore:

http://userscripts.org/
* is valid xhtml
* is served as text/html;utf-8, and it is indeed valid UTF-8

whereas http://wait-till-i.com
* is invalid html 4.01 Strict
* is also served as text/html;utf-8, but it contains non-UTF-8 characters.

Am I missing something obvious, or is this a bug?

by
  • x
  • Dec 13, 2008
8 Replies
  • Try removing the trailing slash from the second domain name.

    QUOTE (esquifit @ Dec 13 2008, 07:29 AM)
    select * from html where url="http://userscripts.org/" and xpath='//a[@rel="nofollow"]'
  • I'm having the same problem. I recently tried a simple query like this one: select * from html where url="http://qvister.se" and xpath='//a[@rel="external"]', and it returned zero results. Am I doing anything wrong?
  • Same for me. I tried: select * from html where url="http://www.luissquall.com" and xpath='//div[@class="post"]'
  • Ditto. Maybe they're doing some work on it?
  • QUOTE (snpcrcklep0p @ Dec 15 2008, 09:35 AM)
    Ditto. Maybe they're doing some work on it?


    We're working on making this more robust.

    Jonathan
  • It doesn't seem to work with XHTML, only with plain ol' HTML?!
    Go to the YQL console and try for example:

    select * from html where url='http://www.nyfikon.se/index.html' and xpath='//div'

    vs.

    select * from html where url='http://www.nyfikon.se/index2.html' and xpath='//div'

    Please fix this! :)
    /Jacob
    • x
    • Dec 17, 2008
    QUOTE (litenjacob @ Dec 16 2008, 12:54 PM)
    It doesn't seem to work with XHTML, only with plain ol' HTML?!


    Seems to be the case indeed. As an interim solution, since XHTML is well-formed XML, one can do something like:

    select * from xml where url='http://www.nyfikon.se/index.html' and itemPath='html.body.div'

    This works. Of course the path notation is no substitute for XPath, but it is better than nothing. In particular, you must specify the whole path beginning with the root node, which is not always possible.
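
    The premise that XHTML can be handled as plain XML is easy to check locally. The sketch below uses Python's standard xml.etree.ElementTree and its limited XPath subset, which does allow matching at any depth without spelling out the whole path. SAMPLE_XHTML is a made-up document for illustration, not the markup of any site mentioned in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for an XHTML page; not the real content of any
# site mentioned in this thread.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Sample</title></head>
  <body>
    <div id="header"><a href="/tags/yql" rel="tag">yql</a></div>
    <div id="content">
      <div class="post"><a href="http://example.com/" rel="nofollow">link</a></div>
    </div>
  </body>
</html>"""

# Prefix-to-URI mapping so that paths can name elements in the XHTML namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Parsing succeeds only because the document is well-formed XML.
root = ET.fromstring(SAMPLE_XHTML)

# Rough equivalent of the XPath //div: every div at any depth.
all_divs = root.findall(".//x:div", XHTML_NS)

# Rough equivalent of //a[@rel="nofollow"].
nofollow = root.findall(".//x:a[@rel='nofollow']", XHTML_NS)

# ElementTree is namespace-aware, so an unqualified name matches nothing here.
unqualified = root.findall(".//div")

print(len(all_divs), len(nofollow), len(unqualified))  # prints: 3 1 0
```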
    • x
    • Dec 22, 2008
    Hey, the YQL guys are working hard! :) Thank you and keep up the good work!
