Bug report concerning scraping meta tags with "property" attribute

Hey,

I wasn't able to find a bug tracker for the YQL project so I'm hoping this is the right channel to report a bug. Please let me know if not.

I'm trying to scrape OpenGraph meta tags from websites.
Those meta tags look like this:

<meta property="og:title" content="lorem ipsum">

And my YQL query:
SELECT * FROM html WHERE url = 'http://domain.tld/foo' AND xpath='descendant-or-self::meta'

The problem is the response. It correctly contains the "content" attribute and its value but it doesn't contain the "property" attribute.
I suppose this is a bug.

Here's the query in action:
http://y.ahoo.it/mh9Vl

Cheers,
Christopher

4 Replies
  • Hi Christopher,

    I've created a ticket (4940602) for tracking purposes. Taking a look.

    Thanks,

    Jan

  • Hey!

    Sorry to press, but is there any news on this?
    Is there perhaps a workaround?

    Thanks.

  • Hi Christopher,

    YQL still seems to be having trouble with the property attribute for HTML. However, I'm replying here to show you a workaround that I've used in the past and that might be useful to you in the meantime.

    The workaround is to use XPath to query for the OG data. To do this, you can use the xml table. Note that HTML != XML, so the page that you're consuming must first be converted to valid XML. There are tools online for making XML out of HTML documents; the query below uses the W3C's tidy tool.

    select *
    from xml
    where url in (
        select content
        from uritemplate
        where template="http://services.w3.org/tidy/tidy?docAddr={addr}&forceXML=on"
          and addr="http://ogp.me"
    )
    and itemPath="//*[local-name()='meta' and starts-with(@property, 'og:')]"

    (Try this in the console)


    To make this useful, I set addr=@uri in the query, set up a query alias, and pass uri=<whatever> in the YQL URL. This gives cleaner YQL URLs (and lets me change the query if it breaks!), e.g. http://query.yahooapis.com/v1/public/yql/peter/og?uri=http://ogp.me&format=json
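If you would rather not set up an alias, the same query can be sent directly as a URL-encoded q parameter. Below is a minimal Python sketch of that, assuming the public endpoint is the alias URL above without the /peter/og part; the function names are made up for illustration, and the code only builds the URL without sending any request.

```python
from urllib.parse import urlencode

# Assumed public endpoint: the alias URL from the post without the alias path.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_og_query(page_url: str) -> str:
    """Return the workaround YQL statement for a single page URL (hypothetical helper)."""
    # Percent-encode double quotes so the address cannot end the YQL string literal early.
    safe_url = page_url.replace('"', "%22")
    return (
        "select * from xml where url in ("
        "select content from uritemplate "
        # {addr} stays literal here; it is the uritemplate placeholder, not Python formatting.
        'where template="http://services.w3.org/tidy/tidy?docAddr={addr}&forceXML=on" '
        f'and addr="{safe_url}"'
        ") and itemPath=\"//*[local-name()='meta' and starts-with(@property, 'og:')]\""
    )


def build_request_url(page_url: str) -> str:
    """Return the full request URL with the statement URL-encoded in q (hypothetical helper)."""
    params = {"q": build_og_query(page_url), "format": "json"}
    return YQL_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_request_url("http://ogp.me"))
```

The alias approach is still nicer for anything long-lived, since the query can be changed on the server side without touching the callers.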




  • Please add compat="html5" to your query and you should see the property attribute.

    Here are the relevant links describing the update:

    http://www.yqlblog.net/blog/2012/01/17/recent-enhancement-to-the-html-table/

    http://developer.yahoo.com/yql/guide/yql-select-xpath.html
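Applied to the query from the original post, the suggestion might look like the sketch below. It is Python that only assembles the statement text; the helper name is hypothetical, and placing compat="html5" as one more AND condition is an assumption based on the reply above.

```python
# XPath from the original post; kept fixed so no quoting issues can arise.
META_XPATH = "descendant-or-self::meta"


def build_html5_query(page_url: str) -> str:
    """Return the original html-table query with compat="html5" appended (hypothetical helper)."""
    # Percent-encode single quotes so the URL cannot end the YQL string literal early.
    safe_url = page_url.replace("'", "%27")
    return (
        f"SELECT * FROM html WHERE url = '{safe_url}' "
        f"AND xpath='{META_XPATH}' "
        'AND compat="html5"'
    )


if __name__ == "__main__":
    print(build_html5_query("http://domain.tld/foo"))
```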

