Bug report concerning scraping meta tags with "property" attribute

Hey,

I wasn't able to find a bug tracker for the YQL project so I'm hoping this is the right channel to report a bug. Please let me know if not.

I'm trying to scrape OpenGraph meta tags from websites.
Those meta tags look like this:

<meta property="og:title" content="lorem ipsum">

And my YQL query:
SELECT * FROM html WHERE url = 'http://domain.tld/foo' AND xpath='descendant-or-self::meta'

The problem is the response. It correctly contains the "content" attribute and its value but it doesn't contain the "property" attribute.
I suppose this is a bug.

Here's the query in action:
http://y.ahoo.it/mh9Vl

Cheers,
Christopher

4 Replies
  • Hi Christopher,

    I've created a ticket (4940602) for tracking purposes. Taking a look.

    Thanks,

    Jan

    (In reply to Christopher, 25 Oct 2011 3:17 AM)
  • Hey!

    Sorry for pressing, but is there any news concerning this?
    Is there perhaps a workaround?

    Thanks.

    (In reply to Jan Pipes, 26 Oct 2011 5:40 PM)
  • Hi Christopher,

    YQL still seems to be having trouble with the property attribute for HTML. However, I'm replying here to show you a workaround that I've used in the past and that might be useful to you in the meantime.

    The workaround is to use XPath to query for the OG data. To do this, you can use the xml table. Note that HTML != XML, so the page that you're consuming must first be converted to valid XML. There are tools online for making XML out of HTML documents; the query below uses the W3C's tidy tool.

    select *
    from xml
    where url in (
        select content
        from uritemplate
        where template="http://services.w3.org/tidy/tidy?docAddr={addr}&forceXML=on"
          and addr="http://ogp.me"
    )
    and itemPath="//*[local-name()='meta' and starts-with(@property, 'og:')]"

    (Try this in the console: http://y.ahoo.it/kxhdk)

    To make this useful, I make the query have addr=@uri, set up a query alias (http://developer.yahoo.com/yql/guide/yql_url.html#yql-query-aliases) and send a uri=<whatever> in the YQL URL.
    This allows cleaner YQL URLs (and lets me change the query if it's broken!), e.g. http://query.yahooapis.com/v1/public/yql/peter/og?uri=http://ogp.me&format=json

    (In reply to Christopher, 30 Nov 2011 1:43 AM)
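    For anyone unsure what the itemPath expression in the workaround selects, here is a minimal Python sketch of the same filter, applied locally to already-tidied XHTML. It is not part of the original replies and does not call YQL or the tidy service. The sample document and the helper names (SAMPLE, local_name, og_tags) are made up for illustration. The standard-library ElementTree does not support local-name() or starts-with(), so the two conditions are checked in plain Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical tidied XHTML, standing in for what an HTML-to-XML tool might return.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta property="og:title" content="lorem ipsum" />
    <meta property="og:type" content="website" />
    <meta name="description" content="not OpenGraph" />
  </head>
</html>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def og_tags(xml_text):
    """Map property -> content for meta elements whose property starts with 'og:'.

    Mirrors: //*[local-name()='meta' and starts-with(@property, 'og:')]
    """
    root = ET.fromstring(xml_text)
    return {
        el.get("property"): el.get("content")
        for el in root.iter()
        if local_name(el.tag) == "meta"
        and el.get("property", "").startswith("og:")
    }


print(og_tags(SAMPLE))
```

    Matching on the local name rather than the full tag is what lets the filter work regardless of the XHTML namespace that the conversion step adds.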
  • Please add compat="html5" to your query and you should see the property attribute.

    Here are the relevant links showing the update:

    http://www.yqlblog.net/blog/2012/01/17/recent-enhancement-to-the-html-table/

    http://developer.yahoo.com/yql/guide/yql-select-xpath.html
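    As a rough illustration of that last suggestion, the Python sketch below builds a public YQL URL for the original query with the compat condition appended. It only assembles the URL and sends no request. The endpoint host and format=json come from the alias URL earlier in the thread; passing the statement in a q parameter, the exact placement of the compat clause, and the helper name build_yql_url are assumptions for illustration.

```python
from urllib.parse import urlencode

# Public endpoint as it appears in the alias URL quoted earlier in the thread.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(page_url):
    """Return a YQL URL that selects meta elements with compat="html5" set.

    Assumption: the statement is passed in the 'q' query parameter.
    """
    query = (
        "SELECT * FROM html "
        f"WHERE url = '{page_url}' "
        "AND xpath = 'descendant-or-self::meta' "
        'AND compat = "html5"'
    )
    # urlencode percent-encodes quotes and turns spaces into '+'.
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": "json"})


print(build_yql_url("http://domain.tld/foo"))
```

    Letting urlencode handle the escaping avoids hand-quoting the nested single and double quotes in the statement.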
