YQL from html doesn't fetch custom attributes


I'd like to scrape some HTML with custom attributes (<div my_data="blabla" />), but it looks like YQL only keeps standard HTML attributes.
Is there a way to avoid that?

Thank you,


3 Replies
  • This is most likely due to HTML Tidy (http://tidy.sourceforge.net/) being applied to the returned results from the HTML scrape. This is required because there are a lot of malformed HTML documents out there and the YQL parser needs to be able to build a well-formed XML document out of the results to allow you to pull data. I ran into this same issue when trying to extract <meta> tags with custom attributes.

    The one method I found for bypassing this is to make a custom GET request (in the server-side JavaScript layer) to fetch the entire HTML document, then manually parse the data that way.

    - Jon
  • HTML tidy now seems to be allowing custom attributes, but YQL is still not returning custom attributes. Please help.

  • By setting drop-proprietary-attributes = false, I believe.

  • You can get custom attributes by using compat="html5" with the html table.

    e.g. select * from html where url = "http://mydomain.com" and compat="html5"

    Thanks -Paul YQL Team
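
For anyone taking the manual route Jon describes above (fetch the whole document yourself, then parse it), here is a minimal sketch of the parsing step in Python using only the standard library. It is not YQL's server-side JavaScript layer, and the names AttributeCollector, extract_attribute and fetch_html are made up for illustration. The point is only that a lenient HTML parser, unlike a Tidy pass that drops non-standard attributes, keeps something like my_data.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class AttributeCollector(HTMLParser):
    """Collects the values of one attribute from every tag that carries it."""

    def __init__(self, attr_name):
        super().__init__()
        # HTMLParser lowercases attribute names, so compare in lowercase.
        self.attr_name = attr_name.lower()
        self.found = []  # (tag, value) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <div my_data="x" />.
        for name, value in attrs:
            if name == self.attr_name:
                self.found.append((tag, value))


def extract_attribute(html_text, attr_name):
    """Return a list of (tag, value) for every occurrence of attr_name."""
    parser = AttributeCollector(attr_name)
    parser.feed(html_text)
    parser.close()
    return parser.found


def fetch_html(url):
    """Hypothetical helper: download a page and decode it to text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    sample = '<div my_data="blabla" />'
    print(extract_attribute(sample, "my_data"))
```

With the sample string from the question, extract_attribute returns [('div', 'blabla')]. For a live page you would pass the result of fetch_html(url) instead of the sample string.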

