Workaround for illegal XML characters?

I'm having trouble parsing certain RSS feeds with YQL because of control characters in the feeds, e.g. vertical tab (VT) and group separator (GS). These characters are illegal in XML, so presumably they are causing the parser to barf. For example, this feed from craigslist: http://losangeles.craigslist.org/search/fu...&format=rss

Since the likelihood of craigslist properly sanitizing their feeds in the near future is slim, I'm hoping there's some way I can work around this with YQL. (A pre-filter of some kind? Or just a switch on the feed/rss table that allows it to ignore illegal characters?) The data is all there; it's just being ignored due to these chars.
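To make the pre-filter idea concrete, here is a minimal sketch of what such a filter could do outside YQL. It is not a YQL feature: it assumes you can fetch the raw feed text yourself, and `strip_illegal_xml_chars` plus the sample feed string are hypothetical names and data made up for illustration. The pattern removes every code point that falls outside the XML 1.0 `Char` production before the text reaches a parser.

```python
import re
import xml.etree.ElementTree as ET

# XML 1.0 allows only: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF.
# The negated character class below matches everything else (VT, GS, NUL, ...).
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text: str) -> str:
    """Return text with all characters that XML 1.0 forbids removed."""
    return _ILLEGAL_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # Made-up feed fragment containing a vertical tab (\x0b) and a group separator (\x1d).
    raw = "<rss><channel><item><title>pub\x0btable\x1d set</title></item></channel></rss>"
    clean = strip_illegal_xml_chars(raw)
    root = ET.fromstring(clean)  # parses once the control characters are gone
    print(root.find("./channel/item/title").text)
```

Dropping the characters outright is the simplest choice; replacing them with a space instead would keep adjacent words from running together.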

Thanks in advance...

4 Replies
  • Bump..
  • Hi Nathan,

    I don't quite see the problem: when you fetch data through YQL, the system tries to tidy up the content, so running the following query

    CODE
    select * from rss where url='http://losangeles.craigslist.org/search/fua?query=pub+table+&srchType=A&sort=date&format=rss'


    should give you clean data, I'd guess.

    Cheers,
    Francisco.


    QUOTE (Nathan Stretch @ Aug 25 2010, 12:55 PM)
    I'm having trouble parsing certain RSS feeds with YQL because of control characters in the feeds, e.g. vertical tab (VT) and group separator (GS). These characters are illegal in XML, so presumably they are causing the parser to barf. For example, this feed from craigslist: http://losangeles.craigslist.org/search/fu...&format=rss

    Since the likelihood of craigslist properly sanitizing their feeds in the near future is slim, I'm hoping there's some way I can work around this with YQL. (A pre-filter of some kind? Or just a switch on the feed/rss table that allows it to ignore illegal characters?) The data is all there; it's just being ignored due to these chars.

    Thanks in advance...
  • QUOTE (arcturus @ Oct 26 2010, 09:24 AM)
    Hi Nathan,

    I don't quite see the problem: when you fetch data through YQL, the system tries to tidy up the content, so running the following query

    CODE
    select * from rss where url='http://losangeles.craigslist.org/search/fua?query=pub+table+&srchType=A&sort=date&format=rss'


    should give you clean data, I'd guess.

    Cheers,
    Francisco.


    Hi Francisco,

    Thanks for your response. The example I posted works now because my original post was back in August, and the craigslist ad containing the illegal characters has since been removed. I don't have another example right now, but I'll post again when I find one or get a chance to make one. Basically, though, any feed containing ASCII control characters, such as the vertical tab, will break YQL's parsing.
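Until a live feed turns up, a synthetic test case is easy to build. The sketch below only illustrates the failure with Python's standard-library XML parser, not with YQL itself, so it is an assumption that YQL's parser is similarly strict; `parses_as_xml` and `FEED_TEMPLATE` are hypothetical names and the feed content is made up. It reports which separator characters a strict parser accepts inside an otherwise valid feed.

```python
import xml.etree.ElementTree as ET

# Made-up minimal feed; {sep} is replaced by the character under test.
FEED_TEMPLATE = (
    '<rss version="2.0"><channel><item>'
    "<title>pub{sep}table</title>"
    "</item></channel></rss>"
)


def parses_as_xml(document: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    candidates = [
        ("space", " "),
        ("tab", "\t"),
        ("vertical tab", "\x0b"),
        ("group separator", "\x1d"),
    ]
    for name, char in candidates:
        # Space and tab are legal XML characters; VT and GS are not.
        print(name, parses_as_xml(FEED_TEMPLATE.format(sep=char)))
```

Serving a file built this way from any web host should give a stable feed URL to test against.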