interesting news article/blog post scraping problem

i need to scrape the text of blog posts to build a summary description of each post, similar to what techmeme.com does. that's not a problem when it's one or a handful of blogs. however, the set of blogs i might need to scrape is variable and effectively unlimited, so i can't hand-write a rule for each one. how would you go about doing this?

i've used the html agility pack and yql in the past, but there's nothing built into either of those solutions to handle this requirement.

one thought i had was to search for divs whose ids or other attributes have values like content, post, article, etc. and see how that worked, though i'm not really leaning in this direction. another idea was to search for the biggest text node in the html document and assume that's the node i want, but that could lead to some false positives. the final idea was to create a crowdsourced data repository on google apps that would let the community manage (read: create, update, delete) the xpath mappings for most of the popular news/blog platforms. you could then query this list by domain or blog platform type and get the requisite xpath, but this seems like a hella undertaking.
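
to make the first two ideas concrete, here's a rough python sketch that combines them: it collects the text sitting directly inside each container element, then picks the one with the most text, giving extra weight to elements whose id or class contains a hint word. the hint words, the list of container tags, the 2x weight, and all the names (`HINTS`, `Candidate`, `BodyFinder`, `extract_main_text`) are my own assumptions for illustration and would need tuning against real blogs. it only uses the standard library's `html.parser`, not the html agility pack or yql.

```python
from html.parser import HTMLParser

# Assumed values for illustration; tune them against the blogs you actually scrape.
HINTS = ("content", "post", "article", "entry")
CONTAINERS = {"div", "article", "section", "main"}
IGNORED = {"script", "style"}


class Candidate:
    """One container element plus the text found directly inside it."""

    def __init__(self, tag, attrs):
        attrs = dict(attrs)
        self.tag = tag
        # Combine id and class into one lowercase string for hint matching.
        parts = (attrs.get("id"), attrs.get("class"))
        self.label = " ".join(p for p in parts if p).lower()
        self.chunks = []

    @property
    def text(self):
        # Join the raw pieces and collapse runs of whitespace.
        return " ".join("".join(self.chunks).split())

    def score(self):
        # Text length, doubled when the id/class looks like a post body.
        bonus = 2.0 if any(hint in self.label for hint in HINTS) else 1.0
        return len(self.text) * bonus


class BodyFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # currently open container elements
        self.candidates = []   # every container seen so far
        self.ignored_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self.ignored_depth += 1
        elif tag in CONTAINERS:
            node = Candidate(tag, attrs)
            self.stack.append(node)
            self.candidates.append(node)

    def handle_endtag(self, tag):
        if tag in IGNORED and self.ignored_depth:
            self.ignored_depth -= 1
        elif tag in CONTAINERS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Credit text to the innermost open container only, so outer
        # wrappers don't win just by enclosing everything.
        if self.stack and not self.ignored_depth:
            self.stack[-1].chunks.append(data)


def extract_main_text(html):
    """Return the text of the best-scoring container, or '' if none is found."""
    finder = BodyFinder()
    finder.feed(html)
    finder.close()
    best = max(finder.candidates, key=Candidate.score, default=None)
    return best.text if best else ""
```

calling `extract_main_text(page_html)` on a fetched page returns the guessed post body as a single string, which could then be trimmed into a summary. it will still pick the wrong node on some layouts (long comment threads, big sidebars), which is the false-positive problem mentioned above.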

of course, i know some of you have ideas that will work better than any of my harebrained ideas.

what are your thoughts?
