Yahoo! provides a large number of RSS feeds, most of which conform to the RSS 2.0 specification. These are valid XML files and can be parsed using any of a number of Python XML libraries, but the most convenient way to extract data from them is to use the Universal Feed Parser library by Mark Pilgrim. The Feed Parser provides a consistent API for different flavours of RSS and Atom and is a powerful tool for dealing with syndication formats.
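For comparison, here is a minimal sketch of the lower-level route using the standard library's xml.etree.ElementTree module. It assumes Python 3 and parses a small inline RSS 2.0 document; the sample XML and the function name item_titles are invented for illustration and are not taken from a real Yahoo! feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, standing in for a downloaded RSS 2.0 feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <item><title>First item</title></item>
    <item><title>Second item</title></item>
  </channel>
</rss>"""


def item_titles(rss_text):
    """Return the feed title and the list of item titles from RSS 2.0 text."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    feed_title = channel.findtext("title")
    titles = [item.findtext("title") for item in channel.findall("item")]
    return feed_title, titles


feed_title, titles = item_titles(SAMPLE_RSS)
print(feed_title)
for title in titles:
    print(title)
```

This handles plain RSS 2.0, but the element lookups would have to be adapted by hand for Atom or RSS 1.0, which use XML namespaces. That is the gap the Feed Parser fills.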
Here's how to access data from the Movies opening this week RSS feed using the Universal Feed Parser.
>>> news_rss_url = "http://rss.ent.yahoo.com/movies/thisweek.xml"
>>> import feedparser
>>> info = feedparser.parse(news_rss_url)
The info.feed object now contains information about the feed:
>>> info.feed.title
u'Movies Opening This Week'
>>> info.feed.link
u'http://movies.yahoo.com'
You can iterate over the info.entries collection to process items from the feed:
>>> for entry in info.entries:
...     print entry.title
Poster Boy opens TBA 2004 (limited)
Monster House opens July 21st, 2006 (wide)
Lady in the Water opens July 21st, 2006 (wide)
...
You can extract further information from individual entry objects:
>>> entry = info.entries[0]
>>> entry.title
u'Poster Boy opens TBA 2004 (limited)'
>>> entry.summary
u"For his re-election campaign, a right-wing senator enlists his gay son's help against his will."
>>> entry.modified
u'Thu, 20 Jul 2006 07:12:18 GMT'
>>> entry.modified_parsed
(2006, 7, 20, 7, 12, 18, 3, 201, 0)
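Not every feed supplies every field, and attribute access on a missing field raises AttributeError. Feed Parser entries also support dictionary-style access, so a defensive sketch can use .get() with defaults. The helper name entry_record and the default values are assumptions for illustration, and the example (written for Python 3) runs on a plain dict standing in for a real entry so that it needs no network access.

```python
def entry_record(entry):
    """Copy the fields of interest out of a feed entry, tolerating gaps.

    `entry` is anything with a dict-style .get(), such as a Feed Parser
    entry or, as in this example, a plain dict.
    """
    return {
        "title": entry.get("title", "(untitled)"),
        "summary": entry.get("summary", ""),
        "link": entry.get("link"),  # None when the entry has no link
    }


# A plain dict standing in for info.entries[0]; it has no "link" key.
sample_entry = {
    "title": "Poster Boy opens TBA 2004 (limited)",
    "summary": "For his re-election campaign, a right-wing senator "
               "enlists his gay son's help against his will.",
}
record = entry_record(sample_entry)
print(record["title"])
print(record["link"])
```

Collecting the fields into a plain dict up front keeps the rest of the program from depending on which optional elements a particular feed happens to include.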
entry.modified_parsed provides access to the modified date parsed into a Python time-tuple, described in the documentation for the time module. To convert it into a standard Python datetime object, first convert it into the number of seconds since the epoch (1 January 1970) and then convert that number to a datetime:
>>> import datetime, time
>>> entry.modified_parsed
(2006, 7, 20, 7, 12, 18, 3, 201, 0)
>>> datetime.datetime.fromtimestamp(time.mktime(entry.modified_parsed))
datetime.datetime(2006, 7, 20, 8, 12, 18)
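Note that there is an hour's difference between the datetime object above and the original time-tuple. This is due to daylight saving time being taken into account during the conversion: time.mktime() interprets the tuple as local time, and the final element (0) tells it that DST is not in effect, so in a time zone that observes DST in July the result is shifted by an hour. To avoid this, set the last element of the time-tuple to -1, which tells mktime() to work out for itself whether DST applies: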
>>> entry.modified_parsed
(2006, 7, 20, 7, 12, 18, 3, 201, 0)
>>> timetuple = list(entry.modified_parsed[0:8]) + [-1]
>>> timetuple
[2006, 7, 20, 7, 12, 18, 3, 201, -1]
>>> datetime.datetime.fromtimestamp(time.mktime(timetuple))
datetime.datetime(2006, 7, 20, 7, 12, 18)
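Because the feed's timestamp is in GMT, an alternative that avoids the local time zone altogether is calendar.timegm(), the UTC counterpart of time.mktime(). The following is a sketch assuming Python 3; the function name utc_datetime is invented for illustration, and the literal tuple stands in for entry.modified_parsed.

```python
import calendar
import datetime


def utc_datetime(timetuple):
    """Convert a UTC time-tuple to a timezone-aware datetime in UTC."""
    seconds = calendar.timegm(timetuple)  # treats the tuple as UTC, no DST
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)


# Stand-in for entry.modified_parsed from the session above.
modified_parsed = (2006, 7, 20, 7, 12, 18, 3, 201, 0)
print(utc_datetime(modified_parsed).isoformat())
```

The result carries an explicit UTC offset, so it should come out the same whatever time zone the machine is configured for.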
Related information on the web