Parse XML using Python

Most of the YDN APIs can provide their output in JSON format, which side-steps the problem of having to parse data out of them; the data arrives already converted into a useful data structure. If the API you are using does not yet offer JSON output, you can take advantage of Python's excellent XML support.

Using minidom

The most widely understood API for manipulating XML is the W3C-approved DOM. Python ships with both a full DOM implementation and xml.dom.minidom, a more lightweight implementation. minidom is more than capable of dealing with the XML returned by Yahoo!'s APIs.

As an example, let's use minidom to extract weather information for a specific zip code using the Weather API.

import urllib
from xml.dom import minidom

WEATHER_URL = 'http://xml.weather.yahoo.com/forecastrss?p=%s'
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

def weather_for_zip(zip_code):
    url = WEATHER_URL % zip_code
    dom = minidom.parse(urllib.urlopen(url))

minidom.parse() accepts either a filename or a file-like object. Here we pass it the file-like object returned by urllib.urlopen().

    forecasts = []
    for node in dom.getElementsByTagNameNS(WEATHER_NS, 'forecast'):
        forecasts.append({
            'date': node.getAttribute('date'),
            'low': node.getAttribute('low'),
            'high': node.getAttribute('high'),
            'condition': node.getAttribute('text')
        })

getElementsByTagNameNS() is a namespace-aware method that takes two arguments: the first is the namespace URI and the second is the local name of the tag. We could use getElementsByTagName('yweather:forecast') here instead, but using namespaces is good practice as it makes our code more robust: the yweather prefix is only an alias chosen by the document, whereas the namespace URI is what actually identifies the element.

    ycondition = dom.getElementsByTagNameNS(WEATHER_NS, 'condition')[0]
    return {
        'current_condition': ycondition.getAttribute('text'),
        'current_temp': ycondition.getAttribute('temp'),
        'forecasts': forecasts,
        'title': dom.getElementsByTagName('title')[0].firstChild.data
    }

The last line retrieves the title by accessing firstChild.data of the first title element in the document.

Here's the code in full:

import urllib
from xml.dom import minidom

WEATHER_URL = 'http://xml.weather.yahoo.com/forecastrss?p=%s'
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

def weather_for_zip(zip_code):
    url = WEATHER_URL % zip_code
    dom = minidom.parse(urllib.urlopen(url))
    forecasts = []
    for node in dom.getElementsByTagNameNS(WEATHER_NS, 'forecast'):
        forecasts.append({
            'date': node.getAttribute('date'),
            'low': node.getAttribute('low'),
            'high': node.getAttribute('high'),
            'condition': node.getAttribute('text')
        })
    ycondition = dom.getElementsByTagNameNS(WEATHER_NS, 'condition')[0]
    return {
        'current_condition': ycondition.getAttribute('text'),
        'current_temp': ycondition.getAttribute('temp'),
        'forecasts': forecasts,
        'title': dom.getElementsByTagName('title')[0].firstChild.data
    }

Let's try it out for Lawrence, KS (zip code 66044):

>>> from pprint import pprint
>>> pprint(weather_for_zip(66044))
{'current_condition': u'Fair',
 'current_temp': u'85',
 'forecasts': [{'condition': u'Mostly Sunny',
                'date': u'20 Jul 2006',
                'high': u'103',
                'low': u'75'},
               {'condition': u'Scattered Thunderstorms',
                'date': u'21 Jul 2006',
                'high': u'82',
                'low': u'61'}],
 'title': u'Yahoo! Weather - Lawrence, KS'}
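
The minidom calls used above can also be tried without a network connection. The following is a minimal sketch, assuming Python 3 and a small hand-written document: SAMPLE, forecast_counts() and channel_title() are hypothetical names introduced for illustration, and the XML is not a real API response. It shows why the namespace-aware lookup is the more robust choice: it still matches when the document picks a different prefix for the same namespace URI, while the lookup by prefixed name does not.

```python
from xml.dom import minidom

# Namespace URI copied from WEATHER_NS in the article.
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

# Hypothetical, hand-written sample; the prefix is filled in per call.
SAMPLE = """<rss version="2.0" xmlns:{prefix}="{ns}">
  <channel>
    <title>Sample Weather - Lawrence, KS</title>
    <item>
      <{prefix}:condition text="Fair" temp="85"/>
      <{prefix}:forecast date="20 Jul 2006" low="75" high="103" text="Mostly Sunny"/>
    </item>
  </channel>
</rss>"""


def forecast_counts(prefix):
    """Return (matches by namespace URI, matches by the literal 'yweather:' name)."""
    dom = minidom.parseString(SAMPLE.format(prefix=prefix, ns=WEATHER_NS))
    by_namespace = dom.getElementsByTagNameNS(WEATHER_NS, 'forecast')
    by_prefix = dom.getElementsByTagName('yweather:forecast')
    return len(by_namespace), len(by_prefix)


def channel_title(prefix='yweather'):
    """Return the text of the first <title> element via firstChild.data."""
    dom = minidom.parseString(SAMPLE.format(prefix=prefix, ns=WEATHER_NS))
    return dom.getElementsByTagName('title')[0].firstChild.data


print(forecast_counts('yweather'))  # both lookups find the element
print(forecast_counts('w'))         # only the namespace-aware lookup does
print(channel_title())
```

With the prefix changed to w, the namespace-aware lookup still finds the forecast, while the lookup for the literal name 'yweather:forecast' finds nothing.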

Using ElementTree

The DOM API is a useful standard, but it's not the most Pythonic way of dealing with XML. There have been a number of attempts at making XML handling in Python feel more natural; the most successful is probably ElementTree by Fredrik Lundh, which has several different implementations and is scheduled for inclusion in the forthcoming Python 2.5.

Let's try the weather example again, this time using the ElementTree API:

import urllib
from elementtree.ElementTree import parse

WEATHER_URL = 'http://xml.weather.yahoo.com/forecastrss?p=%s'
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

def weather_for_zip(zip_code):
    url = WEATHER_URL % zip_code
    rss = parse(urllib.urlopen(url)).getroot()

The ElementTree parse function takes a file-like object and returns an ElementTree object, representing the overall XML file. The getroot() method of that object retrieves the root Element within that tree; in this case, the <rss> element.

    forecasts = []
    for element in rss.findall('channel/item/{%s}forecast' % WEATHER_NS):
        forecasts.append({
            'date': element.get('date'),
            'low': element.get('low'),
            'high': element.get('high'),
            'condition': element.get('text')
        })

The findall() method takes an XPath pattern (ElementTree supports a limited subset of XPath) and returns a list of matching elements. Here the pattern searches for <yweather:forecast> elements that are children of an <item> that is a child of a <channel>; the namespace is written in braces in front of the tag name. These elements have a get() method for accessing their attributes.

The next XPath pattern matches <yweather:condition> elements. Because we call find() rather than findall(), only the first match is returned.

    ycondition = rss.find('channel/item/{%s}condition' % WEATHER_NS)
    return {
        'current_condition': ycondition.get('text'),
        'current_temp': ycondition.get('temp'),
        'forecasts': forecasts,
        'title': rss.findtext('channel/title')
    }

That last line uses the findtext() method to access the text contained by the first element matching the channel/title XPath pattern.

Here's the code in full:

import urllib
from elementtree.ElementTree import parse

WEATHER_URL = 'http://xml.weather.yahoo.com/forecastrss?p=%s'
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

def weather_for_zip(zip_code):
    url = WEATHER_URL % zip_code
    rss = parse(urllib.urlopen(url)).getroot()
    forecasts = []
    for element in rss.findall('channel/item/{%s}forecast' % WEATHER_NS):
        forecasts.append({
            'date': element.get('date'),
            'low': element.get('low'),
            'high': element.get('high'),
            'condition': element.get('text')
        })
    ycondition = rss.find('channel/item/{%s}condition' % WEATHER_NS)
    return {
        'current_condition': ycondition.get('text'),
        'current_temp': ycondition.get('temp'),
        'forecasts': forecasts,
        'title': rss.findtext('channel/title')
    }
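
The same parsing logic can be exercised offline by handing parse() any file-like object. The following is a minimal sketch, assuming Python 3, where ElementTree lives in the standard library as xml.etree.ElementTree. SAMPLE and parse_weather() are hypothetical names introduced for illustration, and the XML is a hand-written stand-in rather than a real API response.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI copied from WEATHER_NS in the article.
WEATHER_NS = 'http://xml.weather.yahoo.com/ns/rss/1.0'

# Hypothetical, hand-written sample document (bytes, as a network read would return).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:yweather="http://xml.weather.yahoo.com/ns/rss/1.0">
  <channel>
    <title>Sample Weather - Lawrence, KS</title>
    <item>
      <yweather:condition text="Fair" temp="85"/>
      <yweather:forecast date="20 Jul 2006" low="75" high="103" text="Mostly Sunny"/>
      <yweather:forecast date="21 Jul 2006" low="61" high="82" text="Scattered Thunderstorms"/>
    </item>
  </channel>
</rss>"""


def parse_weather(xml_file):
    """Extract the same fields as weather_for_zip() from a file-like object."""
    rss = ET.parse(xml_file).getroot()
    forecasts = []
    # findall() returns every match; the namespace goes in braces before the tag.
    for element in rss.findall('channel/item/{%s}forecast' % WEATHER_NS):
        forecasts.append({
            'date': element.get('date'),
            'low': element.get('low'),
            'high': element.get('high'),
            'condition': element.get('text'),
        })
    # find() returns only the first match.
    ycondition = rss.find('channel/item/{%s}condition' % WEATHER_NS)
    return {
        'current_condition': ycondition.get('text'),
        'current_temp': ycondition.get('temp'),
        'forecasts': forecasts,
        'title': rss.findtext('channel/title'),
    }


# io.BytesIO wraps the sample bytes in a file-like object.
weather = parse_weather(io.BytesIO(SAMPLE))
print(weather['title'], weather['current_temp'], len(weather['forecasts']))
```

Separating the parsing from the download in this way also makes the function easy to test against saved responses.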
