Cache Yahoo! Web Service Calls using Python

One way of dramatically speeding up applications that are built using the Yahoo! Web Services APIs is to make heavy use of caching. With caching, calls that have been previously made in a specified time frame can be answered using cached data rather than making an API call over the network. This HOWTO describes two methods of caching at the HTTP retrieval layer.

Caching Intro

The level of caching you use should take into account the kind of data you are retrieving. You might be building an application that redisplays your Flickr photo sets on your own site. These sets change rarely, and you don't mind if there is a delay of up to 12 hours before a new set appears on your site. Contrast this with redisplaying your most recent links from del.icio.us, where you might want them to show up on your own site straight away, or within 5 or 10 minutes.
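
As a sketch, using the fetcher objects built in the sections below (the URLs here are placeholders, not real API endpoints), the decision comes down to the max_age value you pass, in seconds:

# Flickr photo sets change rarely: content up to 12 hours old is acceptable
photosets = fetcher.fetch('http://example.com/flickr-photosets', 12 * 60 * 60)

# del.icio.us links should show up within 5 minutes
links = fetcher.fetch('http://example.com/delicious-links', 5 * 60)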

As a reminder, the simplest way of retrieving data over HTTP is to use the urllib.urlopen() function. Let's wrap that in a simple function:

import urllib

def fetch(url):
    return urllib.urlopen(url).read()
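
You can try it at an interactive prompt:

>>> data = fetch('http://developer.yahoo.com/')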

Caching in memory

Let's create a version of the fetch() function that has its own caching layer.

import time, urllib

class CacheFetcher:
    def __init__(self):
        self.cache = {}
    def fetch(self, url, max_age=0):
        # Serve from the cache if the entry is younger than max_age seconds
        if url in self.cache:
            if time.time() - self.cache[url][0] < max_age:
                return self.cache[url][1]
        # Retrieve and cache
        data = urllib.urlopen(url).read()
        self.cache[url] = (time.time(), data)
        return data

Create an instance of the CacheFetcher class:

>>> fetcher = CacheFetcher()

Now retrieve a URL, specifying that it should not be retrieved again if it has been cached in the last 60 seconds:

>>> data = fetcher.fetch('http://developer.yahoo.com/', 60)

If you try this at an interactive prompt, there should be a short delay before the data is returned. Run the command again and the data will be returned instantly; it is already in the cache.
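
You can make the difference visible by timing the two calls (a rough sketch; the first figure depends on your network connection, the second should be close to zero):

>>> import time
>>> start = time.time()
>>> data = fetcher.fetch('http://developer.yahoo.com/', 60)
>>> print time.time() - start  # network round trip
>>> start = time.time()
>>> data = fetcher.fetch('http://developer.yahoo.com/', 60)
>>> print time.time() - start  # served from the in-memory cache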

Caching to disk

CacheFetcher is only useful for long-running Python programs, as the cache itself is stored in memory. Here is an alternative implementation that saves cached data to disk, so the cache can be shared by multiple Python processes, which is how Python web applications generally run.

import md5, os, tempfile, time, urllib

class DiskCacheFetcher:
    def __init__(self, cache_dir=None):
        # If no cache directory specified, use system temp directory
        if cache_dir is None:
            cache_dir = tempfile.gettempdir()
        self.cache_dir = cache_dir
    def fetch(self, url, max_age=0):
        # Use MD5 hash of the URL as the filename
        filename = md5.new(url).hexdigest()
        filepath = os.path.join(self.cache_dir, filename)
        if os.path.exists(filepath):
            if time.time() - os.path.getmtime(filepath) < max_age:
                return open(filepath).read()
        # Retrieve over HTTP and cache. Writing to a temporary file in the
        # cache directory and renaming it into place keeps the rename atomic
        # (same filesystem), so other processes never read a half-written file
        data = urllib.urlopen(url).read()
        fd, temppath = tempfile.mkstemp(dir=self.cache_dir)
        fp = os.fdopen(fd, 'w')
        fp.write(data)
        fp.close()
        os.rename(temppath, filepath)
        return data

Usage is similar to the in-memory cache:

fetcher = DiskCacheFetcher('/path/to/cache/directory')
fetcher.fetch('https://developer.yahoo.com/', 60)

If no argument is provided to the DiskCacheFetcher constructor, the default temp directory will be used to store the cache files. On Unix-based systems, this is /tmp.
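
For example, on a Unix system:

>>> import tempfile
>>> tempfile.gettempdir()
'/tmp'
>>> fetcher = DiskCacheFetcher()  # cache files will be written to /tmp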

These functions can now be used in place of direct calls to urllib.urlopen(). This provides a simple but robust mechanism for caching API calls, speeding up your application and reducing the overall number of calls you have to make.
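
As a sketch of the substitution (the URL here is a placeholder, not a real API endpoint):

import urllib

url = 'http://example.com/some-api-call'

# Before: every call goes over the network
data = urllib.urlopen(url).read()

# After: repeat calls within 5 minutes are served from the disk cache
fetcher = DiskCacheFetcher('/path/to/cache/directory')
data = fetcher.fetch(url, 300)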
