Just a quick heads-up to Hadoop fans in the Virginia area. Bill Boebel, CTO of Mailtrust, will be giving a MapReduce vs. SQL talk on Monday the 25th. (Mailtrust is the email division of Rackspace, a large hosting provider.)
Stu Hood, one of Mailtrust's software engineers, wrote about MapReduce at Rackspace back in January, detailing how they use Hadoop to process "several hundred gigabytes of email log data" every day.
The way it works is that raw logs get streamed from hundreds of mail servers to the Hadoop Distributed File System (“HDFS”) in real time, and scheduled MapReduce jobs run to index the new data using Apache Lucene and Solr. Once the indexes have been built, they are compressed and stored away in HDFS. Each Hadoop datanode also runs a Tomcat servlet container, which hosts a number of Solr instances that pull and merge the new indexes, and provide really fast search results to our support team.
Additionally, using MapReduce we are now able to look at our log data in all sorts of interesting ways. For example, we run nightly MapReduce jobs to collect statistics about our mail system, such as spam counts by domain, bytes transferred and number of logins. Now whenever we think of a complex question about our customers’ usage patterns, we can pull the answer from our logs within hours via MapReduce. This is powerful stuff.
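To give a feel for what a "spam counts by domain" job might look like, here is a minimal sketch in plain Python that imitates the map, shuffle and reduce phases in a single process. It is not Mailtrust's code and does not use Hadoop itself. The log line format (`<timestamp> <recipient> <verdict>`) and the names `map_spam`, `reduce_counts` and `run_job` are assumptions made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_spam(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (domain, 1) for every log line flagged as spam.

    Assumed (hypothetical) line format: "<timestamp> <recipient> <verdict>".
    Lines that do not match this shape are skipped.
    """
    fields = line.split()
    if len(fields) == 3 and fields[2] == "SPAM" and "@" in fields[1]:
        domain = fields[1].rsplit("@", 1)[1].lower()
        yield domain, 1


def reduce_counts(domain: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: sum all the counts emitted for one domain."""
    return domain, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, group by key (the 'shuffle'), then reduce, all in memory."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for domain, count in map_spam(line):
            grouped[domain].append(count)
    return dict(reduce_counts(d, c) for d, c in grouped.items())


if __name__ == "__main__":
    # Made-up sample lines, only to exercise the sketch.
    sample = [
        "2008-02-20T10:00:00 alice@example.com SPAM",
        "2008-02-20T10:00:01 bob@example.com HAM",
        "2008-02-20T10:00:02 carol@test.org SPAM",
    ]
    print(run_job(sample))
```

On a real cluster the framework would handle the grouping step and spread the map and reduce calls across many machines, which is what makes this pattern workable for hundreds of gigabytes of logs per day.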
Read the whole posting for some interesting email stats they extracted.
Bill's talk should provide an excellent overview of Hadoop and some good insight into the Rackspace deployment.
Yahoo! Developer Network