
Mailtrust Hadoop Talk in Virginia on Monday (Hadoop and Distributed Computing at Yahoo!)

February 22, 2008

Just a quick heads up to Hadoop fans in the Virginia area. Bill Boebel, CTO of Mailtrust, will be giving a MapReduce vs. SQL Talk on Monday the 25th. (Mailtrust is the email division of Rackspace, a large hosting provider.)

Stu Hood, one of Mailtrust's software engineers, wrote about MapReduce at Rackspace back in January, detailing how they use Hadoop to process "several hundred gigabytes of email log data" every day.

The way it works is that raw logs get streamed from hundreds of mail servers to the Hadoop Distributed File System ("HDFS") in real time, and scheduled MapReduce jobs run to index the new data using Apache Lucene and Solr. Once the indexes have been built, they are compressed and stored away in HDFS. Each Hadoop datanode also runs a Tomcat servlet container, which hosts a number of Solr instances that pull and merge the new indexes, and provide really fast search results to our support team.

Additionally, using MapReduce we are now able to look at our log data in all sorts of interesting ways. For example, we run nightly MapReduce jobs to collect statistics about our mail system, such as spam counts by domain, bytes transferred and number of logins. Now whenever we think of a complex question about our customers’ usage patterns, we can pull the answer from our logs within hours via MapReduce. This is powerful stuff.
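
To make the "spam counts by domain" idea a bit more concrete, here is a rough sketch of what such a job might look like. It is not Mailtrust's code: the log line format, the sample data, and the function names (map_line, reduce_domain, run_job) are all made up for illustration, and the whole thing runs in a single Python process rather than on a Hadoop cluster. It only shows the map, shuffle, and reduce steps in miniature.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical log format (an assumption, not Mailtrust's real schema):
#   <recipient address> <verdict: spam|ham> <bytes transferred>
SAMPLE_LOG = [
    "alice@example.com spam 2048",
    "bob@example.com ham 512",
    "carol@example.org spam 1024",
    "dave@example.com spam 4096",
]


def map_line(line):
    """Map step: emit (domain, (spam_flag, bytes)) for one log line."""
    address, verdict, size = line.split()
    domain = address.rsplit("@", 1)[1].lower()
    yield domain, (1 if verdict == "spam" else 0, int(size))


def reduce_domain(domain, values):
    """Reduce step: sum the spam flags and byte counts for one domain."""
    spam = 0
    total_bytes = 0
    for spam_flag, size in values:
        spam += spam_flag
        total_bytes += size
    return domain, {"spam": spam, "bytes": total_bytes}


def run_job(lines):
    """Simulate map, shuffle (sort and group by key), and reduce locally."""
    mapped = [pair for line in lines for pair in map_line(line)]
    mapped.sort(key=itemgetter(0))  # stand-in for the shuffle phase
    results = {}
    for domain, group in groupby(mapped, key=itemgetter(0)):
        key, stats = reduce_domain(domain, (value for _, value in group))
        results[key] = stats
    return results


if __name__ == "__main__":
    for domain, stats in sorted(run_job(SAMPLE_LOG).items()):
        print(domain, stats)
```

On a real cluster the framework would handle the sort-and-group step and spread the map and reduce calls across many machines; the per-record logic would stay about this simple.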

Read the whole posting for some interesting email stats they extracted.

Bill's talk should provide an excellent overview of Hadoop and some good insight into the Rackspace deployment.

Jeremy Zawodny
Yahoo! Developer Network

Posted at February 22, 2008 11:11 AM


Comments

Hello, everyone.
Does anyone have experience running Hadoop cluster nodes on Amazon S3 storage via vendors like RightScale? I/O throughput would be a big concern, right?
If anyone can comment on running Hadoop nodes on Amazon S3, I would really appreciate it.

Thanks.

Posted by: Trung Nguyen at February 22, 2008 5:59 PM


Hadoop is a trademark of the Apache Software Foundation.

Copyright © 2008 Yahoo! Inc. All rights reserved.
