Hadoop and Distributed Computing at Yahoo!: February 2008 Archives

Grid Computing Archive

February 28, 2008

Hadoop Summit Nearly Full! We'll Webcast and Post Video...

We've been quite impressed by the overwhelming interest in the upcoming Hadoop Summit in late March. Over 150 people have marked themselves as attending or watching on Upcoming.org.

In fact, we hit our attendance cap a mere 36 hours after announcing the event. So there's clearly a lot of interest in the program. As a result, we're opening up more slots today.

We're going to do our best to make as much of the proceedings available online as quickly as possible. We'd already planned to make high-quality video recordings of all the talks available on YDN Theater a week or so after the event.

Now we're also looking at ways to open it up during the event. Expect to see a live webcast and a live chat room so that you can ask questions and chat with those who are physically in Sunnyvale, California for the summit. Once all the details are nailed down, we'll post on the Hadoop blog and on the Hadoop Summit site.

We could use IRC, Yahoo! Live, or other options. If you have an opinion, drop a note in the comments.

Thanks for all the interest! This is an exciting time for Hadoop.

Jeremy Zawodny
Yahoo! Developer Network

Posted by jzawodn at 12:10 PM | Comments (1) | TrackBack

February 27, 2008

Upcoming HBase User Group Meeting In San Francisco

This is just a quick heads-up to let everyone know that the folks at Powerset are hosting a meeting focused on HBase in early March:

Powerset is hosting the first user group meeting for HBase, a robust, scalable, distributed, column-oriented store capable of hosting billions of rows of sparse, structured data. HBase is open source and part of the Hadoop project.
Definitely come if you're a current user of HBase, or if your company has plans for a huge data store and you're evaluating solutions.

See the event on upcoming.org to sign up.

Jeremy Zawodny
Yahoo! Developer Network

Posted by jzawodn at 3:25 PM | Comments (2) | TrackBack

February 22, 2008

Mailtrust Hadoop Talk in Virginia on Monday

Just a quick heads-up to Hadoop fans in the Virginia area. Bill Boebel, CTO of Mailtrust, will be giving a MapReduce vs. SQL talk on Monday the 25th. (Mailtrust is the email division of Rackspace, a large hosting provider.)

Stu Hood, one of Mailtrust's software engineers, wrote about MapReduce at Rackspace back in January, detailing how they use Hadoop for processing "several hundred gigabytes of email log data" every day.

The way it works is that raw logs get streamed from hundreds of mail servers to the Hadoop Distributed File System ("HDFS") in real time, and scheduled MapReduce jobs run to index the new data using Apache Lucene and Solr. Once the indexes have been built, they are compressed and stored away in HDFS. Each Hadoop datanode also runs a Tomcat servlet container, which hosts a number of Solr instances that pull and merge the new indexes, and provide really fast search results to our support team.
Additionally, using MapReduce we are now able to look at our log data in all sorts of interesting ways. For example, we run nightly MapReduce jobs to collect statistics about our mail system, such as spam counts by domain, bytes transferred and number of logins. Now whenever we think of a complex question about our customers’ usage patterns, we can pull the answer from our logs within hours via MapReduce. This is powerful stuff.
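
To make the "spam counts by domain" example concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is not Mailtrust's code and does not use Hadoop itself; the log line format, the sample records, and every function name below are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical log format (an assumption, not Mailtrust's real format):
#   "<timestamp> <recipient address> <verdict>"
SAMPLE_LOG = [
    "2008-02-21T10:00:01 alice@example.com spam",
    "2008-02-21T10:00:02 bob@example.com ham",
    "2008-02-21T10:00:03 carol@example.org spam",
    "2008-02-21T10:00:04 dave@Example.com spam",
    "malformed line",
]


def map_record(line):
    """Map step: emit (domain, 1) for each well-formed record marked as spam."""
    parts = line.split()
    if len(parts) != 3:
        return  # skip malformed records instead of failing the whole job
    _, recipient, verdict = parts
    if verdict == "spam" and "@" in recipient:
        # Normalize case so Example.com and example.com count together.
        yield recipient.rsplit("@", 1)[1].lower(), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(domain, values):
    """Reduce step: sum the per-record counts for one domain."""
    return domain, sum(values)


def spam_counts_by_domain(lines):
    """Run map, shuffle, and reduce in-process over an iterable of log lines."""
    pairs = (pair for line in lines for pair in map_record(line))
    return dict(reduce_counts(k, vs) for k, vs in shuffle(pairs).items())


if __name__ == "__main__":
    print(spam_counts_by_domain(SAMPLE_LOG))
```

On a real cluster the map and reduce functions would run on many nodes in parallel and the framework would handle the grouping; the sketch only shows the shape of the computation.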

Read the whole posting for some interesting email stats they extracted.

Bill's talk should provide an excellent overview of Hadoop and some good insight into the Rackspace deployment.

Jeremy Zawodny
Yahoo! Developer Network

Posted by jzawodn at 11:11 AM | Comments (1) | TrackBack

February 20, 2008

Announcing the Hadoop Summit at Yahoo, March 25th, 2008

With all the growing interest in Hadoop (especially after yesterday's news), it seems like a fitting time to mention the Hadoop Summit we're hosting on March 25th:

Yahoo! is hosting the first summit on Apache Hadoop on March 25th in Sunnyvale. The summit is sponsored by the Computing Community Consortium (CCC) and brings together leaders from the Hadoop developer and user communities. Hadoop is now being used by companies in production environments, by academic and industrial research groups, and at universities for teaching data parallel computing. The speakers will cover topics in the areas of extensions being developed for Hadoop, case studies of applications being built and deployed on Hadoop, and a discussion on future directions for the platform.

The lineup of speakers is a who's who of the Hadoop community and they all have some great experience to share.

Sign up over on Upcoming.org.

Jeremy Zawodny
Yahoo! Developer Network

Posted by jzawodn at 11:29 AM | Comments (3) | TrackBack

February 19, 2008

Yahoo! Launches World's Largest Hadoop Production Application

Yahoo! recently launched what we believe is the world's largest Apache Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.

The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.

Some Webmap size data:

This process is not new (see the AltaVista connectivity server). What is new is the use of Hadoop. Hadoop has allowed us to run the identical processing we ran pre-Hadoop on the same cluster in 66% of the time our previous system took. It does that while simplifying administration. Further, we believe that as we continue to scale up Hadoop, we will be able to scale up our production jobs as needed to larger cluster sizes.
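
As a rough reading of the "66% of the time" figure, the short Python sketch below converts a run-time fraction into the equivalent throughput multiplier and time saved. The function name is made up for illustration, and the calculation assumes the same workload on the same hardware, as the paragraph above states.

```python
def speedup_from_time_fraction(fraction):
    """Return (throughput multiplier, fraction of time saved).

    `fraction` is new run time divided by old run time, e.g. 0.66.
    """
    if fraction <= 0:
        raise ValueError("fraction must be positive")
    return 1.0 / fraction, 1.0 - fraction


if __name__ == "__main__":
    speedup, saved = speedup_from_time_fraction(0.66)
    # prints "1.52x faster, 34% less wall-clock time"
    print(f"{speedup:.2f}x faster, {saved:.0%} less wall-clock time")
```

In other words, finishing in 66% of the time is roughly a 1.5x speedup, or about a third less wall-clock time per build.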

Our team is very excited about the deployment of the Yahoo! Webmap on Hadoop because it demonstrates that although Hadoop is still at a very early stage in its development (perhaps even immature), it is now capable of handling truly Internet-scale projects in a cost-effective manner. This and a number of other production system deployments in Yahoo! and other organizations demonstrate that Hadoop is gaining traction in the market and adding real value.

The Yahoo! Grid team has been enhancing and using Hadoop for various research and development tasks since March 2006. We are proud of our role in taking Hadoop from a system that worked on dozens of computers two years ago to a system that runs on thousands of computers today. The Webmap launch demonstrates the power of Hadoop to solve truly Internet-sized problems and to function reliably in a large-scale production setting. We can now say that the results generated by the billions of Web search queries run at Yahoo! every month depend to a large degree on data produced by Hadoop clusters.

For more details about Yahoo!'s Webmap project and the work that has gone into scaling Hadoop to support it, see an interview with two long-time colleagues of mine, Arnab Bhattacharjee (manager of the Yahoo! Webmap Team) and Sameer Paranjpye (manager of our Hadoop development), embedded above.

Eric Baldeschwieler
Senior Director, Grid Computing
Yahoo! Inc.

Posted by jzawodn at 7:13 AM | Comments (20) | TrackBack

February 4, 2008

Hadoop Mention on Yahoo! Earnings Call

In her prepared remarks during last week's quarterly earnings call, Yahoo! President Sue Decker said the following:

Although we haven’t elaborated on this much publicly, on the back end we have made a major investment in open source development of grid computing which provides a substantially greater scalability at fast iteration on core technologies. This is already dramatically impacting our competitiveness in algorithmic search and advertising.
For example, in some cases we have an order of magnitude 10x improvement in indexing speed. This has been a multi-year project and we’re on track to have our future Search and advertising systems built on the new infrastructure, positioning us well for acceleration in iteration and experiments that are likely to lead to significant future product enhancements.

In other words, we're using Hadoop more and more internally to dramatically improve some of our operations. We'll be talking a lot more about that in the coming weeks, along with introducing some key members of the grid team here at Yahoo!

Stay tuned...

Jeremy Zawodny
Yahoo! Developer Network

Posted by jzawodn at 5:49 PM | Comments (0) | TrackBack

Copyright © 2008 Yahoo! Inc. All rights reserved.
