
February 19, 2008
Yahoo! recently launched what we believe is the world's largest Apache Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.
The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.
Some Webmap size data:
This process is not new (see the AltaVista connectivity server). What is new is the use of Hadoop. Hadoop has allowed us to run the identical processing we ran pre-Hadoop, on the same cluster, in 66% of the time our previous system took. It does that while simplifying administration. Further, we believe that as we continue to scale up Hadoop, we will be able to scale up our production jobs as needed to larger cluster sizes.
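To make the "66% of the time" figure concrete, here is a minimal arithmetic sketch. It assumes only the 0.66 wall-clock ratio quoted above; the function name is hypothetical and chosen purely for illustration. Running a job in 66% of the old time means 34% less time, which corresponds to roughly a 1.5x speedup, not "66% faster".

```python
def speedup_from_time_fraction(time_fraction: float) -> float:
    """Return the speedup implied by finishing in `time_fraction` of the old wall-clock time."""
    if time_fraction <= 0:
        raise ValueError("time_fraction must be positive")
    return 1.0 / time_fraction


# Assumption: the only input is the 66% figure quoted in the post.
fraction = 0.66
speedup = speedup_from_time_fraction(fraction)
time_saved = 1.0 - fraction

print(f"speedup: {speedup:.2f}x")       # speedup: 1.52x
print(f"time saved: {time_saved:.0%}")  # time saved: 34%
```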
Our team is very excited about the deployment of the Yahoo! Webmap on Hadoop because it demonstrates that although Hadoop is still at a very early stage in its development (perhaps even immature), Hadoop is now capable of handling truly Internet-scale projects in a cost-effective manner. This and a number of other production system deployments in Yahoo! and other organizations demonstrate that Hadoop is gaining traction in the market and adding real value.
The Yahoo! Grid team has been enhancing and using Hadoop for various research and development tasks since March 2006. We are proud of our role in taking Hadoop from a system that worked on dozens of computers two years ago to a system that runs on thousands of computers today. The Webmap launch demonstrates the power of Hadoop to solve truly Internet-sized problems and to function reliably in a large-scale production setting. We can now say that the results generated by the billions of Web search queries run at Yahoo! every month depend to a large degree on data produced by Hadoop clusters.
For more details about Yahoo!'s Webmap project and the work that has gone into scaling Hadoop to support it, see an interview with two long-time colleagues of mine, Arnab Bhattacharjee (manager of the Yahoo! Webmap Team) and Sameer Paranjpye (manager of our Hadoop development), embedded above.
Eric Baldeschwieler
Senior Director, Grid Computing
Yahoo! Inc.
Posted at February 19, 2008 7:13 AM
Is "66% faster" normalized with respect to the number of cores used? (That is, were you also previously using 10k cores?) Another interesting figure would be the average core utilization across the cluster over the course of the job. Would you know that one? :-)
Giovanni
Posted by: Giovanni Tummarello at February 19, 2008 11:21 AM
66% of the time != 66% faster
Posted by: at February 19, 2008 11:24 AM
Wow, on the same machines? What happened to Yahoo during the switch over?
Posted by: Tim Wintle at February 19, 2008 1:02 PM
Awesome! This has got to be sweet.. Congratulations! :)
Posted by: Naveen Koorakula at February 19, 2008 1:16 PM
This does run on the same machines. We have a couple of banks to allow us to roll on new software, so the transition was seamless.
(And thanks, Naveen, it is a fun milestone.)
Posted by: Eric Baldeschwieler at February 19, 2008 4:01 PM
Congratulations to the Yahoo search and grid computing teams and to the Hadoop community on this fantastic accomplishment!
Posted by: Chad Walters at February 19, 2008 5:33 PM
Congratulations!
One question: what is the main difference between the Hadoop used at Yahoo! and the open source version?
Posted by: Mafish at February 19, 2008 5:59 PM
Good question Mafish, I would also like to know when/if all internal changes will be released to the publicly available version of Hadoop.
Posted by: Thijs at February 19, 2008 6:26 PM
We use the same version of Hadoop internally that's available from the public code repository. We don't have an "internal version" of Hadoop at Yahoo.
Posted by: Jeremy Zawodny at February 19, 2008 6:31 PM
Cool, man, this is going to beat Google soon.
Yahoo is always the best because it's first.
Posted by: ARUN REDDY. M at February 19, 2008 7:08 PM
Congratulations to the Yahoo Hadoop team on the very impressive Hadoop rollout.
I wonder what OS & Java version Hadoop nodes run on at Yahoo. Is it FreeBSD?
Thanks.
Posted by: Trung Nguyen at February 19, 2008 8:38 PM
This deployment was on Linux. We're using Java 6 which has some nice scaling features for nio Selectors etc.
Posted by: Sameer Paranjpye at February 19, 2008 11:47 PM
Can you look at getting a Flash player with better buffering? Requiring a constant 500kb/s is a bit hard for some people's Internet connections. Usually I hit "Play" on something and then "Pause" until it has buffered well ahead, but your client doesn't appear to support this, so I am getting a very blippy experience. Sorry :(
Posted by: Simon at February 20, 2008 2:41 AM
Congratulations on the milestone!
But it's unbelievable that you haven't made any changes to Hadoop, because many parts of it are immature.
Posted by: Desertfox at February 20, 2008 10:46 PM
If a researcher is interested in running some Web algorithm on, say, an 80-core cluster (the scale of a university lab cluster), would four months of a summer intern's time be sufficient to get it configured, tuned, and working?
Posted by: Mike at February 22, 2008 8:14 PM
Mike - it really should be quite easy to set up. Please take a look at the documentation at http://hadoop.apache.org/core/docs/r0.16.0/ and shout out at core-user@hadoop.apache.org if you need any clarifications or help.
Posted by: Arun C Murthy at February 25, 2008 9:54 AM
Mike - We built a 64-node cluster at our university very quickly and had created some sizable programs within three months. Definitely doable. I've documented some of it on my blog.
Posted by: Jakob Homan at February 28, 2008 2:39 PM
Hadoop is a trademark of the Apache Software Foundation.
Copyright © 2008 Yahoo! Inc. All rights reserved.