Yahoo! recently launched what we believe is the worlds largest Apache Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a more than 10,000 core Linux cluster and produces data that is now used in every Yahoo! Web search query.
The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site. This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.
Some Webmap size data:
- Number of links between pages in the index: roughly 1 trillion links
- Size of output: over 300 TB, compressed!
- Number of cores used to run a single Map-Reduce job: over 10,000
- Raw disk used in the production cluster: over 5 Petabytes
This process is not new (see the AltaVista connectivity server). What is new is the use of Hadoop. Hadoop has allowed us to run the identical processing we ran pre-Hadoop on the same cluster in 66% of the time our previous system took. It does that while simplifying administration. Further we believe that as we continue to scale up Hadoop, we will be able to scale up our production jobs as needed to larger cluster sizes.
Our team is very excited about the deployment of the Yahoo! Webmap on Hadoop because it demonstrates that although Hadoop is still at a very early stage in its development (perhaps even immature), Hadoop is now capable of handling truly Internet scale projects in a cost effective manner. This and a number of other production system deployments in Yahoo! and other organizations demonstrate that Hadoop is gaining traction in the market and adding real value.
The Yahoo! Grid team has been enhancing and using Hadoop for various research and development tasks since march 2006. We are proud of our role in taking Hadoop from a system that worked on dozens of computers two years ago, to a system that runs on thousands of computers today. The Webmap launch demonstrates the power of Hadoop to solve truly Internet-sized problems and to function reliably in a large scale production setting. We can now say that the results generated by the billions of Web search queries run at Yahoo! every month depend to a large degree on data produced by Hadoop clusters.
For more details about Yahoo!s Webmap project and the work that has gone into scaling Hadoop to support it, see an interview with two long-time colleges of mine, Arnab Bhattacharjee (manager of the Yahoo! Webmap Team) and Sameer Paranjpye (manager of our Hadoop development), embedded above.
Senior Director, Grid Computing