For the last several years, every company involved in building large web-scale systems has faced some of the same fundamental challenges. While nearly everyone agrees that the "divide-and-conquer using lots of cheap hardware" approach to breaking down large problems is the only way to scale, doing so is not easy.
The underlying infrastructure has always been a challenge. You have to buy, power, install, and manage a lot of servers. Even if you use somebody else's commodity hardware, you still have to develop the software that'll do the divide-and-conquer work to keep them all busy.
It's hard work. And it needs to be commoditized, just like the hardware has been...
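To make the divide-and-conquer idea concrete, here is a minimal, purely illustrative sketch of the pattern Hadoop's Map-Reduce generalizes: split the input across workers, have each emit intermediate key/value pairs, then merge the results by key. This toy counts words in memory; the names (`map_phase`, `reduce_phase`) are our own and not Hadoop APIs.

```python
from collections import defaultdict

def map_phase(chunk):
    # Each "worker" independently emits (word, 1) pairs for its chunk.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Intermediate results from all workers are merged by key.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The input is partitioned across machines (here, just list elements).
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(mapped)
print(counts["the"])  # -> 3: each chunk's occurrences are summed
```

In a real cluster the map calls run in parallel on different machines, the framework shuffles pairs so that all values for a key reach the same reducer, and the filesystem handles replication and failures; the hard part is that plumbing, not the two functions above.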
We too have been dealing with this at Yahoo. Analyzing petabytes of data takes a lot of CPU power and storage. And given the way our needs (and the web as a whole) have been growing, there will likely be dozens of similarly demanding applications before long.
To build the necessary software infrastructure, we could have gone off to develop our own technology, treating it as a competitive advantage, and charged ahead. But we've taken a slightly different approach. Realizing that a growing number of companies and organizations are likely to need similar capabilities, we got behind the work of Doug Cutting (creator of the open source Nutch and Lucene projects) and asked him to join Yahoo to help deploy and continue working on the [then new] open source Hadoop project.
What started here as a 20-node cluster in March of 2006 was up to nearly 200 nodes a month later and has continued to grow as it eats terabytes and terabytes of data. Not long after that, our code contributions back to Hadoop really started to ramp up as well.
Here's a quick timeline of how things have progressed since then...
- 2004 - Initial versions of what is now the Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
- December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
- January 2006 - Doug Cutting joins Yahoo!
- February 2006 - Apache Hadoop project officially started to support the standalone development of Map-Reduce and HDFS.
- March 2006 - Formation of the Yahoo! Hadoop team
- April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
- May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
- May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
- October 2006 - Research cluster reaches 600 Nodes
- December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
- January 2007 - Research cluster reaches 900 nodes
- April 2007 - Research clusters - two clusters of 1,000 nodes each
By supporting and contributing to an open source grid computing project, we hope to be part of providing a solid, efficient, and scalable system that anyone can use to attack the types of problems and data sets that are becoming more common on the web. And since it's open source, everyone benefits from the expertise of developers and users around the world. We've already seen similar benefits from our use and support of Apache, PHP, and MySQL (just to name a few).
As we noted last week, Doug and Eric Baldeschwieler (Yahoo's Director of Grid Computing) are presenting Meet Hadoop at the 2007 Open Source Convention this week. While this is one of the first times we're really talking about our involvement with Hadoop in public, it certainly won't be the last.
Looking ahead and thinking about how the economics of large-scale computing continue to improve, it's not hard to imagine a time when Hadoop and Hadoop-powered infrastructure are as common as the LAMP (Linux, Apache, MySQL, Perl/PHP/Python) stack that helped to power the previous growth of the Web. We're already seeing universities begin to teach about Hadoop (University of Washington) and look at building their own clusters (Carnegie Mellon University).
We're still in the very early days of this revolution and very proud to be part of it.
Yahoo! Developer Network