Somewhat to my surprise, I was recently asked why Yahoo has put so much into Apache Hadoop. We currently have nearly 100 people working on Apache Hadoop and related projects, such as Pig, ZooKeeper, Hive, Howl, HBase and Oozie. Over the last 5 years, we've invested nearly 300 person-years into these projects. The Hadoop team at Yahoo is so passionate about our open source mission, and we've been doing this for so long, that we tend to assume that everyone understands our position. The recent evidence to the contrary motivates this post.
Back in January 2006, when we decided to invest in scaling Hadoop from an interesting prototype to the robust scalable framework it is today, it was obvious that our direct competitors had built or were building private implementations of map-reduce and clustered storage. We didn't believe that this type of infrastructure would bring sustainable advantage to any one competitor: the needs of Web Search at the time were driving everyone in a similar direction. Thus, instead of building yet another private implementation, we believed that investing in an Open Source solution would bring Yahoo! numerous benefits.
From that initial kernel of an idea, we developed a wider list of positive outcomes we expected from investing in an open source map-reduce platform. We reasoned that, if the scientists we wanted to hire already knew Hadoop, then they would know Yahoo had a great big data platform. We believed the rigors of Open Source development would help produce better code and help us avoid maintaining an obsolete private infrastructure down the road. We predicted that everyone would gain from the creation of an ecosystem of users and developers around an open standard for big data management and analysis.
Looking back now, things have worked out even better than we predicted. Pretty much every big internet company is using Hadoop to some extent, including some we never expected, such as Microsoft and Google! By embracing Hadoop we have met our own needs for big data systems and gained several advantages we would not have had from a private solution, including help building and testing Hadoop, access to talent trained on Hadoop, easier collaboration with others in our space, and good will from doing good.
Let's review our expected results from investing in Hadoop:
Help recruiting world class scientists - This has been a great success. Today Yahoo runs on over 40,000 Hadoop machines (>300k cores). They are used by over a thousand regular users from our science and development teams. Hadoop is at the center of our research in search, advertising, spam detection, personalization and many other topics. Our Hadoop infrastructure is key to our science efforts, and these have led to many revenue-driving product improvements.
Help building Hadoop & new tools - We are now able to take advantage of big Hadoop projects like HBase and Hive that we've not needed to build ourselves. We're also seeing an explosion of interest in contributing to Hadoop itself. Apache Hadoop has added more committers in the last month than there were developers in the entire Hadoop community in 2006.
Access to trained talent and easier collaboration - It is now routine for us to hire scientists and developers with previous Hadoop experience. Many of our partners use Hadoop extensively too. Recently we acquired our first Hadoop-based startup (dapper.net). This is a huge validation of our open strategy. At this point, it seems that half the startups in the valley have Hadoop and/or HBase as a major part of their technology mix, so we're sure this will not be the last time we partner up with another company that uses Hadoop extensively.
Avoiding Obsolescence - Instead of trying to decide when to abandon our private solution in favor of a new industry standard, we are watching Hadoop itself become an industry standard. You can now get Hadoop support and tools from major enterprise players, such as Amazon, IBM and others. Hadoop is going to get better for years to come, with or without further investment from Yahoo.
Good will from doing good - Judging by metrics such as the size of our annual Hadoop Summits and HUGs, the traffic on the lists and the number of companies on the Powered by Hadoop page, Hadoop is now out there, solving world class problems in all sorts of places we never imagined. One of my favorite statistics is that Hadoop helps eHarmony drive 2% of US marriages (236 / day) (http://slidesha.re/aDkH6a).
So there you have it. The Yahoo team is committed to open source because we love seeing our work change the world and because open sourcing our work via Apache Hadoop remains the most cost-effective way for Yahoo to meet key business goals.
eric14 a.k.a. Eric Baldeschwieler
VP Hadoop Software Development @Yahoo!