Announcing the Yahoo! Distribution of Hadoop

Today we're announcing the general availability of the Yahoo! Distribution of Hadoop, a source-only distribution of Apache Hadoop that we deploy here at Yahoo!.

As the quality and release engineering manager for grid technologies at Yahoo!, which include Hadoop, I'm excited about what this release means for the larger Hadoop ecosystem. Here's why:
  1. We're opening up the results of our investment in quality engineering and scale deployments to the Apache Hadoop community and surrounding ecosystem.
  2. We're publishing a frequently updated source distribution that provides a robust foundation on which others can build and deploy their own enterprise distributions, support, and solutions.
  3. We're committing to keep all of our source code changes for our distributions available as patches in the Apache Hadoop community.

Opening our investment in quality engineering and scale deployments

We spend thousands of machine hours testing each release of Hadoop that we deploy internally. We run automated unit, functional, system, and performance tests over a two-day period on our 500-machine test cluster. This includes interoperability testing of the cross-cluster data-copying tool (distcp), HDFS and MapReduce benchmarks, and various fault scenarios. All of the unit and performance tests are currently available in Apache Hadoop. We are working toward contributing the functional and system tests back to the community.

We deploy Hadoop on tens of thousands of machines. These machines are divided into a few tiers, each with many large clusters. To support internal feature requests and reliability requirements, we test and deploy frequent bug-fix and feature releases to an experimental tier of clusters. Once sufficiently stable, these releases progress to additional tiers, eventually landing on a production tier, where Hadoop provides a mission-critical platform for many core business units at Yahoo!. As a release stabilizes and progresses to new tiers, we inevitably discover new issues, fix them, and quickly test and deploy new micro releases.

All of this investment in testing and stabilizing Hadoop is now available to anyone.

Providing a robust foundation for other distributions, support, and solutions

This distribution is largely a response to the numerous requests that we have received to share Yahoo!'s internally tested and scale-proven releases. As the pace of Hadoop adoption has increased, so have requests for these releases. The Yahoo! Distribution of Hadoop provides a base on which others can build their own distributions, commercial support, and solutions. I believe this will broaden the use of Hadoop, speed its development and growth, and improve its quality, from which we will all benefit. To be clear, this is not a new business for Yahoo!. We will not be providing support or services for our distribution, but we hope that by releasing our internally tested version, third parties will build enterprise support and services on top of our distribution.

Providing all our patches under the Apache License

The pace of our internal releases and the demand for new features have required a number of features to be back-ported internally. With this release, we're committing to contribute these internally back-ported features back to the community and to ensure that all code in the Yahoo! Distribution of Hadoop is either in the Apache code repository or posted as patches in the Apache Hadoop community.

Hadoop is helping us solve key science and research problems in hours or days instead of months. It gives us a platform for solving extreme problems that require massive amounts of data processing. It underpins major revenue-generating systems. Opening our distribution enables a faster pace of innovation for the entire Hadoop ecosystem and broadens the use, and ultimately improves the quality, of this key platform across the industry.

Go get it!

Nigel Daley
Quality and Release Engineering Manager
Yahoo! Grid Technologies