• Hadoop Test-Related Issues

    I'm getting together with some of the Hadoop committers tomorrow. Considering my quality engineering background, these are some of the discussion items at the top of my mind for the project:

    1. Code Review Guidelines: I wrote these up a couple years ago. Are they being followed? Are they the right set? How can we raise the quality of the code reviews being performed before patches are committed?
    2. Feature Design Documentation: Can we agree that each feature needs a design doc? A proposed template is attached to HADOOP-5587.
    3. Feature Test Plans: Can we agree that each feature needs a test plan? A proposed template is also attached to HADOOP-5587.
    4. Warnings: We're working to reduce static analysis (Findbugs), compiler (javac), and documentation (javadoc) warnings to zero. Can we commit to keeping them there?
    5. Fault Injection Framework: We're working on a fault-injection framework so that my team and others can write tests that inject faults and monitor the effects. The current work is being …
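
    For illustration, here is a minimal hand-rolled sketch of the fault-injection idea (all names invented; the framework under development is a separate, more general design):

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical sketch of fault injection, not the actual framework: the
    // component under test exposes a "fault point" that a test can arm, so the
    // test can force a failure at a precise spot and observe the recovery.
    public class FaultPointSketch {
        private static final AtomicBoolean FAIL_NEXT_WRITE = new AtomicBoolean(false);

        // a test arms the fault point before exercising the write path
        public static void injectWriteFault() { FAIL_NEXT_WRITE.set(true); }

        // stand-in for a real write path in the component under test
        static void writeBlock(byte[] data) throws IOException {
            if (FAIL_NEXT_WRITE.getAndSet(false)) {
                throw new IOException("injected fault: simulated disk failure");
            }
            // ... the real write would happen here ...
        }

        public static void main(String[] args) {
            injectWriteFault();
            try {
                writeBlock(new byte[] {1, 2, 3});
                System.out.println("unexpected: write succeeded");
            } catch (IOException expected) {
                // a real test would now assert on the component's recovery behavior
                System.out.println("observed: " + expected.getMessage());
            }
        }
    }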
  • Announcing the Yahoo! Distribution of Hadoop

    Today we're announcing the general availability of the Yahoo! Distribution of Hadoop, a source-only distribution of Apache Hadoop that we deploy here at Yahoo!.

    In my role as quality and release engineering manager for grid technologies at Yahoo!, including Hadoop, I'm really excited about what this release means for the larger Hadoop ecosystem. Here's why:
    1. We're opening up the results of our investment in quality engineering and scale deployments to the Apache Hadoop community and surrounding ecosystem.
    2. We're publishing a frequent source distribution that provides a robust foundation on which others can build and deploy their own enterprise distributions, support, and solutions.
    3. We're committing to keep all of our source code changes for our distributions available as patches in the Apache Hadoop community.

    Opening our investment in quality engineering and scale deployments

    We spend thousands of machine hours testing each release of Hadoop that we deploy internally. We run …
  • Hadoop computes the 10^15+1st bit of π

    I used Yahoo!'s Hadoop clusters to compute the 1,000,000,000,000,001st bit of π. The 7 hexadecimal digits of π starting at the 10^15+1st bit are:


    Although Hadoop is primarily used for data-intensive applications, it can also be used to run CPU-intensive jobs on many machines. Computing a range of bits of π using a BBP-type (Bailey-Borwein-Plouffe) formula requires a large number of arithmetic operations, and therefore a lot of CPU, but very little storage. When computing the 10^15+1st bit of π, the first 30% of the computation was done in idle slots of our Hadoop clusters, spread over 20 days. The remaining 70% was finished over a weekend on the Hammer cluster, which was also used for the petabyte sort benchmark.
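
    To see why this workload is CPU-bound, here is a hedged single-machine sketch of BBP-type digit extraction (the textbook algorithm, not the distributed MapReduce code; class and method names are invented). The head terms of the series are independent, which is what makes the computation easy to spread across map tasks:

    import java.math.BigInteger;

    // pi = sum_{k>=0} 16^-k * (4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6))
    public class BbpSketch {

        // fractional part of sum_{k>=0} 16^(d-1-k) / (8k+j); modular
        // exponentiation keeps every head term tiny, so this is pure CPU work
        static double series(int j, int d) {
            double s = 0.0;
            for (int k = 0; k < d; k++) {
                long m = 8L * k + j;
                long r = BigInteger.valueOf(16)
                        .modPow(BigInteger.valueOf(d - 1 - k), BigInteger.valueOf(m))
                        .longValue();
                s += (double) r / m;
                s -= Math.floor(s); // keep only the fractional part
            }
            for (int k = d; k <= d + 8; k++) { // a few rapidly vanishing tail terms
                s += Math.pow(16, d - 1 - k) / (8L * k + j);
            }
            return s - Math.floor(s);
        }

        public static void main(String[] args) {
            int d = 1; // 1-based position of the first hex digit after the point
            double x = 4 * series(1, d) - 2 * series(4, d) - series(5, d) - series(6, d);
            x -= Math.floor(x);
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 7; i++) { // extract 7 hex digits, as in the post
                x *= 16;
                int digit = (int) x;
                hex.append(Character.toUpperCase(Character.forDigit(digit, 16)));
                x -= digit;
            }
            // pi = 3.243F6A8885..., so d = 1 prints 243F6A8; double precision only
            // holds up for small d -- a run at d = 10^15 needs far more care.
            System.out.println(hex);
        }
    }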

    This validates the results calculated by the PiHex project, which took more than two years on 1,734 computers in 56 countries.

    My program was written entirely in Java and ran on Hadoop 0.20. An earlier version is checked in as a Hadoop example named BaileyBorweinPlouffe. The new code will be uploaded soon.

    -- Tsz Wo

  • Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds

    We used Apache Hadoop to compete in Jim Gray's sort benchmark, which consists of a set of related benchmarks, each with its own rules. All of the sort benchmarks measure the time to sort different numbers of 100-byte records; the first 10 bytes of each record is the key and the rest is the value. The minute sort must finish end to end in less than a minute. The Gray sort must sort more than 100 terabytes and must run for at least an hour. The best times we observed were:

    Bytes                  Nodes  Maps     Reduces  Replication  Time
    500,000,000,000        1,406    8,000    2,600            1  59 seconds
    1,000,000,000,000      1,460    8,000    2,700            1  62 seconds
    100,000,000,000,000    3,452  190,000   10,000            2  173 minutes
    1,000,000,000,000,000  3,658   80,000   20,000            2  975 minutes

    Within the rules for the 2009 Gray sort, our 500 GB sort set a new record for the minute sort, and the 100 TB sort set a new record of 0.578 TB/minute. The 1 PB sort ran after the 2009 deadline, but improved the speed to 1.03 TB/minute. The 62-second …
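
    (For scale: 100 TB in 173 minutes is 100/173 ≈ 0.578 TB/minute, and 1 PB in 975 minutes is 1000/975 ≈ 1.03 TB/minute.) To make the record format concrete, here is a small hypothetical sketch, not the benchmark code itself, of sorting 100-byte records by their 10-byte keys:

    import java.util.Arrays;
    import java.util.Random;

    // Hypothetical sketch of the benchmark's record layout: each record is 100
    // bytes, the first 10 bytes are the sort key, and the remaining 90 bytes
    // are the value carried along with it.
    public class RecordLayoutSketch {
        static final int RECORD_BYTES = 100;
        static final int KEY_BYTES = 10;

        public static void main(String[] args) {
            byte[][] records = new byte[4][RECORD_BYTES];
            Random rand = new Random(42);
            for (byte[] r : records) {
                rand.nextBytes(r);
            }
            // sort by the 10-byte key prefix, comparing bytes as unsigned values
            Arrays.sort(records, (a, b) -> {
                for (int i = 0; i < KEY_BYTES; i++) {
                    int d = (a[i] & 0xff) - (b[i] & 0xff);
                    if (d != 0) return d;
                }
                return 0;
            });
            for (byte[] r : records) {
                System.out.printf("key starts with: %02x %02x %02x...%n",
                                  r[0] & 0xff, r[1] & 0xff, r[2] & 0xff);
            }
        }
    }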

  • Hadoop Summit 2009 – Open for registration

    This year’s Hadoop Summit is confirmed for June 10th at the Santa Clara Marriott, and is now open for registration.

    We have a packed agenda with three tracks: one for developers, one for administrators, and one focused on new and innovative applications using Hadoop. The presentations include talks from Amazon, IBM, Sun, Cloudera, Facebook, HP, Microsoft, and the Yahoo! team, as well as leading universities including UC Berkeley, CMU, Cornell, U of Maryland, U of Nebraska, and SUNY.

    Based on our experience with the rush for seats last year, we encourage people to register early.

  • Using ZooKeeper to tame system test for large-scale services

    In the last release I took on the task of setting up a true system test environment for Apache ZooKeeper. Our previous environment ran the system test in a single JVM instance, which meant that there were some test scenarios we simply couldn't reproduce. In the new environment we wanted to be able to run tests across multiple hosts and deal with varying numbers of machines and cluster environments.

    My first attempt used ssh and scripts to fire off servers and clients on a cluster of machines. I soon realized that there was a significant amount of hardcoded configuration and environment assumptions that would make the setup inflexible and impractical. So I started over from scratch.

    For the system test I wanted to use some set of machines, M, to host some number of ZooKeeper servers, S, and some number of clients, C. Ideally the system test could just discover M at runtime, start up S servers on some subset of M, and then start up C clients on the rest. I realized this is a perfect …
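
    As a hypothetical sketch of that idea (paths and names invented here, not the actual harness): each test machine registers an ephemeral znode, and the harness reads the children to discover M at runtime:

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class HostRegistry implements Watcher {
        private final ZooKeeper zk;

        public HostRegistry(String connectString) throws Exception {
            zk = new ZooKeeper(connectString, 30000, this);
        }

        // each participating machine calls this once at startup; the znode is
        // ephemeral, so a dead host drops out of M automatically
        public void register(String hostName) throws Exception {
            // assumes /sysTest/hosts was created beforehand
            zk.create("/sysTest/hosts/" + hostName, new byte[0],
                      Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        }

        // the harness reads M, then assigns some hosts as servers, the rest as clients
        public List<String> discoverHosts() throws Exception {
            return zk.getChildren("/sysTest/hosts", false);
        }

        @Override
        public void process(WatchedEvent event) {
            // a fuller harness would watch for hosts joining or leaving mid-test
        }
    }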

    Ben Reed
    Yahoo!

  • Pig Training Available Online

    Yahoo! and Cloudera have worked together to produce a couple of training videos for Pig. There is Introduction to Pig, a 50-minute talk that includes copious examples of writing Pig Latin scripts, an overview of how Pig works, and a discussion of the advantages of Pig over other Hadoop interfaces. There is also a Pig tutorial, a 15-minute screencast that walks through running a series of Pig programs. Together they provide a great starting point for those new to Pig and for anyone trying to decide whether Pig is a good fit for their data processing needs.

    Alan Gates, a Pig committer and member of the Yahoo! Pig development team, delivered the training at Cloudera, as part of their Hadoop training series.

  • Hadoop at ApacheCon EU

    ApacheCon EU 2009 is in Amsterdam this week. As with the Hadoop Camp at ApacheCon US in New Orleans last year, there is a full day of Hadoop talks scheduled. Live video streaming of the Hadoop track is available for a fee from the ApacheCon EU site.

    The Hadoop track this year is on Wednesday March 25th and includes:

    * Opening Keynote - Data Management in the Cloud - Raghu Ramakrishnan
    * Introduction to Hadoop - Owen O'Malley
    * Hadoop Map-Reduce: Tuning and Debugging - Arun Murthy
    * Pig: Making Hadoop Easy - Olga Natkovich
    * Running Hadoop in the Cloud - Tom White
    * Configuring Hadoop for Grid Services - Allen Wittenauer
    * Dynamic Hadoop Clusters - Steve Loughran

    We will end the day with a birds-of-a-feather session with the Yahoo! Hadoop team.

  • Apache ZooKeeper: the making of

    After working for a long time on many different distributed systems, you learn to appreciate how complicated the littlest things can become. For example, when an application runs on a local machine, changing its configuration means clicking in a GUI or, at worst, editing a file and restarting the app. Distributed applications, however, run on many machines and need to see configuration changes and react to them. To make matters worse, machines may be temporarily down or partitioned from the network. Not only do these outages make things hard to configure, they also mean that application health is no longer a simple choice between dead and alive; you also have mostly alive, mostly dead, and the dreaded half dead. Robust distributed applications must also be able to incorporate new machines or decommission machines on the fly. Partial failures together with elastic machine ensembles mean that even the configuration of the distributed application should be dynamic. To make …
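
    A minimal sketch of the pattern ZooKeeper enables here (the /myApp/config path is assumed, not from the post): every instance watches one znode and re-reads it on change, so configuration updates reach every live machine without restarts:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class DynamicConfig implements Watcher {
        private static final String CONFIG_PATH = "/myApp/config"; // assumed path
        private final ZooKeeper zk;

        public DynamicConfig(String connectString) throws Exception {
            zk = new ZooKeeper(connectString, 30000, this);
            reload();
        }

        private void reload() throws Exception {
            // watch = true re-arms the watch; ZooKeeper watches fire only once
            byte[] data = zk.getData(CONFIG_PATH, true, null);
            apply(new String(data, "UTF-8"));
        }

        private void apply(String config) {
            System.out.println("applying config: " + config); // stand-in for real logic
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    reload();
                } catch (Exception e) {
                    e.printStackTrace(); // a robust app would retry or resync here
                }
            }
        }
    }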

  • Using Hadoop to fight spam – Part II

    Hadoop helps Yahoo! Mail reduce spam for over 300 million people. In the second part of their talk, Mark Risher and Jay Pujara, Yahoo! Mail technologists and spam cops, describe how they are using Hadoop to fight botnets.

    Mark and Jay explain how Hadoop makes it possible for Yahoo! Mail to quickly analyze huge data sets to identify where spam comes from. They describe how team members can quickly come up to speed and submit queries on the data using Pig, and how they see this effort evolving. Listen in as they describe their experiences.

