Hadoop User Group (HUG) February 2011 recap

[Photo: Crowd at the February 2011 HUG]

We had a record turnout for the February 2011 Hadoop User Group at the main Sunnyvale Yahoo! campus, with 336 people signed up. Next month and for the rest of the year, we'll be in the larger Yahoo! cafeteria across the street, which can hold up to 1,000 people. If I remember correctly, the first Hadoop Summit in 2008 drew only 400 people.

We started with an introduction and a discussion of Yahoo!'s plans to shut down its Hadoop GitHub repository and ensure that all of its work is committed to Apache Hadoop. In particular, we are committing our battle-tested releases into an Apache branch and will make Apache releases from that branch.

My introduction slides are below:

Next Generation MapReduce

Arun C. Murthy presented the plans for the next generation of Apache Hadoop MapReduce. The MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility, and hardware utilization.
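To make the split concrete, here is a minimal, hypothetical sketch of the idea: a generic scheduler that only hands out resources, and a per-job application master that decides what to run in them. The class and method names below are illustrative assumptions, not the actual Hadoop APIs.

```python
class ResourceScheduler:
    """Generic, framework-agnostic scheduler (illustrative): it tracks
    cluster resources but knows nothing about MapReduce itself."""

    def __init__(self, total_slots):
        self.free_slots = total_slots

    def allocate(self, requested):
        # Grant as many slots as are currently free, up to the request.
        granted = min(requested, self.free_slots)
        self.free_slots -= granted
        return granted

    def release(self, count):
        self.free_slots += count


class MapReduceAppMaster:
    """Per-job, user-defined component (illustrative): it turns granted
    resources into task runs. A different framework would supply its own
    master and reuse the same scheduler."""

    def __init__(self, scheduler, num_tasks):
        self.scheduler = scheduler
        self.remaining = num_tasks
        self.completed = 0

    def run(self):
        while self.remaining > 0:
            granted = self.scheduler.allocate(self.remaining)
            # Pretend each granted slot runs one task to completion,
            # then hand the slots back to the scheduler.
            self.remaining -= granted
            self.completed += granted
            self.scheduler.release(granted)
        return self.completed


scheduler = ResourceScheduler(total_slots=4)
job = MapReduceAppMaster(scheduler, num_tasks=10)
print(job.run())  # → 10
```

The point of the design is visible even in this toy: because the scheduler never mentions map or reduce tasks, other execution models can share the same cluster by bringing their own application master.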

The slides are here:

The video of his talk is here:

Next Generation Hadoop Operations at Facebook

Hadoop's traditional role as a framework for batch-oriented execution of MapReduce jobs is rapidly expanding to include many other use cases, such as HBase, Scribe, and low-latency ad hoc queries of large datasets. Downtime is becoming less acceptable, and existing MapReduce jobs continue to get larger, with tighter expectations around completion time. Storage and data retention requirements continue to grow. Quite simply, it is both an amazing and an extremely challenging time to be a Hadoop administrator.

Andrew Ryan presented the challenges Facebook's operations team faces in 2011 as it grows and manages a variety of Hadoop clusters throughout Facebook, along with the solutions the team is developing to address them. He shared their key best practices and lessons learned, and how they can be applied to any organization.

Andrew's slides are here:

The video of Andrew's presentation is here: