Hadoop at Yahoo!: More Than Ever Before

A lot changed at Yahoo! last year. We have new leaders, we gained millions in new audience*, we saw engagement gains from Social Bar, and we released several successful mobile apps such as Flickr and Yahoo! Mail. But with all that change, one thing has remained constant: our commitment to pioneering new ground for Hadoop.

I was well aware of the rich legacy behind Hadoop at Yahoo! when I started in the Cloud Engineering Group about eight months ago. What I was perhaps not fully aware of was the talent and energy of our engineering team (watch the 2 min Hadoop Summit 2012 video), a team eager to push the scale and efficiency boundaries of Hadoop to deliver tangible business results for Yahoo!. We have really come together as a customer-focused group with tight alignment on our strategy, vision, and roadmap, and with a continued commitment to stay true to the Apache Software Foundation and contribute 100% of our development work back to the community.

Hadoop at Yahoo!

In 2012, we stabilized Hadoop 0.23 (a branch very close to Hadoop 2.0, less the HDFS HA enhancements), validated hundreds of user feeds and thousands of applications, and rolled it out on tens of thousands of production nodes. The rollout is expected to complete fully in Q1 2013, and is a testament to what we stated earlier: our commitment to pioneering new ground for Hadoop. To give you an idea, we have run over 14 million jobs on YARN (Nextgen MapReduce for Apache Hadoop) and average more than 80,000 jobs per day on a single cluster on Hadoop 0.23. In addition, we made sure that other Apache projects such as Pig, Hive, Oozie, HCatalog, and HBase run on top of Hadoop 0.23. We also stood up a near real-time, scalable processing and storage infrastructure in a matter of a few weeks with MapReduce/YARN, HBase, ZooKeeper, and Storm clusters to enable the next generation of Personalization and Targeting services for Yahoo!.

As the largest Hadoop user and a major open source contributor, we have continued our commitment to the advancement of Hadoop by co-hosting Hadoop Summit 2012 and sponsoring Hadoop World + Strata Conference 2012 in New York. We continue to sponsor the monthly Bay Area Hadoop User Group meetup (HUG), one of the largest Hadoop meetups anywhere in the world, now in its fourth year at the URL’s café on our Sunnyvale campus.

Yahoo at Hadoop Summit 2012 and Hadoop World 2012

While we are focused on solving the current and future use cases for Yahoo!, we are also working on a number of exciting enhancements across various projects in the Apache Hadoop ecosystem. In 2013, we plan to stabilize and roll out Hadoop 2.0 on our grid infrastructure and introduce secure and multi-tenant HBase and Low Latency Data Processing services for our customers. We are also developing and deploying an integrated technology stack using GDM (a managed service for data lifecycle management), HCatalog, Oozie, and Cloud Messaging (a hosted, scalable messaging service for distributed applications) for building efficient large-scale data processing pipelines (across audience, search, and advertising data) on top of native Hadoop technologies. This includes a data-out interface for HCatalog that lets off-grid customers discover and download data directly from Hadoop clusters. We also plan to enable even broader access to data on our clusters through ad hoc queries issued directly from a number of BI tools for reporting and visualization purposes.

Today, we boast 6 PMC members and over 20 committers across various projects in the Apache Hadoop ecosystem. Hadoop at Yahoo! presents unique opportunities and technical challenges for our engineers to not only develop cutting-edge technologies and software but also see their work in action right away, deployed as a service at scale on the largest Hadoop footprint in the world, which processes 100B events every day from Yahoo!’s 700 million users. Nowhere else do we see such an opportunity to make an impact on the only enterprise-scale platform that we trust to run our business across the globe.

Yahoo! is committed to Hadoop more than ever before. We are excited about our 2013 plan and look forward to sharing more as we go through yet another year of innovation with Hadoop. We are growing and hiring, and encourage you to browse through open opportunities in the Hadoop group and apply. Please stop by our kiosks at Hadoop Summit Europe in March 2013 (special pricing for YDN community here) or Hadoop Summit North America in June 2013, or attend one of our monthly HUG meetups to find out more about what we are up to.

* Source: comScore, Unique Visitors WW Dec 2011: 702 million, Dec 2012: 735 million