• HCatalog, tables and metadata for Hadoop

    Last month the HCatalog project (formerly known as Howl) was accepted into the Apache Incubator. We have already branched for a 0.1 release, which we hope to push in the next few weeks. Given all this activity, I thought it would be a good time to write a post on the motivation behind HCatalog, what features it will provide, and who is working on it.

    Why Did We Create HCatalog?

    Out of the box Hadoop provides the HDFS file system for users to store their data. File systems are nice because they provide a simple interface. Users can easily copy data into the file system and run jobs against that data. However, for more complex data processing tasks, the file system abstraction is not rich enough. It forces users to know where data is located, what format it is stored in, how it is compressed, and what its schema is. Consider, for example, a Pig Latin script used to do ETL on raw web logs:

    -- the user must hardcode the data's location, storage format, and layout
    A = load '/data/raw/ds=20110225/region=us/property=news' using PigStorage();
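    With HCatalog, a script can instead refer to a table by name and filter on partition values, leaving the location, storage format, and schema to the metadata service. A minimal sketch, assuming a table named `rawlogs` has been registered in HCatalog (the table name and the `org.apache.hcatalog.pig.HCatLoader` class path are illustrative assumptions, not taken from the post above):

    -- partition columns ds/region/property are resolved from HCatalog
    -- metadata rather than encoded in a file system path
    A = load 'rawlogs' using org.apache.hcatalog.pig.HCatLoader();
    B = filter A by ds == '20110225' and region == 'us' and property == 'news';

    The point of the sketch is that if the data's location, format, or schema changes, only the metadata store needs updating; the script itself is unchanged.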

    Read More »from HCatalog, tables and metadata for Hadoop
  • Hadoop User Group meeting recap, March 2011

    More than 200 Hadoop developers and enthusiasts congregated on the Yahoo campus for the monthly HUG meeting on March 16th. As always, they were treated to some enlightening presentations in addition to good food and beverages.

    After the usual 30 minutes of socializing and networking, Milind Bhandarkar from LinkedIn kicked off the evening with an enlightening talk on "Scaling Hadoop Applications." A well-respected Hadoop expert and a founding member of the Hadoop team at Yahoo in 2005, Milind articulated the issues and solutions very succinctly. His talk was especially interesting because he tied well-known theorems and laws about scalability to the realities of today's Hadoop clusters.

    Here are the slides from Milind's talk.

    Following is the video of the presentation.

    This was followed by an interesting talk on "HDFS Federation" by Yahoo's Suresh Srinivas. HDFS Federation is a major feature slated to come out in the

    Read More »from Hadoop User Group meeting recap, March 2011
  • Hadoop Summit 2011 – Registration Now Open!

    Calling all Hadoopers

    Yahoo! is pleased to announce that this year’s Hadoop Summit is scheduled for June 29th at the Santa Clara Convention Center. Registration for the event is now open and offers an early bird special of $125, a savings of nearly 30% on the full ticket price of $175. The early bird rate ends on May 1st, so register now to take advantage of this great offer.

    Whether you are already running and managing a Hadoop installation, developing Hadoop-based applications or exploring how to adopt Apache Hadoop for your business, the summit provides a unique opportunity to gain deep insights into the world of Hadoop from the company that pioneered it. Learn about interesting and relevant real-world applications and find out about the latest Big Data research.

    The summit brings together some of the most influential speakers in the Hadoop space. Our full agenda provides many informative tracks for developers, administrators, managers and researchers. A

    Read More »from Hadoop Summit 2011: June 29th, Santa Clara Convention Center
  • Apache Hadoop Innovation Award

    The Hadoop project won the top MediaGuardian Innovation award.

    A groundbreaking open source project has won the top prize at the 2011 MediaGuardian Innovation Awards.

    The judging panel described the Apache Hadoop project as the Swiss army knife of the 21st century, with the potential to completely change the face of media innovation across the globe. Overall, the project was seen as a greater catalyst for innovation than WikiLeaks, the iPad and a host of other suggested nominees.

    All of the Hadoop contributors should be very proud of this award. Sanjay Radia, Jakob Homan, and I attended in person as members of the Hadoop Project Management Committee to receive the award on behalf of the project.

    I've been working on Hadoop full time since the beginning and it has been a pleasure working with such bright and dedicated engineers. It takes a village to raise an elephant from a prototype that runs on a few nodes to the project that is disrupting the big data industry.

    Read More »from Apache Hadoop project wins MediaGuardian Innovation award
  • Next Generation of Apache Hadoop MapReduce – The Scheduler

    ## Introduction

    The previous post in this series covered the next generation of Apache Hadoop MapReduce in a broad sense, particularly its motivation, high-level architecture, goals, requirements, and aspects of its implementation.

    In the second post in a series unpacking details of the implementation, we’d like to present the protocol for resource allocation and scheduling that drives application execution on a Next Generation Apache Hadoop MapReduce cluster.

    ## Background

    Apache Hadoop must scale reliably and transparently to handle the load of a modern, production cluster on commodity hardware. One of the most painful bottlenecks in the MapReduce framework has been the JobTracker, the daemon responsible not only for tracking and managing machine resources across the cluster, but also for enforcing the execution semantics for all the queued and running MapReduce jobs. The fundamental shift we hope to effect takes these two complex and interrelated concepts and re-factors them into

    Read More »from Next Generation of Apache Hadoop MapReduce – The Scheduler
  • I’ll Take Hadoop for $400, Alex

    See what Yahoo! and Jeopardy! have in common.

    This week, IBM’s supercomputer, Watson (named after IBM’s founder, Thomas J. Watson), took on two of the most celebrated Jeopardy! contestants of all time in an exhilarating million-dollar Jeopardy! face-off between man and machine.

    Watson defeated Jeopardy! defenders Ken Jennings and Brad Rutter, amassing $77,147 in winnings in a nail-biting three-night tournament that sparked interest around the field of artificial intelligence and data analytics.

    What you may not realize is that Yahoo! played a role in it.

    IBM's Watson depends on 200 million pages of content and 500 gigabytes of preprocessed information to answer the Jeopardy! questions. That huge catalog of documents had to be indexed so that Watson could answer questions within the 3-second time limit. On a single computer, generating that large catalog and index would take a very long time, but dividing the work onto many computers makes it much faster.

    Apache Hadoop is the industry

    Read More »from I’ll Take Hadoop for $400, Alex
  • Hadoop User Group (HUG) February 2011 recap

    We had a record turnout for the February 2011 Hadoop User Group at the main Sunnyvale Yahoo! campus, with 336 people signed up. Next month and for the rest of the year, we'll be in the larger Yahoo! cafeteria across the street that can hold up to 1000 people. If I remember correctly, the first Hadoop Summit in 2008 drew only 400 people.

    We started with an introduction and discussion of Yahoo's plans to shut down its Hadoop github repository and ensure that all of its work is committed to Apache Hadoop. In particular, we are committing our battle-tested releases into an Apache branch and will make Apache releases from the branch.

    My introduction slides are below:

    Next Generation MapReduce

    Arun C. Murthy presented the plans for the next generation of Apache Hadoop MapReduce. The MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of MapReduce that factors

    Read More »from Hadoop User Group (HUG) February 2011 recap
  • The Next Generation of Apache Hadoop MapReduce

    ## Overview

    In the Big Data business running fewer larger clusters is cheaper than running more small clusters. Larger clusters also process larger data sets and support more jobs and users.

    The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

    ## Background

    The current implementation of the Hadoop MapReduce framework is showing its age.

    Given observed trends in cluster sizes and workloads, the MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability,

    Read More »from The Next Generation of Apache Hadoop MapReduce
  • The Hadoop Map-Reduce Capacity Scheduler

    Arun C. Murthy
    Lead, Hadoop Map-Reduce Development Team, Yahoo

    This blog post describes the Capacity Scheduler,
    a pluggable MapReduce scheduler for Apache Hadoop, which allows
    multiple tenants to securely share a large cluster such that their applications
    are allocated resources in a timely manner under the constraints of allocated
    capacities.

    We have developed and deployed the
    Capacity Scheduler on over 40,000 Hadoop machines at Yahoo since 2008.
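    Queues and their shares are declared in conf/capacity-scheduler.xml. A minimal sketch with two hypothetical queues ("production" and "adhoc" are illustrative names, not from this post), using the 0.20-era property names:

    <configuration>
      <!-- guaranteed share of cluster slots for each queue, in percent -->
      <property>
        <name>mapred.capacity-scheduler.queue.production.capacity</name>
        <value>70</value>
      </property>
      <property>
        <name>mapred.capacity-scheduler.queue.adhoc.capacity</name>
        <value>30</value>
      </property>
      <!-- floor on the share any single user is guaranteed within a queue -->
      <property>
        <name>mapred.capacity-scheduler.queue.adhoc.minimum-user-limit-percent</name>
        <value>25</value>
      </property>
    </configuration>

    The queue names themselves must also be listed in mapred-site.xml (via mapred.queue.names) for the JobTracker to recognize them; capacities across queues are expressed as percentages of the cluster's slots.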


    Please note that some of the features
    described in this post are currently available only in the Apache Hadoop
    0.20-security branch
    (http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-security/)
    and we are feverishly working to port them
    to Apache Hadoop trunk as part of our

    Read More »from The Hadoop Map-Reduce Capacity Scheduler
  • Hadoop User Group (HUG) January 2011 Recap

    I have a new deadline to beat for this post. Wednesday February 16, 2011 6 pm, that's my new deadline. That’s when we hold the next HUG at the Yahoo! Sunnyvale Campus at the URL’s Café. BTW, in case you are looking for it, here’s the campus in more detail: http://www.wikimapia.org/#lat=37.4181633&lon=-122.0250607&z=18&l=0&m=b&search=yahoo

    Why is that my new deadline? Mostly because I missed my old deadline, which was a commitment to the 200 or so Hadoopers who attended the January HUG session that the presentations would be made available soon on the Hadoop blog on YDN, here: https://developer.yahoo.com/blogs/hadoop/

    If I don’t beat this new deadline, I have this picture in my mind of the same 200 Hadoopers that showed up at the January HUG showing up at the February HUG surrounding me to remind me of my commitment to share the slides from the January HUG.

    So, here are the Wednesday January 19, 2011 HUG presentations in order of appearance:

    New features in Pig 0.8

    Read More »from Hadoop User Group (HUG) January 2011 Recap
