• Using Hadoop to fight spam – Part 1

    We interviewed Mark Risher and Jay Pujara, leaders in the war against spam for Yahoo! Mail. With over 300 million users and billions of messages, looking for problems or patterns to identify spammers can be a daunting task. Mark and Jay describe how their previous approach using databases quickly ran into scalability limitations as they analyzed data aggregated over a month or more. They explain how Hadoop, with Pig and Streaming, now enables them to slice through billions of messages to isolate patterns and identify spammers. They can now create new queries and get results within minutes for problems that took hours or were considered impossible with their previous approach. Listen in as Mark and Jay describe their experiences fighting spam:

  • Hadoop User Group

    Hadoop User Group meetings have now been held in Beijing, Berlin, London, New York, San Diego and Washington DC, in addition to the Bay Area, with one in the works in Bangalore. In the Bay Area, we typically host them on the third Wednesday of each month at the Yahoo! campus in Santa Clara.

    The meeting last week featured Matei Zaharia from UC Berkeley talking about the Fair Scheduler for Hadoop. The need for a scheduler has been a known requirement for quite a while, and Matei started working on this while he was an intern at Facebook. His talk described their goals of providing fast response times for small jobs and guaranteed SLAs for production jobs. It then discussed the concept of pools, the scheduling algorithm for assigning resource capacity, and the installation, configuration and administration of the scheduler.

    This was followed by a talk from Aaron Kimball of Cloudera on Importing Data from MySQL, which discussed techniques for loading data from databases into HDFS.

  • Pig – The Road to an Efficient High-level language for Hadoop

    Pig started as a research project within Yahoo! in the summer of 2006. The original prototype quickly became very popular with users. It was clear that a higher-level language than raw map-reduce was needed to quickly roll out prototypes as well as to build production-quality applications. Early adopters within Yahoo! have reported substantial increases in productivity when they migrated from raw map-reduce to Pig.

    In the summer of 2007 a team was put together to make the project into a product. Working within an open source community was perceived as one of the important early goals of the project. Pig has been part of the open source community for over a year, joining the Apache Incubator in September of 2007. During this time Pig has developed a community of users and developers, and added two new committers. It also gained wide popularity within Yahoo!, with 30% of all Hadoop jobs using Pig, which amounts to thousands per day!

    A lot of great technical work went into the project which
  • Hadoop User Group Meeting

    In response to a number of requests from folks outside the Bay Area to have us record and post the Hadoop User Group presentations, here are the talks from the October meeting which was held this week at the Yahoo! Mission College campus.

    We had Jun Rao from IBM Almaden Research talk about “Exploiting database join techniques for analytics with Hadoop”. This was followed by an update on Jaql by Kevin Beyer from IBM, who informed us that Jaql is now available as Open Source. The last talk was a lively discussion with Sriram Rao from Quantcast about his “Experiences moving a Petabyte Data Center”.

    Bay Area Hadoop User Group meetings are usually held on the third Wednesday of each month at Yahoo! Mission College in Santa Clara.

    Ajay Anand
    Yahoo! Grid Computing

  • Hadoop Camp at ApacheCon

    Following up on the interest in the Hadoop Summit which we held a few months ago, we got together with the ApacheCon folks to arrange a Hadoop Camp at their conference this year.

    Hadoop Camp will be held on November 6th and 7th in New Orleans as part of ApacheCon this year. Along the lines of the summit, we have speakers from some of the leading companies developing on and using Hadoop, including Facebook, Amazon, IBM, Hewlett-Packard, Sun, Powerset, and Yahoo! in what is possibly the largest gathering of Hadoop committers, developers and users outside of the Bay Area.

    In addition to the Camp, there is a Hadoop tutorial on Monday, November 3rd, and we are also looking into coordinating a Hadoop “hack” contest that would run through the week at ApacheCon.

    We are looking forward to a strong turnout!

    Ajay Anand

  • Scaling Hadoop to 4000 nodes at Yahoo!

    We recently ran Hadoop on what we believe is the single largest Hadoop installation, ever:

    • 4000 nodes
    • 2 quad core Xeons @ 2.5 GHz per node
    • 4x1TB SATA disks per node
    • 8 GB RAM per node
    • 1 gigabit ethernet on each node
    • 40 nodes per rack
    • 4 gigabit ethernet uplinks from each rack to the core (unfortunately a misconfiguration; we usually use 8 uplinks)
    • Red Hat Enterprise Linux AS release 4 (Nahant Update 5)
    • Sun Java JDK 1.6.0_05-b13

    So that's well over 30,000 cores with nearly 16 PB of raw disk!
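
    As a quick sanity check on those totals, here is a minimal Python sketch that recomputes them from the per-node figures in the list. The constant and function names are ours, chosen for illustration. It assumes decimal units (1 PB = 1000 TB) and treats every node link and rack uplink as 1 gigabit; the oversubscription ratio is derived from the listed numbers rather than something that was measured.

```python
# Back-of-the-envelope totals for the 4000-node cluster described above.
# All inputs come from the hardware list; names are illustrative only.
NODES = 4000
SOCKETS_PER_NODE = 2      # "2 quad core Xeons"
CORES_PER_SOCKET = 4
DISKS_PER_NODE = 4        # "4x1TB SATA disks"
TB_PER_DISK = 1
NODES_PER_RACK = 40
NODE_LINK_GBIT = 1        # "1 gigabit ethernet on each node"


def cluster_totals(nodes=NODES):
    """Return (total cores, raw disk in decimal PB, rack count)."""
    cores = nodes * SOCKETS_PER_NODE * CORES_PER_SOCKET
    raw_pb = nodes * DISKS_PER_NODE * TB_PER_DISK / 1000
    racks = nodes // NODES_PER_RACK
    return cores, raw_pb, racks


def oversubscription(uplinks_per_rack):
    """Ratio of node bandwidth in a rack to that rack's uplink bandwidth."""
    return (NODES_PER_RACK * NODE_LINK_GBIT) / uplinks_per_rack


total_cores, raw_disk_pb, rack_count = cluster_totals()
print(f"{total_cores} cores, {raw_disk_pb} PB raw disk, {rack_count} racks")
print(f"oversubscription with 4 uplinks: {oversubscription(4)}:1")
print(f"oversubscription with 8 uplinks: {oversubscription(8)}:1")
```

    Under these assumptions, halving the uplinks from 8 to 4 doubles the rack-to-core oversubscription, which is worth keeping in mind when reading any results from this run.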

    The exercise was primarily an effort to see how Hadoop works at this scale and gauge areas for improvements as we continue to push the envelope. We ran Hadoop trunk (post Hadoop 0.18.0) for these experiments.

    Scaling has been a constant theme for Hadoop: we, at Yahoo!, ran a modestly sized Hadoop cluster of 20 nodes in early 2006; currently Yahoo! has several clusters around the 2000 node mark.


    Scaling issues have always been the main focus in designing any HDFS feature.

  • Hadoop 0.18 Highlights

    Apache Hadoop 0.18 was released on 8/22. This is the largest Hadoop release to date in terms of the number of patches committed (266). It also has the largest percentage of patches (20%) from contributors outside of Yahoo!. This is a great indicator of both the growth of the Hadoop community and its increasing involvement in the project's progress. The size of the release resulted in a very large number of blocking bugs in the code base. Unfortunately, this created a big delay between the feature freeze on 6/4 and the final release.

    Hadoop 0.18 has many improvements in the areas of performance, scalability and reliability in addition to new features. Some of the performance improvements contributed to Hadoop’s first place in the terabyte sort benchmark. Hadoop 0.18 runs the grid mix benchmark in ~45% of the time taken by Hadoop 0.15. Lots of cool new stuff in this release, some of which is briefly described below.


    Namespace auto-recovery
    The HDFS Namenode can store the

  • Apache Hadoop Wins Terabyte Sort Benchmark

    One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (Daytona) terabyte sort benchmark. The sort benchmark, which was created in 1998 by Jim Gray, specifies the input data (10 billion 100-byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open source program has won. Yahoo is both the largest user of Hadoop, with 13,000+ nodes running hundreds of thousands of jobs a month, and the largest contributor, although non-Yahoo usage and contributions are increasing rapidly.

    The cluster statistics were:

    • 910 nodes
    • 2 quad core Xeons @ 2.0 GHz per node
    • 4 SATA disks per node
    • 8 GB RAM per node
    • 1 gigabit ethernet on each node
    • 40 nodes per rack
    • 8 gigabit ethernet uplinks from each rack to the core
    • Red Hat Enterprise Linux Server Release 5.1 (kernel 2.6.18)
    • Sun Java JDK 1.6.0_05-b13

    The benchmark was run with Hadoop trunk (pre-0.18).
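
    To put the result in perspective, the figures above can be turned into throughput numbers. The following Python sketch is only illustrative arithmetic on the published values (record count, record size, node count and the two run times). The variable names are ours, decimal units are assumed, and the per-node figure is a simple average of sorted output rather than a measured I/O rate.

```python
# Rough throughput arithmetic for the terabyte sort result quoted above.
RECORDS = 10_000_000_000   # 10 billion records
RECORD_BYTES = 100         # 100 bytes each
NODES = 910
NEW_RECORD_SECONDS = 209
OLD_RECORD_SECONDS = 297

# Total input size: 10 billion * 100 bytes = 1 TB (decimal).
total_bytes = RECORDS * RECORD_BYTES

# Aggregate rate at which sorted data was produced, in GB/s.
throughput_gb_s = total_bytes / NEW_RECORD_SECONDS / 1e9

# The same rate averaged over all nodes, in MB/s.
per_node_mb_s = total_bytes / NEW_RECORD_SECONDS / NODES / 1e6

# How much faster the new record is than the previous one.
speedup = OLD_RECORD_SECONDS / NEW_RECORD_SECONDS

print(f"total input: {total_bytes / 1e12:.1f} TB")
print(f"aggregate:   {throughput_gb_s:.2f} GB/s")
print(f"per node:    {per_node_mb_s:.2f} MB/s")
print(f"speedup:     {speedup:.2f}x")
```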

  • Hadoop 0.17 Preview

    Apache Hadoop 0.17 is due for release any day now. Feature freeze for the release was on April 4th. The Hadoop dev community is currently actively fixing blocking issues discovered by users that have tried it out. This is a release we’re very excited about as it introduces many long awaited performance fixes to the platform. We’ve observed on the order of 30%(!) improvement in the runtime of some of the Hadoop benchmarks. As always, user feedback is invaluable and we urge folks to kick the tires on the release and help close it out. Here is a quick rundown of the important changes in the release.



    • Syntax cleanup of Hadoop fs shell commands

      The syntax of many Hadoop fs shell commands has been revised. The goals are syntax consistency across commands and compatibility with POSIX syntax as far as possible. For examples, see HADOOP-1677, HADOOP-1792, HADOOP-1891.
    • Block placement changes

      HDFS’ block placement strategy is now more tuned towards evenly distributing data across nodes.
  • VIM Color Syntax Highlighting for Pig

    I joined the Yahoo! Research Engineering group a few weeks ago, and I was blown away by the possibilities that Hadoop and Pig open up for me. Immediately, I wanted to hack up something good to say thank you to all the smart people who build and support such great software.

    I am convinced that Pig deserves more respect from the major text editors, so I wrote a small vim script that adds syntax highlighting for Pig files.

    You can download it from the vim.org site.

    To install, follow the instructions on the web page, and don't forget to vote! :-)

    An Emacs version is coming soon (yes, I use both vim *and* emacs). It will be my project for the upcoming Yahoo! Hack Day.

    Sergiy Matusevych
    Yahoo! Research Engineer

