• Hi Folks,

    I'm pleased to announce that after some reflection, Yahoo! has decided to discontinue "The Yahoo Distribution of Hadoop" and focus on Apache Hadoop. We plan to remove all references to a Yahoo distribution from our website (developer.yahoo.com/hadoop), close our GitHub repo (yahoo.github.com/hadoop-common) and focus on working more closely with the Apache community. Our intent is to return to helping Apache produce binary releases of Apache Hadoop that are so bulletproof that Yahoo and other production Hadoop users can run them unpatched on their clusters.

    Until Hadoop 0.20, Yahoo committers worked as release masters to produce binary Apache Hadoop releases that the entire community used on their clusters. As the community grew, we experimented with using the "Yahoo! Distribution of Hadoop" as the vehicle to share our work. Unfortunately, Apache is no longer the obvious place to go for Hadoop releases. The Yahoo! team wants to return to a world where anyone can

    Read More »from [ANNOUNCEMENT] Yahoo focusing on Apache Hadoop, discontinuing “The Yahoo Distribution of Hadoop”
  • The Backstory of Yahoo and Hadoop

    Somewhat to my surprise, I was recently asked why Yahoo has put so much into Apache Hadoop. We currently have nearly 100 people working on Apache Hadoop and related projects, such as Pig, ZooKeeper, Hive, Howl, HBase and Oozie. Over the last 5 years, we've invested nearly 300 person-years into these projects. The Hadoop team at Yahoo is so passionate about our open source mission, and we've been doing this for so long, that we tend to assume that everyone understands our position. The recent evidence to the contrary motivates this post.

    Back in January 2006, when we decided to invest in scaling Hadoop from an interesting prototype to the robust, scalable framework it is today, it was obvious that our direct competitors had built or were building private implementations of map-reduce and clustered storage. We didn't believe that this type of infrastructure would bring sustainable advantage to any one competitor: the needs of Web Search at the time were driving everyone in a similar

  • Hadoop User Group meeting recap, November 2010

    More than 100 Hadoop developers and enthusiasts congregated on the Yahoo campus for the monthly HUG meeting on November 17. As always, they were treated to some enlightening presentations in addition to good food and beverages.

    After the usual 30 minutes of socializing and networking, James Dixon, the CTO of Pentaho, kicked off the presentations with an interesting talk on "Business Intelligence for Big Data." He spoke about the current Hadoop use-cases and its limitations when it comes to traditional BI use-cases. He introduced the concept of "data lakes" and spoke in depth about Pentaho's approach for BI on Hadoop. He ended the presentation with an interesting demo.

    Here are the slides from Pentaho's talk.

    Following is the video of the presentation.

    This was followed by a talk on "Fuzzy Tables" by Ed Kohlwey from the strategy and technology consulting giant Booz Allen Hamilton. He introduced the audience to the concept of fuzzy matching and its application in the important field

  • HUG November 17: Three talks and a beer

    The Bay Area Hadoop User Group (HUG) meets this Wednesday, November 17, at 6:00 PM at the Surf Cafe, Yahoo! Building E, 701 First Avenue, in Sunnyvale, CA. We expect around 50 Hadoop fans.

    We'll start off with socializing and beer. The first talk, at 6:30, is "Business Intelligence for Big Data," which will delve into the strengths and weaknesses of Hadoop for data transformation and reporting. Also on tap are examples of code-free data transformations in Hadoop and how to create a Hadoop/Hive/Datamart stack sans coding.

    At 7:00, we'll go into "Using Hadoop for Indexing for Biometric Data, High Resolution Images, Voice/Audio Clips, and Video Clips" and Fuzzy Table. Fuzzy Table is a distributed, low latency, fuzzy-matching database built over Hadoop that enables fast fuzzy searching of content that cannot be easily indexed or ordered.

    At 7:30, Ramkumar Vadali and Scott Chen of Facebook will talk about how HDFS RAID helps save disk space by reducing the number of replicas created for blocks.

  • October HUG Recap

    Thanks to the 125 or so developers who came to Yahoo! recently for our monthly Hadoop User Group meeting, in spite of the Giants' World Series baseball game. The conversations continued before and long after the formal sessions.

    The event started with a presentation on state-of-the-art productivity tools for developers and analysts by Ben “Shevek” Mankin, from Karmasphere. Shevek talked about the challenges developers face when interacting with Hadoop programs — MapReduce, Hive, and so on. He demonstrated how to debug big data jobs on Hadoop from your desktop using familiar graphical interfaces. The graphical IDE certainly improves engineering productivity and accelerates the Hadoop job development process.

    Next we had Marc Limotte talk about how Cascalog fills a need as an internal domain-specific language for map/reduce jobs. Marc talked about how Cascalog differs from other options for creating map/reduce programs and the advantages of this option.

    Finally, we are always

  • September HUG Recap

    Thanks to the roughly 200 developers who came to Yahoo! recently for our monthly Hadoop User Group meeting. The energy in the packed room was phenomenal, and conversations continued long after the formal sessions.

    The event started with Chris Riccomini talking about Pig at LinkedIn. It was great to get a firsthand view of how industry is leveraging the power of Pig to solve its data processing problems. Chris covered how Pig is an integral part of data analytics at LinkedIn and showed how it is used to design, develop, and deliver data products there. He explored a successful example of Pig deployment at LinkedIn, pain points, and integration with Azkaban, Voldemort, Hadoop, and the rest of LinkedIn’s ecosystem. Chris also covered the most frequent gotchas and lessons learned, and then concluded with some of his thoughts on the evolution of Pig. The talk generated many interesting questions.

    Next we had Dhruba Borthakur and Dmytro Molkov from Facebook talking about

  • Yahoo! Cloud Virtual Machine Appliance

    At Yahoo!, we recently implemented a stronger notion of security for the Hadoop platform, based on Kerberos as the underlying authentication system. We also successfully enabled this feature within Yahoo! on our internal data processing clusters. I am sure many Hadoop developers and enterprise users are looking forward to getting hands-on experience with this enterprise-class Hadoop Security feature.

    In the past, we've helped developers and users get started with Hadoop by hosting a comprehensive Hadoop tutorial on YDN, along with a pre-configured single node Hadoop (0.18.0) Virtual Machine appliance.

    This time, we decided to upgrade this Hadoop VM with a pre-configured single node Hadoop 0.20.S cluster, along with required Kerberos system components. We have also included Pig (version 0.7.0), a high level SQL-like data processing language used at Yahoo!, and Oozie (version 2.2.0), an open-source workflow solution to manage and coordinate jobs running on Hadoop, including HDFS, Pig, and


