
Hadoop and Distributed Computing at Yahoo!: November 2007 Archives

Grid Computing Archive

November 30, 2007

Pig into Incubation at the Apache Software Foundation

A few weeks ago, a project called Pig went into incubation at the Apache Software Foundation.

Since you're probably scratching your head about what that sentence means, let me break it down for you. Pig is a project that began in Yahoo! Research and we're building an open source community to further develop it via the Apache Software Foundation (ASF). Right now it's in the initial phases of becoming a full-fledged project under the ASF umbrella. That's commonly referred to as incubation, since it is hosted by the Apache Incubator. If you'd like more details, check out the Pig Proposal on the Incubator wiki.

The Incubator project is the entry path into The Apache Software Foundation (ASF) for projects and codebases wishing to become part of the Foundation's efforts. All code donations from external organisations and existing external projects wishing to join Apache enter through the Incubator.

Great. So what's this Pig thing all about? I asked that question of Olga Natkovich, one of the Pig developers here at Yahoo.

Pig is a high-level language (PigLatin) for data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

In my mind, Pig is to Hadoop as SQL is to relational databases. It's the language and logic that'll open up access to a much wider audience of people: anyone who can write a query. Today you usually need to sit down and write code to make use of the results from processing data on a Hadoop cluster. By building a robust query layer on top of Hadoop, the barrier gets quite a bit lower.
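To make that comparison a little more concrete, here's a rough sketch of the kind of code you'd otherwise write by hand. It's plain Python, not Pig Latin and not Hadoop's actual API, and the sample data and function names (VIEWS, map_phase, shuffle, reduce_phase, count_views) are made up for illustration. It counts page views per URL in the map, group, and reduce style, which is roughly what a single GROUP BY with a COUNT would say in a query language.

```python
from collections import defaultdict

# Hypothetical sample records: (user, url) page views.
VIEWS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/search"),
    ("carol", "/home"),
]


def map_phase(records):
    """Emit a (url, 1) pair per record; each record is handled independently."""
    for _user, url in records:
        yield url, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key; each key can be reduced independently."""
    return {key: sum(values) for key, values in groups.items()}


def count_views(records):
    """Run the three steps in sequence on a single machine."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    print(count_views(VIEWS))
```

Because the map step looks at one record at a time and the reduce step looks at one key at a time, both can be spread across many machines. That's the property the quote above is pointing at, and a query layer lets you ask for the same result without writing these functions yourself.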

See Also: Yahoo Pig and Google Sawzall (Greg Linden)

Jeremy Zawodny
Yahoo! Developer Network

Posted by jzawodn at 8:15 AM | Comments (0)

November 14, 2007

Welcome to the YDN Hadoop & Distributed Computing Blog

Back in July, I wrote "Open Source Distributed Computing: Yahoo's Hadoop Support" to highlight the work that Yahoo has been doing with the Apache Hadoop distributed computing project. Since then, interest and activity around the project have grown, making it clear that people want to know more about it all. Folks like Tim O'Reilly took notice.

To make things a bit easier, we decided to start this Hadoop and Distributed Computing blog on the Yahoo! Developer Network as a place to write about Hadoop and our distributed computing work on a more regular basis. There's a lot going on (we already have a backlog of posts!) and we're anxious to get the word out.

A couple weeks ago, I sat down with Eric Baldeschwieler ("Eric14") as part of our Experts@Work series to learn more about Yahoo's involvement in Hadoop and the growing interest in distributed computing technology.


That seemed like a fitting way to kick off this new blog.

I'm looking forward to spending more time with the Hadoop team here and exploring the potential of open source software powered by thousands of commodity servers. Our recent announcement with CMU is just the tip of the iceberg.

Jeremy Zawodny
Yahoo! Developer Network

Posted by jzawodn at 11:05 AM | Comments (1)

Copyright © 2008 Yahoo! Inc. All rights reserved.
