<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
   <channel>
      <title>Hadoop and Distributed Computing at Yahoo!</title>
      <link>http://developer.yahoo.com/blogs/hadoop/</link>
      <description>News and information about Hadoop and related distributed computing work going on at Yahoo!</description>
      <language>en</language>
      <copyright>Copyright 2008</copyright>
      <lastBuildDate>Mon, 28 Apr 2008 14:13:58 -0800</lastBuildDate>
      <generator>http://www.sixapart.com/movabletype/</generator>
      <docs>http://blogs.law.harvard.edu/tech/rss</docs> 

            <item>
         <title>Hadoop 0.17 Preview</title>
         <description><![CDATA[Apache Hadoop 0.17 is due for release any day now. Feature freeze for the release was on April 4th. The Hadoop dev community is actively fixing blocking issues discovered by users who have tried it out. This is a release we’re very excited about, as it introduces many long-awaited performance fixes to the platform. We’ve observed on the order of 30%(!) improvement in the runtime of some of the Hadoop benchmarks. As always, user feedback is invaluable, and we urge folks to kick the tires on the release and help close it out. Here is a quick rundown of the important changes in the release.

<h4>HDFS</h4>

<p>&nbsp;</p>

<ul>
<li><em>Syntax cleanup of Hadoop fs shell commands</em><br/>
The syntax of many Hadoop fs shell commands has been revised. The goals are syntax consistency across commands and compatibility with POSIX syntax as far as possible. For examples, see <a href="http://issues.apache.org/jira/browse/HADOOP-1677">HADOOP-1677</a>, <a href="http://issues.apache.org/jira/browse/HADOOP-1792">HADOOP-1792</a>, and <a href="http://issues.apache.org/jira/browse/HADOOP-1891">HADOOP-1891</a>.</li>
<li><em>Block placement changes</em><br/>
HDFS’ block placement strategy is now more tuned towards evenly distributing data across nodes. When a block is written from a node that is not running a Datanode, the first replica is written to a node on the same switch as the writer (instead of being written to the same node as the writer). The remaining two replicas are written to two nodes on a different switch. This strategy produces substantially better distributions in cases where data is loaded into HDFS from a small number of machines. More details at <a href="http://issues.apache.org/jira/browse/HADOOP-2559">HADOOP-2559</a>.</li>
<li><em>Append... getting closer</em><br/>
Everyone’s favorite bug, <a href="http://issues.apache.org/jira/browse/HADOOP-1700">HADOOP-1700</a>, is still not closed, but a lot of progress has been made in this release. Append related work will make possible new flush() and sync() methods in the HDFS interface. The semantics are familiar. The flush() call plonks data into the ‘system buffers’ and returns immediately. The sync() call writes the data to the system, and returns only when it has hit disk.</li>
<li><em>More efficient replication</em><br/>
Losing a rack of nodes usually means that the Namenode has to replicate around half a million blocks, and quickly. This would cause Namenode responsiveness to degrade and would increase the risk of data loss. A faster replication scheduling algorithm in the Namenode enables it to maintain quality of service to clients and replicate data faster in the event of losing many nodes at once.</li>
</ul>
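
For readers less familiar with the flush() versus sync() distinction mentioned in the append item above, here is a minimal sketch of the analogous semantics on a local file in Python. It is only an analogy, not the HDFS API: the helper name and file name are hypothetical, and Python's flush() and os.fsync() merely stand in for the planned HDFS calls.

```python
import os
import tempfile

def write_durably(path, data):
    """Write data and force it to disk before returning (hypothetical helper)."""
    with open(path, "wb") as f:
        f.write(data)
        # flush(): hand the bytes to the operating system's buffers and return.
        f.flush()
        # fsync(): block until the OS reports the bytes are on stable storage.
        os.fsync(f.fileno())
    return os.path.getsize(path)

if __name__ == "__main__":
    # Illustrative file name inside a throwaway temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        print(write_durably(os.path.join(tmp, "example.log"), b"hello hdfs\n"))
```

Skipping the fsync step is faster, but data may still be sitting in memory if the machine loses power.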

<h4>Map/Reduce</h4>

<p>&nbsp;</p>

<ul>
<li><em>Switch awareness</em><br/>
Hadoop map/reduce is now capable of switch aware task placement. The framework attempts to place tasks on machines where their input data resides. In many cases (in particular with HoD), machines with input data are not available for running tasks, but machines that share a switch with such ‘input nodes’ are. Hadoop will now attempt to place tasks on machines that are switch local to input when machines with input data are unavailable.</li>
<li><em>Faster task scheduling</em><br/>
<a href="http://issues.apache.org/jira/browse/HADOOP-2119">HADOOP-2119</a> removes many inefficiencies in task placement and scheduling logic. The JobTracker would perform linear scans of the list of submitted tasks in cases where it did not find an obvious candidate task for a node. With better data structures for managing job state, all task placement operations now run in constant time.</li>
<li><em>Sort and shuffle improvements</em><br/>
A couple of significant improvements to sort and shuffle are included in the form of <a href="http://issues.apache.org/jira/browse/HADOOP-910">HADOOP-910</a> and <a href="http://issues.apache.org/jira/browse/HADOOP-2919">HADOOP-2919</a>. <a href="http://issues.apache.org/jira/browse/HADOOP-910">HADOOP-910</a> has reducers performing merges of shuffle data (both in memory and on disk) while fetching map outputs. <a href="http://issues.apache.org/jira/browse/HADOOP-2919">HADOOP-2919</a> improves memory management in sort on the map side, substantially decreases setup cost for the sort and uses quicksort instead of mergesort as the sorting algorithm.</li>
</ul>
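
To make the placement order described above concrete (node-local first, then switch-local, then anything left), here is a small Python sketch. It is not the code from HADOOP-2119 or the JobTracker: the TaskPool class, the host-to-switch map, and the task names are all assumptions made for illustration. It indexes pending tasks by host and by switch so that picking a task does not require scanning the full task list.

```python
from collections import defaultdict

class TaskPool:
    """Pending tasks indexed by the hosts and switches that hold their input."""

    def __init__(self, switch_of):
        self.switch_of = switch_of          # host -> switch (assumed topology map)
        self.by_host = defaultdict(set)     # host -> tasks with input on that host
        self.by_switch = defaultdict(set)   # switch -> tasks with input behind it
        self.pending = set()                # every task not yet handed out

    def add(self, task, input_hosts):
        self.pending.add(task)
        for host in input_hosts:
            self.by_host[host].add(task)
            self.by_switch[self.switch_of[host]].add(task)

    def _take(self, candidates):
        # Lazily skip tasks that were already handed out through another index.
        while candidates:
            task = candidates.pop()
            if task in self.pending:
                self.pending.remove(task)
                return task
        return None

    def pick(self, host):
        """Prefer node-local, then switch-local, then any pending task."""
        task = self._take(self.by_host[host])
        if task is None:
            task = self._take(self.by_switch[self.switch_of[host]])
        if task is None and self.pending:
            task = self.pending.pop()
        return task

if __name__ == "__main__":
    pool = TaskPool({"a1": "sw-a", "a2": "sw-a", "b1": "sw-b"})
    pool.add("map-0", ["a1"])
    pool.add("map-1", ["b1"])
    print(pool.pick("a2"))  # no local input on a2, so a switch-local task
    print(pool.pick("a2"))  # nothing left on the switch, so any pending task
```

Stale entries are removed lazily when they are encountered, which keeps the bookkeeping simple at the cost of occasionally popping a task that has already been assigned.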

Sameer Paranjpye
Yahoo! Grid Computing Team]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/04/hadoop_017_preview.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/04/hadoop_017_preview.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
        
         <pubDate>Mon, 28 Apr 2008 14:13:58 -0800</pubDate>
      </item>
            <item>
         <title>VIM Color Syntax Highlighting for Pig</title>
         <description><![CDATA[I joined the Yahoo! Research Engineering group a few weeks ago, and I was blown away by the possibilities that <a href="http://hadoop.apache.org/core/">Hadoop</a> and <a href="http://research.yahoo.com/node/90">Pig</a> open up for me. Immediately, I wanted to hack up something good to say thank you to all the smart people who build and support such great software.

I am convinced that Pig deserves more respect from the major text editors, so I wrote a small vim script that adds syntax highlighting for Pig files.

<a href="http://us.dl1.yimg.com/download.yahoo.com/dl/ydn/vim-pig.png"><img width="350" height="289" src="http://us.dl1.yimg.com/download.yahoo.com/dl/ydn/vim-pig-sm.png" alt="pig in vim" border="0" /></a>

You can <a href="http://www.vim.org/scripts/script.php?script_id=2186">download it from the vim.org site</a>.

To install, follow the instructions on the web page, and don't forget to vote! :-)

An Emacs version is coming soon (yes, I use both vim *and* emacs). It will be my project for the upcoming Yahoo! Hack Day.

Sergiy Matusevych
Yahoo! Research Engineer]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/04/vim_color_syntax_highlighting.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/04/vim_color_syntax_highlighting.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
        
         <pubDate>Fri, 25 Apr 2008 10:15:48 -0800</pubDate>
      </item>
            <item>
         <title>Hadoop Summit Slides and Video Available</title>
         <description><![CDATA[It's been a few weeks since the Hadoop Summit in Santa Clara, and we hope everyone had a good time and learned a lot.  Feedback has been quite good so far, but don't be shy about <a href="mailto:hadoop-summit@yahoo-inc.com">sending us comments</a>.

The Yahoo! Research team has assembled a single page containing <a href="http://research.yahoo.com/node/2104">links to all the presentation slides and video</a> from both the Hadoop Summit and the Data Intensive Computing Symposium.

As a sample, here's the opening presentation that Doug and Eric gave:

<embed src="http://cosmos.bcst.yahoo.com/up/fop/embedflv/swf/fop.swf" flashVars="id=7395540&amp;postpanelEnable=0&amp;prepanelEnable=0&amp;infopanelEnable=0&amp;carouselEnable=0" width="320" height="240" type="application/x-shockwave-flash"></embed>

<strong>Update:</strong> Videos are currently unavailable outside of Yahoo!  We're working on the problem...]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/04/hadoop_summit_slides_and_video.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/04/hadoop_summit_slides_and_video.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
        
         <pubDate>Fri, 18 Apr 2008 13:46:27 -0800</pubDate>
      </item>
            <item>
         <title>More Hadoop Summit Seats Available!  New Venue too.</title>
         <description><![CDATA[To say that we've been surprised by the interest in attending the <a href="http://developer.yahoo.com/hadoop/summit/">Hadoop Summit</a> would be an understatement.  We already <a href="http://developer.yahoo.com/blogs/hadoop/2008/02/hadoop_summit_nearly_full_well.html">expanded the capacity once</a> and that filled up in a matter of hours.  And that pretty much maxed out the event budget and parking too.

So last week when our friends at <a href="http://aws.typepad.com/">Amazon Web Services</a> got in touch to see if they could help, we started working on a plan to make the event even larger while still keeping it free.  Before long, we'd hatched a plan that involved moving off-site to a nearby venue, more food, more T-shirts, and some minor schedule tweaking.

But most importantly, we have room for about 75 more people!

As of now, we've increased the capacity of the event <a href="http://upcoming.yahoo.com/event/436226/">on Upcoming.org</a>.  If you've been watching and waiting to get on the list of attendees, <a href="http://upcoming.yahoo.com/event/436226/">the time is now</a>.

The new venue is the <a href="http://www.networkmeetingcenter.com/">Network Meeting Center</a> which is <a href="http://www.networkmeetingcenter.com/location/">located</a> in Santa Clara, California.

Thanks to Amazon.com for buying everyone lunch during the summit. :-)

We'll be updating the agenda soon to include Jinesh from Amazon who will discuss GrepTheWeb - Hadoop on AWS.  As you may know, Amazon has many customers <a href="http://aws.typepad.com/aws/2008/02/taking-massive.html">running Hadoop on EC2</a>.

If you cannot attend, we're still planning to record all the talks and put them on-line within a week after the summit date.

See you at the summit!

<strong>See Also:</strong> <a href="http://aws.typepad.com/aws/2008/03/hadoop-summit-a.html">Hadoop Summit also scaled on-demand!</a> on the Amazon Web Services blog.

<a href="mailto:jzawodn@yahoo-inc.com">Jeremy Zawodny</a>
Yahoo! Developer Network]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/03/hadoop-summit-move.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/03/hadoop-summit-move.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
        
         <pubDate>Wed, 12 Mar 2008 10:57:36 -0800</pubDate>
      </item>
            <item>
         <title>An Introduction to ZooKeeper Video</title>
         <description><![CDATA[A few weeks ago, I had the chance to capture video of a presentation given by <a href="http://research.yahoo.com/Benjamin_Reed">Benjamin Reed</a> from Yahoo! Research.  His presentation was an introduction to <a href="http://sourceforge.net/projects/zookeeper">ZooKeeper</a>, a highly available and reliable coordination system built by <a href="http://research.yahoo.com/">Yahoo! Research</a> and released under the <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License, Version 2.0</a>.

While preparing to post the video, I asked Ben for a summary of the motivations for building ZooKeeper.  Here's what he had to say:

<blockquote>In 2006 we were building distributed applications that needed a master, aka coordinator, aka controller to manage the sub processes of the applications. It was a scenario that we had encountered before and something that we saw repeated over and over again inside and outside of Yahoo!.</blockquote>

<blockquote>For example, we have an application that consists of a bunch of processes. Each process needs to be aware of other processes in the system. The processes need to know how requests are partitioned among the processes. They need to be aware of configuration changes and failures. Generally an application specific central control process manages these needs, but generally these control programs are specific to applications and thus represent a recurring development cost for each distributed application. Because each control program is rewritten it doesn't get the investment of development time to become truly robust, making it an unreliable single point of failure.</blockquote>

<blockquote>We developed ZooKeeper to be a generic coordination service that can be used in a variety of applications. The API consists of less than a dozen functions and mimics the familiar file system API. Because it is used by many applications we can spend time making it robust and resilient to server failures. We also designed it to have good performance so that it can be used extensively by applications to do fine grained coordination.</blockquote>

<blockquote>We have found ZooKeeper to be applicable to many distributed applications inside of Yahoo! and expect it to be applicable to many more outside of Yahoo! For that reason we released it as open source under the Apache license. If you are writing a distributed application, ZooKeeper can help.</blockquote>

And here's the video...

<embed src="http://cosmos.bcst.yahoo.com/up/fop/embedflv/swf/fop_wrapper.swf?sv=0&amp;id=6656185&amp;autoStart=0&amp;infoEnable=1&amp;shareEnable=1&amp;prepanelEnable=1&amp;carouselEnable=0&amp;postpanelEnable=1" width="420" height="350" type="application/x-shockwave-flash"></embed>

<img src="http://us.i1.yimg.com/us.yimg.com/i/nt/ic/ut/bsc/vidcam12_1.gif" border="0" hspace="10"><a href="http://us.dl1.yimg.com/download.yahoo.com/dl/ydn/zookeeper.m4v">download (m4v)</a>

A <a href="http://developer.yahoo.net/pdfs/hadoop/zookeeper.pdf">PDF copy of the slides</a> is available too.

While filming his one-hour presentation, I found myself really wishing that ZooKeeper had been available 6 or 7 years ago when I was struggling with how to perform distributed processing of news feeds for <a href="http://finance.yahoo.com/">Yahoo! Finance</a>.  ZooKeeper is clearly a more elegant solution than the hack we put together!

Ben will be speaking about ZooKeeper later this month at the <a href="http://developer.yahoo.com/hadoop/summit/">Hadoop Summit</a>.

More videos are available on <a href="http://developer.yahoo.com/blogs/theater/">YDN Theater</a>.

<a href="mailto:jzawodn@yahoo-inc.com">Jeremy Zawodny</a>
Yahoo! Developer Network]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/03/intro-to-zookeeper-video.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/03/intro-to-zookeeper-video.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">misc</category>
        
        
         <pubDate>Fri, 07 Mar 2008 14:52:01 -0800</pubDate>
      </item>
            <item>
         <title>Hadoop Summit Nearly Full!  We&apos;ll Webcast and Post Video...</title>
         <description><![CDATA[We've been quite impressed by the overwhelming interest in the upcoming <a href="http://developer.yahoo.com/hadoop/summit/">Hadoop Summit</a> in late March.  Over 150 people have marked themselves as attending or watching on <a href="http://upcoming.yahoo.com/event/436226/">Upcoming.org</a>.

In fact, we hit our attendance cap a mere 36 hours after announcing the event.  So there's clearly <em>a lot</em> of interest in the program.  As a result, <strong>we're opening up more slots today.</strong>

We're going to do our best to make as much of the proceedings available on-line as quickly as possible.  We'd already planned to make high quality video recordings of all the talks available on <a href="http://developer.yahoo.com/blogs/theater/">YDN Theater</a> a week or so after the event.

Now we're also looking at ways to open it up <em>during the event</em> as well.  Expect to see a live webcast and a live chat room so that you can ask questions and chat with those who are physically in Sunnyvale, California for the summit.  Once all the details are nailed down, we'll post on the <a href="http://developer.yahoo.com/blogs/hadoop/">Hadoop blog</a> and on the <a href="http://developer.yahoo.com/hadoop/summit/">Hadoop Summit site</a> as well.

We could use IRC, <a href="http://live.yahoo.com/">Yahoo! Live</a>, or other options.  If you have an opinion, drop a note in the comments.

Thanks for all the interest!  This is an exciting time for Hadoop.

<a href="mailto:jzawodn@yahoo-inc.com">Jeremy Zawodny</a>
Yahoo! Developer Network]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/02/hadoop_summit_nearly_full_well.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/02/hadoop_summit_nearly_full_well.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
        
         <pubDate>Thu, 28 Feb 2008 12:10:53 -0800</pubDate>
      </item>
            <item>
         <title>Upcoming HBase User Group Meeting In San Francisco</title>
         <description><![CDATA[This is just a quick heads-up to let everyone know that the folks at Powerset are <a href="http://upcoming.yahoo.com/event/438056/">hosting a meeting</a> focused on <a href="http://hadoop.apache.org/hbase/">HBase</a> in early March:

<blockquote>Powerset is hosting the first user group meeting for HBase, a robust, scalable, distributed, column-oriented store capable of hosting billions of rows of sparse, structured data. HBase is open source and part of the Hadoop project.</blockquote>

<blockquote>Definitely come if you're a current user of HBase; or if your company has plans for a huge data store and you're evaluating solutions.</blockquote>

See the event <a href="http://upcoming.yahoo.com/event/438056/">on upcoming.org</a> to sign up.

<a href="mailto:jzawodn@yahoo-inc.com">Jeremy Zawodny</a>
Yahoo! Developer Network]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/02/upcoming_hbase_user_group_meet.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/02/upcoming_hbase_user_group_meet.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
        
         <pubDate>Wed, 27 Feb 2008 15:25:43 -0800</pubDate>
      </item>
            <item>
         <title>Mailtrust Hadoop Talk in Virginia on Monday</title>
         <description><![CDATA[Just a quick heads up to Hadoop fans in the Virginia area.  <a href="http://billboebel.typepad.com/">Bill Boebel</a>, CTO of <a href="http://www.mailtrust.com/">Mailtrust</a>, will be giving a <a href="http://blog.racklabs.com/?p=67">MapReduce vs. SQL Talk on Monday the 25th</a>.  (Mailtrust is the email division of <a href="http://www.rackspace.com/">Rackspace</a>, a large hosting provider.)

Stu Hood, one of Mailtrust's software engineers wrote about <a href="http://blog.racklabs.com/?p=66">MapReduce at Rackspace</a> back in January, detailing how they use <a href="http://hadoop.apache.org/core/">Hadoop</a> for processing "several hundred gigabytes of email log data" every day.

<blockquote>The way it works is that raw logs get streamed from hundreds of mail servers to the Hadoop Distributed File System (”HDFS”) in real time, and scheduled MapReduce jobs run to index the new data using Apache Lucene and Solr. Once the indexes have been built, they are compressed and stored away in HDFS. Each Hadoop datanode also runs a Tomcat servlet container, which hosts a number of Solr instances that pull and merge the new indexes, and provide really fast search results to our support team.</blockquote>

<blockquote>Additionally, using MapReduce we are now able to look at our log data in all sorts of interesting ways. For example, we run nightly MapReduce jobs to collect statistics about our mail system, such as spam counts by domain, bytes transferred and number of logins. Now whenever we think of a complex question about our customers’ usage patterns, we can pull the answer from our logs within hours via MapReduce. This is powerful stuff.</blockquote>

Read the whole posting for some interesting email stats they extracted.

Bill's talk should provide an excellent overview of Hadoop and some good insight into the Rackspace deployment.

<a href="mailto:jzawodn@yahoo-inc.com">Jeremy Zawodny</a>
Yahoo! Developer Network]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/02/mailtrust_hadoop_talk_in_virgi.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/02/mailtrust_hadoop_talk_in_virgi.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
        
         <pubDate>Fri, 22 Feb 2008 11:11:22 -0800</pubDate>
      </item>
            <item>
         <title>Announcing the Hadoop Summit at Yahoo, March 25th, 2008</title>
         <description><![CDATA[With all the growing interest in <a href="http://hadoop.apache.org/core/">Hadoop</a> (especially after <a href="http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html">yesterday's news</a>), it seems like a fitting time to mention <a href="http://developer.yahoo.com/hadoop/summit/">the Hadoop Summit we're hosting on March 25th</a>:

<blockquote>Yahoo! is hosting the first summit on Apache Hadoop on March 25th in Sunnyvale. The summit is sponsored by the Computing Community Consortium (CCC) and brings together leaders from the Hadoop developer and user communities. Hadoop is now being used by companies in production environments, by academic and industrial research groups, and at universities for teaching data parallel computing. The speakers will cover topics in the areas of extensions being developed for Hadoop, case studies of applications being built and deployed on Hadoop, and a discussion on future directions for the platform.</blockquote>

The lineup of speakers is a who's who of the Hadoop community and they all have some great experience to share.

Sign up over <a href="http://upcoming.yahoo.com/event/436226/">on Upcoming.org</a>.

Jeremy Zawodny
Yahoo! Developer Network]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/02/announcing-the-hadoop-summit.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/02/announcing-the-hadoop-summit.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
        
         <pubDate>Wed, 20 Feb 2008 11:29:50 -0800</pubDate>
      </item>
            <item>
         <title>Yahoo! Launches World&apos;s Largest Hadoop Production Application</title>
         <description><![CDATA[Yahoo! recently launched what we believe is the world's largest <a href="http://hadoop.apache.org/">Apache Hadoop</a> production application.  The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.

The Webmap build starts with every Web page crawled by Yahoo! and produces a database of all known Web pages and sites on the internet and a vast array of data about every page and site.  This derived data feeds the Machine Learned Ranking algorithms at the heart of Yahoo! Search.

Some Webmap size data:

<ul>
<li>Number of links between pages in the index: <strong>roughly 1 trillion links</strong></li>
<li>Size of output: <strong>over 300 TB, compressed!</strong></li>
<li>Number of cores used to run a single Map-Reduce job: <strong>over 10,000</strong></li>
<li>Raw disk used in the production cluster: <strong>over 5 Petabytes</strong></li>
</ul>

This process is not new (see the AltaVista connectivity server). What is new is the use of Hadoop.  Hadoop has allowed us to run the identical processing we ran pre-Hadoop on the same cluster in 66% of the time our previous system took.  It does that while simplifying administration.  Further, we believe that as we continue to scale up Hadoop, we will be able to scale up our production jobs as needed to larger cluster sizes.

Our team is very excited about the deployment of the Yahoo! Webmap on Hadoop because it demonstrates that although Hadoop is still at a very early stage in its development (perhaps even immature), Hadoop is now capable of handling truly Internet scale projects in a cost effective manner.  This and a number of other production system deployments in Yahoo! and other organizations demonstrate that Hadoop is gaining traction in the market and adding real value.

The Yahoo! Grid team has been enhancing and using Hadoop for various research and development tasks <a href="http://developer.yahoo.net/blog/archives/2007/07/yahoo-hadoop.html">since March 2006</a>.  We are proud of our role in taking Hadoop from a system that worked on dozens of computers two years ago, to a system that runs on thousands of computers today.  The Webmap launch demonstrates the power of Hadoop to solve truly Internet-sized problems and to function reliably in a large scale production setting.  We can now say that the results generated by the billions of Web search queries run at Yahoo! every month depend to a large degree on data produced by Hadoop clusters.

<embed src="http://cosmos.bcst.yahoo.com/up/fop/embedflv/swf/fop_wrapper.swf?sv=0&amp;id=6418984&amp;autoStart=0&amp;infoEnable=1&amp;shareEnable=1&amp;prepanelEnable=1&amp;carouselEnable=0&amp;postpanelEnable=1" width="400" height="300" type="application/x-shockwave-flash"></embed>

For more details about Yahoo!'s Webmap project and the work that has gone into scaling Hadoop to support it, see an interview with two long-time colleagues of mine, Arnab Bhattacharjee (manager of the Yahoo! Webmap Team) and Sameer Paranjpye (manager of our Hadoop development), embedded above.

Eric Baldeschwieler
Senior Director, Grid Computing
Yahoo!  Inc.]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html</guid>
        
        
         <pubDate>Tue, 19 Feb 2008 07:13:41 -0800</pubDate>
      </item>
            <item>
         <title>Hadoop Mention on Yahoo! Earnings Call</title>
         <description><![CDATA[In her prepared remarks during last week's quarterly earnings call, Yahoo! President Sue Decker said the following:

<blockquote>Although we haven’t elaborated on this much publicly, on the back end we have made a major investment in open source development of grid computing which provides a substantially greater scalability at fast iteration on core technologies. This is already dramatically impacting our competitiveness in algorithmic search and advertising.</blockquote>

<blockquote>For example, in some cases we have an order of magnitude 10x improvement in indexing speed. This has been a multi-year project and we’re on track to have our future Search and advertising systems built on the new infrastructure, positioning us well for acceleration in iteration and experiments that are likely to lead to significant future product enhancements.</blockquote>

In other words, we're using Hadoop more and more internally to dramatically improve some of our operations.  We'll be talking a lot more about that in the coming weeks, along with introducing some key members of the grid team here at Yahoo!

Stay tuned...

<a href="mailto:jzawodn@yahoo-inc.com">Jeremy Zawodny</a>
Yahoo! Developer Network]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/02/hadoop_mention_on_yahoo_earnin.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/02/hadoop_mention_on_yahoo_earnin.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">misc</category>
        
        
         <pubDate>Mon, 04 Feb 2008 17:49:41 -0800</pubDate>
      </item>
            <item>
         <title>Hadoop is an Apache top level project (TLP)</title>
         <description><![CDATA[Just a quick note before the weekend hits (at least here in rainy California)...  A bit over a week ago, the <a href="http://hadoop.apache.org/core/">Hadoop project</a> moved from <strike>the <a href="http://incubator.apache.org/">Apache Incubator</a></strike> being a <a href="http://lucene.apache.org/">Lucene</a> sub-project to being a full-blown top level project under the Apache Software Foundation.

Its new home on the web is here: <a href="http://hadoop.apache.org/core/">http://hadoop.apache.org/core/</a>

Congrats to the Hadoop development team--the project has grown a lot in the last year and shows no sign of slowing down!

See Also:

<ul>
<li><a href="http://problemsworthyofattack.blogspot.com/2008/01/hadoop-is-now-apache-top-level-project.html">Hadoop is now an Apache Top Level Project</a> (Tom White)</li>
<li><a href="http://www.jaxmag.com/itr/news/psecom,id,39870,nodeid,146.html">Hadoop - Top Level Project in Apache Software</a> (jax magazine)</li>	
</ul>

<a href="mailto:jzawodn@yahoo-inc.com">Jeremy Zawodny</a>
Yahoo! Developer Network]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2008/01/hadoop_is_an_apache_top_level.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2008/01/hadoop_is_an_apache_top_level.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
        
         <pubDate>Fri, 25 Jan 2008 13:17:12 -0800</pubDate>
      </item>
            <item>
         <title>If it hurts, automate it</title>
         <description><![CDATA[In many projects, it's painful to ensure that all unit tests are run on every patch <i>before</i> committing.  Add to this some other basic checks:

<ul>
  <li> no new javac compiler warnings </li>
  <li> no new <a href="http://findbugs.sourceforge.net/">Findbugs</a> warnings </li>
  <li> zero javadoc warnings </li>
  <li> zero @author attributions </li>
</ul>

and committing patches becomes very painful and time consuming indeed.

Something <i>that</i> painful, of course, is begging to be automated -- and that's exactly what we did with our patching process for the <a href="http://lucene.apache.org/hadoop/">Hadoop</a> project.

Every patch for Hadoop must be attached to a <a href="http://issues.apache.org/jira/browse/HADOOP">Jira</a> issue.  When a Jira issue is moved into the <i>Patch Available</i> state by a developer, the <a href="http://lucene.zones.apache.org:8080/hudson/">Hadoop continuous integration server</a> automatically picks up the issue's latest patch, applies it to a fresh checkout of trunk, builds the software, runs all the unit tests, and verifies all the other items listed above.  Once the build and testing are complete, a comment like this is automatically added to the Jira issue:

<blockquote>
<pre>
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12368715/1185_20071030b.patch
against trunk revision r590273.
</pre>
<pre>
  @author +1. The patch does not contain any @author tags.
  javadoc +1. The javadoc tool did not generate any warning messages.
  javac +1. The applied patch does not generate any new compiler warnings.
  findbugs +1. The patch does not introduce any new Findbugs warnings.
  core tests -1. The patch failed core unit tests.
  contrib tests -1. The patch failed contrib unit tests.
</pre>
<pre>
Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1033/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1033/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1033/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/1033/console
</pre>
<pre>
This message is automatically generated.
</pre>
</blockquote>
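
To make the shape of that comment concrete, here is a minimal Python sketch of how per-check results could be rolled up into an overall vote.  This is not the actual Hadoop patch-testing script: the names CheckResult, overall_vote and format_comment are hypothetical, and the rule that a single failed check makes the overall vote -1 is an assumption inferred from the sample comment above.

```python
from collections import namedtuple

# Hypothetical record for one check, e.g. "javadoc" or "core tests".
CheckResult = namedtuple("CheckResult", ["name", "passed", "message"])


def overall_vote(results):
    """Return "+1" only if there is at least one check and all passed.

    Assumption: any single failing check makes the overall vote "-1".
    """
    return "+1" if results and all(r.passed for r in results) else "-1"


def format_comment(patch_url, revision, results):
    """Build a comment laid out like the sample above (hypothetical helper)."""
    lines = [
        overall_vote(results)
        + " overall. Here are the results of testing the latest attachment",
        patch_url,
        "against trunk revision " + revision + ".",
        "",
    ]
    for r in results:
        vote = "+1" if r.passed else "-1"
        lines.append("  " + r.name + " " + vote + ". " + r.message)
    lines.append("")
    lines.append("This message is automatically generated.")
    return "\n".join(lines)
```

Calling format_comment with a list that contains one failed CheckResult yields a comment whose first line begins with "-1 overall.", matching the sample.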

This automated system reduces a <a href="http://lucene.apache.org/hadoop/credits.html#Committers">Hadoop committer's</a> burden to simply ensuring that these two comments exist on the Jira issue:

<ul>
<li>a "+1" comment from the automated patch build, and</li>
<li>a "+1" comment from a code reviewer.</li>
</ul>
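
The two-comment rule above is simple enough to state as code.  The sketch below is only an illustration: ready_to_commit and the labels "patch build" and "code reviewer" are hypothetical names, not part of Jira or of the Hadoop tooling.

```python
def ready_to_commit(votes):
    """Return True only when both required "+1" comments are present.

    votes maps a hypothetical source label to the vote text it left
    on the issue, e.g. {"patch build": "+1", "code reviewer": "+1"}.
    """
    return (
        votes.get("patch build") == "+1"
        and votes.get("code reviewer") == "+1"
    )
```

A missing entry counts the same as a "-1": the patch is not ready until both sources have voted "+1".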

No more committing pain!

Nigel Daley
Grid Computing QA Lead

Want a fun job testing open source?  Passionate about software quality?  I'm hiring!  Talk to me at: ndaley at yahoo-inc dot com.  Testing and coding experience required.]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2007/12/if_it_hurts_automate_it_1.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2007/12/if_it_hurts_automate_it_1.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">misc</category>
        
        
         <pubDate>Sat, 22 Dec 2007 08:22:31 -0800</pubDate>
      </item>
            <item>
         <title>Getting Paid to Test Open Source Software</title>
         <description><![CDATA[To my mind, there's really nothing better.  Working on open source software with a community of programmers passionate about what they're building, and getting paid to do it.  Perhaps this is becoming more common for developers, but it is certainly a rare occurrence for a quality engineer.  Very few companies that I know of dedicate QA resources to open source projects.

My <a href="http://careers.yahoo.com/">employer</a> asks that I contribute my testing expertise to the <a href="http://lucene.apache.org/hadoop/">Apache Hadoop</a> project and the <a href="http://incubator.apache.org/pig">Apache Pig</a> project, among others.

Tools are a big part of testing any project.  Finding the right test tools, ones that will actually add value, can require careful research and analysis.  This problem is somewhat constrained for open source projects, since the tools themselves should also be open source and (hopefully) distributable with the project.  A great list of open source testing tools can be found at <a href="http://opensourcetesting.org/">http://opensourcetesting.org/</a>.  In addition, some <a href="http://opensource.fortifysoftware.com">companies</a> (and <a href="http://www.cenqua.com/">another</a> and <a href="http://www.atlassian.com">another</a>) allow open source projects to use their tools for free, but not to redistribute them.

On the <a href="http://lucene.apache.org/hadoop/">Hadoop</a> project, for instance, we use a number of tools that fit the above models:

<ul>
<li><a href="https://hudson.dev.java.net/">Hudson</a> - open source continuous integration server</li>
<li><a href="http://findbugs.sourceforge.net/">Findbugs</a> - static analyzer of Java code</li>
<li><a href="http://junit.org/">JUnit</a> - unit test framework</li>
<li><a href="http://checkstyle.sourceforge.net/">Checkstyle</a> - coding style adherence tool</li>
<li><a href="http://www.cenqua.com/clover/">Clover</a> - commercial code coverage tool, donated to Apache</li>
<li><a href="http://www.atlassian.com/software/jira/">Jira</a> - commercial issue tracking tool, donated to Apache</li>
</ul>

There are many other tools that I'd like to evaluate and integrate as time permits, including <a href="http://pmd.sourceforge.net/">PMD</a>, <a href="http://www.stanford.edu/~mhn/chord.html">Chord</a>, and <a href="http://code.google.com/p/multithreadedtc">MultithreadedTC</a>.  Have experience using any of these tools on open source projects?  Have other tool suggestions?  I'd love to hear your comments!

Want a fun job testing open source?  Passionate about software quality?  I'm hiring!  Talk to me at: ndaley at yahoo-inc dot com.  Testing and coding experience required.

Nigel Daley
Grid Computing QA Lead]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2007/12/getting_paid_to_test_open_sour.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2007/12/getting_paid_to_test_open_sour.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">misc</category>
        
        
         <pubDate>Mon, 03 Dec 2007 07:41:20 -0800</pubDate>
      </item>
            <item>
         <title>Pig into Incubation at the Apache Software Foundation</title>
         <description><![CDATA[A few weeks ago, a project called <a href="http://incubator.apache.org/pig/">Pig</a> went into <a href="http://incubator.apache.org/">incubation</a> at the <a href="http://apache.org/">Apache Software Foundation</a>.

Since you're probably scratching your head about what that sentence means, let me break it down for you.  Pig is a project that <a href="http://research.yahoo.com/node/90">began in Yahoo! Research</a> and we're building an open source community to further develop it via the Apache Software Foundation (ASF).  Right now it's in the initial phases of becoming a full-fledged project under the ASF umbrella.  That's commonly referred to as incubation, since it is hosted by the <a href="http://incubator.apache.org/">Apache Incubator</a>.  If you'd like more details, check out the <a href="http://wiki.apache.org/incubator/PigProposal">Pig Proposal</a> on the Incubator wiki.

<blockquote>The Incubator project is the entry path into The Apache Software Foundation (ASF) for projects and codebases wishing to become part of the Foundation's efforts. All code donations from external organisations and existing external projects wishing to join Apache enter through the Incubator.</blockquote>

Great.  So what's this Pig thing all about?  I asked that question of Olga Natkovich, one of the Pig developers here at Yahoo.

<blockquote>Pig is a high-level language (PigLatin) for data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.</blockquote>

In my mind, Pig is to Hadoop as SQL is to relational databases.  It's the language and logic that'll open up access to a much wider audience of people: anyone who can write a query.  Today you usually need to sit down and write code to make use of the results from processing data on a Hadoop cluster.  By building a robust query layer on top of Hadoop, the barrier gets quite a bit lower.

See Also: <a href="http://glinden.blogspot.com/2007/04/yahoo-pig-and-google-sawzall.html">Yahoo Pig and Google Sawzall</a> (Greg Linden)

<a href="mailto:jzawodn@yahoo-inc.com">Jeremy Zawodny</a>
Yahoo! Developer Network]]></description>
         <link>http://developer.yahoo.com/blogs/hadoop/2007/11/pig_into_incubation.html</link>
         <guid>http://developer.yahoo.com/blogs/hadoop/2007/11/pig_into_incubation.html</guid>
                  <category domain="http://www.sixapart.com/ns/types#category">announcements</category>
        
        
         <pubDate>Fri, 30 Nov 2007 08:15:07 -0800</pubDate>
      </item>
      
   </channel>
</rss>
