<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:media="http://search.yahoo.com/mrss/">

<channel>
	<title>Yahoo! Hadoop Blog</title>
	<atom:link href="http://developer.yahoo.com/blogs/hadoop/feed/rss2/" rel="self" type="application/rss+xml" />
	<link>http://developer.yahoo.com/blogs/hadoop</link>
	<description></description>
	<lastBuildDate>Thu, 07 Jul 2011 02:16:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Hadoop Summit 2011 – A Different Approach</title>
		<link>http://developer.yahoo.com/blogs/hadoop/posts/2011/07/hadoop-summit-2011-%e2%80%93-a-different-approach/</link>
		<comments>http://developer.yahoo.com/blogs/hadoop/posts/2011/07/hadoop-summit-2011-%e2%80%93-a-different-approach/#comments</comments>
		<pubDate>Thu, 07 Jul 2011 02:16:00 +0000</pubDate>
		<dc:creator>Avik Dey</dc:creator>
				<category><![CDATA[Hadoop User Group]]></category>
		<category><![CDATA[2011]]></category>
		<category><![CDATA[@AvikonHadoop]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoop summit]]></category>

		<guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/?p=6321</guid>
		<description><![CDATA[Hadoop Summit 2011 is over. If you saw this tweet ”#hadoopsummit planned for 1,500. upped on demand to 1,600. finally accommodated 1,700. ran out of space, good problem to have. :-),” then you probably got an idea of how exciting and mobbed the conference was this year. With folks dropping by from coast-to-coast, and quite [...]]]></description>
			<content:encoded><![CDATA[<p>Hadoop Summit 2011 is over. If you saw this <a href="http://twitter.com/#!/AvikonHadoop/statuses/87028194447855616">tweet</a> ”#hadoopsummit planned for 1,500. upped on demand to 1,600. finally accommodated 1,700. ran out of space, good problem to have. :-),” then you probably got an idea of how exciting and mobbed the conference was this year. With folks dropping by from coast-to-coast, and quite a few from around the world, Hadoop Summit 2011 will quite likely be the year’s largest Hadoop gathering. But even more so, because of the passion of everyone that participated, it was also the best Hadoop gathering of the year, raising the bar yet again for Hadoop technical content and networking.</p>

<p>At the Summit and since it ended, I have received questions from folks who attended the show and some who couldn’t make it. In general, a lot of people were curious about what went into developing the Summit and the approach we took to the Summit. I thought I’d take some time today and summarize my thoughts on this topic.</p>

<p>Obviously, in conference planning, a lot of the success of an event comes down to logistics, and fitting 1,700 people into the Santa Clara Convention Center for one action-packed day definitely requires a lot of logistics. But beyond those details, I think more important this year was the decision to change how we approached the Summit and to make sure the focus of the event was on building the Hadoop community itself. The Hadoop community will be at the heart of Hadoop’s continued innovation, so it is important that the community continues to grow and share with each other. </p>

<p>Here is what the Summit was really about for me and what I asked the team to focus on:</p>

<ol>
<li><p><strong>Content:</strong> This year we moved away from the “come as you please” style for presentation content that we had used in years past. What is the line that I used to stress this to the team? “Content is to the Summit what Location is to Real-Estate.” Everything. Therefore, all technical content was first selected and then shepherded through a rigorous review and feedback process by the Program Committee. As a result, we heard some fantastic feedback on the quality and usefulness of the presentations this year in the technical tracks. Raymie Stata, Yahoo!’s CTO, made the point numerous times, saying that as a technology grows you usually see the amount and depth of technical content at conferences dedicated to that technology erode, but that at Hadoop Summit it was just the opposite. If anything, the technical content was even deeper, which bodes well for Hadoop and the community’s future.  I cannot thank my Track Co-chairs enough for their support in making this happen. You can find out more about them at: <a href="http://www.hadoopsummit.org">www.hadoopsummit.org</a>.</p></li>
<li><p><strong>Sponsored Content:</strong> Related to keeping the overall quality of technical content high at the Summit was the desire to keep the amount of sponsored content relatively low, ideally in the ballpark of 30 minutes out of the day. Mind you, the goal was never <em>not</em> to have any sponsored content. Sponsors who are offering Hadoop-related solutions have a significant role to play in the Hadoop Ecosystem and its evolution, but especially with so many new attendees and folks just getting started with Hadoop, the most valuable content is the technically focused insights they can put to work as they ramp up projects and get up to speed. The results speak for themselves – check out my fun fact number 3 below. </p></li>
<li><p><strong>Ecosystem:</strong> There was some discussion and thinking early on in the conference planning that we should only include open source solutions in the program. However, in the end the majority of the Program Committee agreed that  the Summit should represent the entire Hadoop ecosystem. Why? Because while technically it’s possible to separate Apache Hadoop from other Hadoop powered solutions, actual users don’t always make this distinction when putting Hadoop to work. Some use Apache, some use other distributions, but all of these folks are Hadoop users and are therefore part of the Hadoop ecosystem. Without users, there is no Hadoop, so who are we to leave some out?</p></li>
<li><p><strong>Users:</strong> Speaking of users, they are who motivated us to have a brand new track this year focused on them, the Operations and Experience track. The goal of this track was to provide a forum for sharing how different companies are operating and managing Hadoop in the real world, or otherwise talk about their experience with Hadoop. As we had expected, this content was particularly popular with Hadoop users this year. I believe, in the years to come, given the pace of Hadoop growth the interest in this track will continue to increase and the content will no doubt expand as well.</p></li>
<li><p><strong>Developers:</strong> Finally, developers are still the core of the Hadoop Summit and are still the engine for innovation in Hadoop. Many attendees complimented me on what a great “event” Hadoop Summit was. Interestingly, I never really thought of the Summit as an event. For me it was a Hadoop developer and user gathering. As a developer myself, having been there and done that, now I enjoy helping showcase the amazing, high quality work that comes from the Hadoop community of developers. Helping great code get shared and adopted by developers and users is really the heart and soul of the Summit for me.</p></li>
</ol>

<p>Looking forward, on the heels of a successful Summit, what’s next? Here at Yahoo! I’ve been saying that Hadoop the Software is maturing, while Hadoop the Product is still in its nascence. Do I mean that Hadoop the Software is done? Not at all. What I mean is that future work on Hadoop as software will be focused on transforming it into Hadoop the product. That means a lot of development on Usability, Manageability and Operability. These very broad areas cover a lot of ground, but collectively they all go towards making Hadoop more Enterprise ready. Hadoop is already ready for the Enterprise when it comes to Scalability, Availability and Reliability. At Yahoo! we know this probably better than anyone else. But it is the rest of these core “abilities” that will put Hadoop on the fast track to Enterprise deployment at companies that don’t have large Hadoop engineering teams in-house.</p>

<p>Before I wrap up, here are a few little-known facts about this year’s Summit that speak to all of the above:</p>

<ol>
<li>The Summit had a 27% acceptance rate for submissions – it was very competitive and a great sign that the interest in Hadoop is driving high-quality technical content.</li>
<li>50% of the ticket sales happened in the last two weeks. In other words, you all need to plan better so we can fit more of you. :-)</li>
<li>There were no sponsored tech talks this year, none at all – so if you’re looking for unbiased, useful technical content, the Summit tech talks were as pure as can be.</li>
<li>Multiple leading database vendors are currently evaluating Hadoop and HBase for internal use.</li>
<li>Hadoop is designed for use with non-reliable storage. ;-)</li>
</ol>

<p>Looking forward to seeing the northern California Hadoopers at the July HUG. If you would like to attend, you can sign up here: <a href="http://www.meetup.com/hadoop/">http://www.meetup.com/hadoop/</a>.</p>

<p>These are exciting times for Hadoop and may you enjoy living through them.</p>

<p>/later</p>
]]></content:encoded>
			<wfw:commentRss>http://developer.yahoo.com/blogs/hadoop/posts/2011/07/hadoop-summit-2011-%e2%80%93-a-different-approach/feed/rss2/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fourth Annual Hadoop Summit: The Countdown Begins!</title>
		<link>http://developer.yahoo.com/blogs/hadoop/posts/2011/06/fourth-annual-hadoop-summit-the-countdown-begins/</link>
		<comments>http://developer.yahoo.com/blogs/hadoop/posts/2011/06/fourth-annual-hadoop-summit-the-countdown-begins/#comments</comments>
		<pubDate>Wed, 15 Jun 2011 00:05:37 +0000</pubDate>
		<dc:creator>Avik Dey</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/?p=5971</guid>
		<description><![CDATA[On June 29, Yahoo! will host the 4th annual Hadoop Summit at the Santa Clara Convention Center. Hadoop Summit 2011 brings together some of the most influential thought leaders in the space - from Yahoo, Facebook, IBM, NetApp, and others. Jay Rossiter, Senior Vice President of the Yahoo! Cloud Platform Group will open the show [...]]]></description>
			<content:encoded><![CDATA[<p>On June 29, Yahoo! will host the 4th annual Hadoop Summit at the Santa Clara Convention Center. Hadoop Summit 2011 brings together some of the most influential thought leaders in the space - from Yahoo, Facebook, IBM, NetApp, and others. </p>

<p>Jay Rossiter, Senior Vice President of the Yahoo! Cloud Platform Group, will open the show with a keynote on how Yahoo! is developing the next generation of Hadoop applications to handle big data, the important role that Hadoop plays in Yahoo!’s integrated technology ecosystem, and how wide industry adoption of Hadoop is benefiting the entire community.</p>

<p>Also on the main stage, Facebook will discuss its use of Hadoop to power the Facebook Messages infrastructure, and IBM will discuss how it used Hadoop to power its Watson supercomputer.</p>

<p>Additional conference highlights include some key sessions:</p>

<ul>
<li><em>Next Generation Apache Hadoop MapReduce:</em> Arun Murthy, Yahoo!’s lead architect on the  Hadoop Map-Reduce development team, will lead a discussion on the next generation of Apache Hadoop MapReduce that factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application execution.</li>
<li><em>Introducing HCatalog (Hadoop Table Manager):</em> Alan Gates, Yahoo! architect for Pig and Howl, will provide an overview of HCatalog as well as the release/roadmap.</li>
<li><em>Automated Rolling OS Upgrades for Yahoo! Hadoop Grids:</em> Dan Romike, Yahoo! Hadoop Data and Grid Systems engineer, will detail how we are upgrading thousands of servers, the problems of system state management, and the operational workflows specific to a Hadoop grid environment.</li>
<li><em>Case Studies of Hadoop Operations at Yahoo!:</em> The Grid Operations team at Yahoo! operates about 40,000 servers running Hadoop in clusters of up to 4,200 servers. Charles Wimmer, Yahoo! senior service engineer of grid computing, will provide a deep dive into a series of case studies that illustrate the operational issues that arise at this scale.</li>
<li><em>Large Scale Math with Hadoop MapReduce:</em> Nerd out and learn how Yahoo! established a new world record by computing the two-quadrillionth bit of pi using Hadoop in July 2010. Widely covered in the news, the world record computation was composed of 35,000 MapReduce jobs, requiring 23 days of real time and 503 years of CPU time in Yahoo! clusters. In this session, led by Yahoo! Hadoop engineer Tsz-Wo Sze, attendees will also learn MapReduce algorithms for large-scale mathematical calculation, their implementation, and our experience in running and tuning these computations in Hadoop clusters.</li>
</ul>

<p>Check out the official <a href="http://developer.yahoo.com/events/hadoopsummit2011/agenda.html">conference agenda</a>  for a full preview of what’s to come and a look at the 32 different sessions, including best practice deep-dives and case studies on the Hadoop roadmap, operations and management, innovative Hadoop applications and research, and much more.</p>

<p>Space is limited, so don't miss this unique opportunity to hear directly from Hadoop thought leaders and pioneers by registering now.</p>

<p>And finally, a special thanks to the Hadoop Summit 2011 sponsors:</p>

<p><em>Platinum Sponsors:</em><br />
* MapR Technologies<br />
* NetApp<br /></p>

<p><em>Gold Sponsors:</em><br />
* Aster Data<br />
* Cloudera<br />
* DataStax<br />
* Datameer<br />
* IBM<br /></p>

<p><em>Silver Sponsors:</em><br />
* Amazon Web Services<br />
* Arista<br />
* Impetus<br />
* Pentaho<br />
* Syncsort<br /></p>

<p><em>Sponsors:</em><br />
* Dell<br />
* Hadapt<br />
* HStreaming<br />
* Jive<br />
* Karmasphere<br />
* Mellanox<br />
* Pervasive DataRush<br />
* Quest Software<br />
* Softlayer<br />
* StackIQ<br />
* ThinkBig Analytics<br /></p>

<p>Stay up to date on Hadoop Summit buzz by following #hadoopsummit on Twitter.</p>

<p>If you are a Hadooper or would like to become one, come join the community for this one day event.</p>
]]></content:encoded>
			<wfw:commentRss>http://developer.yahoo.com/blogs/hadoop/posts/2011/06/fourth-annual-hadoop-summit-the-countdown-begins/feed/rss2/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Slides from eric14 talks @ #IbmBigData</title>
		<link>http://developer.yahoo.com/blogs/hadoop/posts/2011/05/eric14-talks-ibmbigdata/</link>
		<comments>http://developer.yahoo.com/blogs/hadoop/posts/2011/05/eric14-talks-ibmbigdata/#comments</comments>
		<pubDate>Fri, 13 May 2011 22:11:39 +0000</pubDate>
		<dc:creator>Eric Baldeschwieler</dc:creator>
				<category><![CDATA[Presentations]]></category>

		<guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/?p=5671</guid>
		<description><![CDATA[Hi Folks, Here are my slides from the IBM big data symposium. This was a good event. IBM announced a new release of their Apache Hadoop based Big Insights platform. It is great to hear their commitment to Apache. Yahoo was there talking about our experiences and uses of Hadoop. I got a lot of [...]]]></description>
			<content:encoded><![CDATA[<p>Hi Folks,</p>

<p>Here are my slides from the IBM big data symposium.  This was a good event.  IBM announced a new release of their Apache Hadoop based Big Insights platform.  It is great to hear their commitment to Apache.  Yahoo was there talking about our experiences and uses of Hadoop.  I got a lot of questions about why we invest in Hadoop, so let me point you back to my post on that and our commitment to Apache Hadoop. (<a href="http://yhoo.it/e8p3Dd">http://yhoo.it/e8p3Dd</a> and <a href="http://yhoo.it/i9Ww8W">http://yhoo.it/i9Ww8W</a>)</p>

<p>Thanks, 
E14</p>

<div style="width:425px;" id="__ss_7958280"> <strong style="display:block;margin:12px 0 4px;"><a href="http://www.slideshare.net/jeric14/hadoop-ibmbigdata" title="hadoop @ Ibmbigdata">hadoop @ Ibmbigdata</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/7958280" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe> <div style="padding:5px 0 12px;"> View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/jeric14">Eric Baldeschwieler</a> </div> </div>
]]></content:encoded>
			<wfw:commentRss>http://developer.yahoo.com/blogs/hadoop/posts/2011/05/eric14-talks-ibmbigdata/feed/rss2/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop Summit CFP closing tomorrow!</title>
		<link>http://developer.yahoo.com/blogs/hadoop/posts/2011/05/hadoop-summit-cfp-closing-tomorrow/</link>
		<comments>http://developer.yahoo.com/blogs/hadoop/posts/2011/05/hadoop-summit-cfp-closing-tomorrow/#comments</comments>
		<pubDate>Thu, 05 May 2011 23:00:41 +0000</pubDate>
		<dc:creator>Owen OMalley</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/?p=5591</guid>
		<description><![CDATA[Stack and I are the track organizers for the community track at the Hadoop Summit this year. The community track is for presentations on roadmap, developments and features in Apache Hadoop. So if you've added a new feature to Hadoop and want to publicize it to the world's largest and most important Hadoop conference, please [...]]]></description>
			<content:encoded><![CDATA[<p>Stack and I are the track organizers for the community track at the Hadoop Summit this year. The community track is for presentations on roadmap, developments and features in Apache Hadoop. So if you've added a new feature to Hadoop and want to publicize it to the world's largest and most important Hadoop conference, please submit it!</p>

<p><a href="http://developer.yahoo.com/events/hadoopsummit2011/">http://developer.yahoo.com/events/hadoopsummit2011/</a></p>

<p>The deadline is 6 May, which is tomorrow!</p>
]]></content:encoded>
			<wfw:commentRss>http://developer.yahoo.com/blogs/hadoop/posts/2011/05/hadoop-summit-cfp-closing-tomorrow/feed/rss2/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Call for participation in the Hadoop Summit Research Track</title>
		<link>http://developer.yahoo.com/blogs/hadoop/posts/2011/05/call-for-participation-in-the-hadoop-summit-research-track/</link>
		<comments>http://developer.yahoo.com/blogs/hadoop/posts/2011/05/call-for-participation-in-the-hadoop-summit-research-track/#comments</comments>
		<pubDate>Mon, 02 May 2011 22:02:57 +0000</pubDate>
		<dc:creator>breed</dc:creator>
				<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Hadoop User Group]]></category>
		<category><![CDATA[hadoop summit]]></category>

		<guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/?p=5391</guid>
		<description><![CDATA[Hadoop Summit is a great annual gathering of developers to talk about all things Hadoop. The attendance is great, we are expecting 2000 this year; the presentations are excellent; and the hallway conversations are a great way to meet new people and come up with new ideas. This environment is especially great if you have [...]]]></description>
			<content:encoded><![CDATA[<p>Hadoop Summit is a great annual gathering of developers to talk about all things Hadoop. The attendance is great, we are expecting 2000 this year; the presentations are excellent; and the hallway conversations are a great way to meet new people and come up with new ideas.</p>

<p>This environment is especially great if you have a great idea that you would like to share with the community. You will have a great audience of knowledgeable developers that you can try to convince to help you to take your work to the next level. Doesn't it sound ... great!?!</p>

<p>Milind and I are organizing the research and application track. If you have built some new framework on top of Hadoop or made Hadoop better, let us know. We will be selecting the most interesting results for the research and application track. </p>

<p>General information for the Hadoop Summit is at <a href="http://hadoopsummit.org">http://hadoopsummit.org</a>. You can submit an abstract for your presentation at <a href="http://developer.yahoo.com/events/hadoopsummit2011/presentationguidelines.html">http://developer.yahoo.com/events/hadoopsummit2011/presentationguidelines.html</a></p>

<p>We are looking for presentations of new applications or improvements that have been implemented. We are not looking for position papers or visions. It is fine to submit a presentation that has been presented in other venues. If you are submitting something that has been published in a research conference, you may want to put an engineering perspective on it. This crowd looks behind the beautiful ideas and is going to want to see how/if it really works.</p>

<p>Hope to see you there!</p>
]]></content:encoded>
			<wfw:commentRss>http://developer.yahoo.com/blogs/hadoop/posts/2011/05/call-for-participation-in-the-hadoop-summit-research-track/feed/rss2/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HCatalog, tables and metadata for Hadoop</title>
		<link>http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hcatalog-tables-and-metadata-for-hadoop/</link>
		<comments>http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hcatalog-tables-and-metadata-for-hadoop/#comments</comments>
		<pubDate>Fri, 29 Apr 2011 21:02:24 +0000</pubDate>
		<dc:creator>Alan Gates</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/?p=5451</guid>
		<description><![CDATA[Last month the HCatalog project (formerly known as Howl) was accepted into the Apache Incubator. We have already branched for a 0.1 release, which we hope to push in the next few weeks. Given all this activity, I thought it would be a good time to write a post on the motivation behind HCatalog, what [...]]]></description>
			<content:encoded><![CDATA[<p>Last month the HCatalog project (formerly known as Howl) was accepted into the Apache Incubator.  We have already branched for a 0.1 release, which we hope to push in the next few weeks.  Given all this activity, I thought it would be a good time to write a post on the motivation behind HCatalog, what features it will provide, and who is working on it.</p>

<h2>Why Did We Create HCatalog?</h2>

<p>Out of the box Hadoop provides the HDFS file system for users to store their data.  File systems are nice because they provide a simple interface.  Users can easily copy data into the file system and run jobs against that data.  However, for more complex data processing tasks, the file system abstraction is not rich enough.  It forces users to know where data is located, what format it is stored in, how it is compressed, and what its schema is.  Consider, for example, a Pig Latin script used to do ETL on raw web logs:</p>

<p><code>
A = load '/data/raw/ds=20110225/region=us/property=news' using PigStorage()<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;as (user:chararray, url:chararray, timestamp:long);<br />
...
</code></p>

<p>If the grid administrator needs to move this data to a new location, or compact multiple files into an archive (such as har), or the data producer changes the schema or starts writing it compressed, this Pig Latin script will need to be changed.  Pig and MapReduce job specifications are tightly coupled with the data storage.  This inhibits data producers' ability to improve their processes.  They and their users are forced to go through a painful transition process to take advantage of any improvements.  This lack of data storage independence is exacerbated by the fact that in most large Hadoop installations there will be users using different tools to access their data.  So a change in storage format will ripple through multiple groups and require coordination across disparate data processing tools.</p>

<p>The fact that in many Hadoop installations different users will be using different tools brings up another issue.  These tools do not share a common notion of schemas or data types.  Pig and Hive have different, though similar, data models.  MapReduce leaves data types to its users.  This can make sharing a data set across users with different tools difficult and error prone.</p>

<p>When data consumers are building new applications or want to query new data sets, finding the data in a file system is difficult.  Generally, a user has to contact the group that produces the data and ask them where they write their data, what format it is in, and what its schema is.  Some organizations use wiki pages or similar mechanisms to record this information.  But such pages inevitably get stale as changes to the data storage are made.</p>

<p>Even once users know where data is, file systems do not help them know when it is available.  In a complex data processing environment there will be many users waiting for the availability of a given data set.  This is particularly true of the foundational data sets that form the basis for much of an organization's data processing.  With a file system, the only way to know if data is available is to do an 'ls'.  Hundreds of users banging away with 'ls' commands every few seconds does not help HDFS' performance.</p>

<p>Finally, a file system tends to become a dumping ground with little knowledge of how the data there should be managed.  How long should data be kept?  Are there legal reasons to store it on tape before deleting it?  If so, how do you know what has been archived and what has not?  With a file system all of these issues tend to be addressed by conventions.  "Data in this directory is cleaned up after 30 days."  "After data is archived a '.archive' file is placed in its directory."  Often each data creator has to answer these questions, and they tend to answer them differently.  Thus an organization ends up with not one data management policy, but twenty, thirty, or a hundred policies.</p>

<h2>How HCatalog Addresses These Issues</h2>

<p>Hadoop needs a better abstraction for data storage, and it needs a metadata service.  HCatalog addresses both of these issues.  It presents users with a table abstraction.  This frees them from knowing where or how their data is stored.  It allows data producers to change how they write data while still supporting existing data in the old format so that data consumers do not have to change their processes.  It provides a shared schema and data model for Pig, Hive, and MapReduce.  It will enable notifications of data availability.  And it will provide a place to store state information about the data so that data cleaning and archiving tools can know which data sets are eligible for their services.</p>

<p>HCatalog takes Hive's metastore, and wraps additional layers around it to provide these services.  It comes with <tt>HCatInputFormat</tt> and <tt>HCatOutputFormat</tt> for MapReduce users, and <tt>HCatLoader</tt> and <tt>HCatStorer</tt> for Pig users.  Taking the Pig script example above, using HCatalog it looks like:</p>

<p><code>
A = load 'raw' using HCatLoader();<br />
B = filter A by ds == '20110225' and region == 'us' and property == 'news';<br />
...
</code></p>

<p>Notice that Pig no longer knows or cares about the file location or storage format.  It is just loading a table named 'raw'.  If an administrator decides to relocate the files that store raw, or switch to a better storage format, this Pig Latin script does not change at all.  Also notice that the user no longer has to declare the schema.  <tt>HCatLoader</tt> communicates that automatically to Pig.  HCatalog uses Hive's data model.  When loading data into Pig, the types are translated to Pig types.  For example, a Hive struct becomes a Pig tuple.  When used by MapReduce, the value that <tt>HCatInputFormat</tt> returns is an <tt>HCatRecord,</tt> which is an ordered list of typed data.  At the same time, this does not force Pig or MapReduce to scan the whole table.  Pig knows how to push the filter statement into <tt>HCatLoader</tt> so that only the appropriate files are read.</p>
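
<p>To make the filter push-down idea concrete, here is a minimal Python sketch of partition pruning. It is a toy model and not HCatalog's API: the <tt>Partition</tt> class, the <tt>prune</tt> function, and the file paths are all hypothetical names invented for illustration. The only point it makes is that when the catalog knows each partition's key values, a filter on those keys can pick the files to read before any data is scanned.</p>

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """One partition of a table: its key values and the files that back it."""
    keys: dict
    files: list


def prune(partitions, wanted):
    """Return only the files of partitions whose keys match every wanted value."""
    selected = []
    for part in partitions:
        if all(part.keys.get(name) == value for name, value in wanted.items()):
            selected.extend(part.files)
    return selected


# Hypothetical catalog entries for the 'raw' table; the paths are made up.
RAW = [
    Partition({"ds": "20110225", "region": "us", "property": "news"},
              ["/data/raw/a/part-0", "/data/raw/a/part-1"]),
    Partition({"ds": "20110225", "region": "eu", "property": "news"},
              ["/data/raw/b/part-0"]),
    Partition({"ds": "20110226", "region": "us", "property": "news"},
              ["/data/raw/c/part-0"]),
]

# Same predicate as the Pig filter above, expressed as key/value pairs.
files = prune(RAW, {"ds": "20110225", "region": "us", "property": "news"})
print(files)
```

<p>Only the two files of the matching partition are selected; the other partitions are never opened. A real loader would also have to handle predicates on non-partition columns, which this sketch ignores.</p>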

<p>HCatalog includes Hive's command line interface so that administrators can create and drop tables, specify table parameters, etc.  It will also allow users to explore what tables are available and what their schemas are.</p>

<p>HCatalog also provides an API for storage format developers to tell HCatalog how to read and write data stored in different formats.  Currently HCatalog knows how to read and write RCFiles.  But if your data is stored in a different format, you can implement an <tt>HCatInputStorageDriver</tt> and <tt>HCatOutputStorageDriver</tt> to tell HCatalog how to translate between your data storage and the record format HCatalog uses.  Which StorageDriver to use is stored at the partition level.  So if you need to change how your table is stored when you already have a year's data in it, there is no need to reprocess the data.  New data can be written in the new format while the old data stays in the old format.  HCatalog handles using the correct StorageDriver to read each partition, allowing users to read across partitions, never knowing they are in different formats.</p>
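
<p>The per-partition choice of driver can be pictured with a small Python sketch. Again this is an illustration under stated assumptions, not the real StorageDriver interface: the <tt>DRIVERS</tt> registry, the <tt>scan</tt> function, and the use of CSV and JSON lines as the "old" and "new" formats are stand-ins chosen only because they are easy to decode with the standard library.</p>

```python
import csv
import io
import json


def read_csv(blob):
    """Decode comma-separated lines into (user, url, timestamp) records."""
    return [(user, url, int(ts)) for user, url, ts in csv.reader(io.StringIO(blob))]


def read_json(blob):
    """Decode one JSON object per line into the same record shape."""
    rows = (json.loads(line) for line in blob.splitlines())
    return [(row["user"], row["url"], row["timestamp"]) for row in rows]


# Registry of readers, keyed by the format name stored with each partition.
DRIVERS = {"csv": read_csv, "json": read_json}

# Hypothetical table: the older partition is CSV, the newer one is JSON lines.
TABLE = [
    {"ds": "20110224", "format": "csv",
     "data": "alice,/home,100\nbob,/news,101\n"},
    {"ds": "20110225", "format": "json",
     "data": '{"user": "carol", "url": "/mail", "timestamp": 200}\n'},
]


def scan(table):
    """Yield records from every partition using that partition's own driver."""
    for partition in table:
        driver = DRIVERS[partition["format"]]
        yield from driver(partition["data"])


records = list(scan(TABLE))
print(records)
```

<p>The caller of <tt>scan</tt> sees one stream of identically shaped records and never learns that the two partitions were written in different formats, which is the property the paragraph above describes.</p>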

<h2>HCatalog Is Brought To You By...</h2>

<p>HCatalog is a collaborative effort between members from the Apache Pig, Hive, and Hadoop projects, plus new contributors.  Most of the new code has been written by Yahoos, while the Hive team has been very helpful in providing design feedback and pushing HCatalog's changes into the Hive metastore.</p>

<h2>Where Can You Learn More?</h2>

<p>You can learn more about HCatalog, download the source, file JIRAs, and join the mailing lists on <a href="http://incubator.apache.org/hcatalog">HCatalog's website</a>.  You can also come join us at the <a href="http://developer.yahoo.com/events/hadoopsummit2011/">Hadoop Summit</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hcatalog-tables-and-metadata-for-hadoop/feed/rss2/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Hadoop User Group meeting recap, March 2011</title>
		<link>http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hadoop-user-group-meeting-recap-march-2011/</link>
		<comments>http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hadoop-user-group-meeting-recap-march-2011/#comments</comments>
		<pubDate>Fri, 08 Apr 2011 23:52:48 +0000</pubDate>
		<dc:creator>Aravind Srinivasan</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/?p=5221</guid>
		<description><![CDATA[More than 200 Hadoop developers and enthusiasts congregated on the Yahoo campus for the monthly HUG meeting on March 16th. As always, they were treated to some enlightening presentations in addition to good food and beverages. After the usual 30 minutes of socializing and networking, Milind Bhandarkar from LinkedIn kicked off the evening with a [...]]]></description>
			<content:encoded><![CDATA[<p>More than 200 Hadoop developers and enthusiasts congregated on the Yahoo campus for the monthly HUG meeting on March 16th. As always, they were treated to some enlightening presentations in addition to good food and beverages. </p>

<p>After the usual 30 minutes of socializing and networking, Milind Bhandarkar from LinkedIn kicked off the evening with a really enlightening talk on "Scaling Hadoop Applications." As a well-respected Hadoop expert and a founding member of the Hadoop team at Yahoo in 2005, Milind was able to articulate the issues and solutions very succinctly. His talk was especially interesting because he tied well-known theorems and laws around scalability to the ground realities on the Hadoop clusters today.</p>

<p>Here are the slides from Milind's talk.</p>

<div style="width:425px;" id="__ss_7566567"><strong style="display:block;margin:12px 0 4px;"><a href="http://www.slideshare.net/ydn/hug-march-2011-scaling-hadoop" title="Hug - March 2011 - Scaling Hadoop">Hug - March 2011 - Scaling Hadoop</a></strong><embed  allowscriptaccess="never" type="application/x-shockwave-flash" name="__sse7566567" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=hugmarchscalinghadoop-110408181532-phpapp01&#038;stripped_title=hug-march-2011-scaling-hadoop&#038;userName=ydn" allowfullscreen="true" width="425" height="355"></embed></div>

<p>Following is the <a href="http://www.youtube.com/watch?v=DcZFeFtMlpY">video</a> of the presentation.</p>

<p><embed  allowscriptaccess="never" type="application/x-shockwave-flash" src="http://www.youtube.com/v/DcZFeFtMlpY?hl=en&#038;fs=1" allowfullscreen="true" width="425" height="344"></embed> </p>

<p>This was followed by an interesting talk on "HDFS Federation" by Yahoo's Suresh Srinivas. HDFS Federation is a major feature slated to come out in the near future and Suresh gave the audience an in-depth and in-the-trenches look at this key feature. This also tied into the theme of the day nicely as federation is all about scaling today's Hadoop clusters and making them bigger and faster. </p>

<p>You can find the slide-deck from Suresh's presentation below.</p>

<div style="width:425px;" id="__ss_7566644"><strong style="display:block;margin:12px 0 4px;"><a href="http://www.slideshare.net/ydn/hug-march-hdfs-federation" title="Hug - March -  HDFS Federation">Hug - March -  HDFS Federation</a></strong><embed  allowscriptaccess="never" type="application/x-shockwave-flash" name="__sse7566644" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=hugmarchfederation-110408182840-phpapp02&#038;stripped_title=hug-march-hdfs-federation&#038;userName=ydn" allowfullscreen="true" width="425" height="355"></embed></div>

<p>This talk concluded an interesting HUG. Thanks to all the Hadoop users and presenters who attended the March HUG; we hope to see you all again soon. If you have an interesting Hadoop-related topic that you would like to present, please don't hesitate to contact us.</p>
]]></content:encoded>
			<wfw:commentRss>http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hadoop-user-group-meeting-recap-march-2011/feed/rss2/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hadoop Summit 2011: June 29th, Santa Clara Convention Center</title>
		<link>http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hadoop-summit-2011-june-29th-santa-clara-convention-center/</link>
		<comments>http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hadoop-summit-2011-june-29th-santa-clara-convention-center/#comments</comments>
		<pubDate>Thu, 07 Apr 2011 18:20:32 +0000</pubDate>
		<dc:creator>Avik Dey</dc:creator>
				<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Hadoop User Group]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[Pig]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[2011]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[Summit]]></category>

		<guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/?p=5061</guid>
		<description><![CDATA[Hadoop Summit 2011 – Registration Now Open! Calling all Hadoopers Yahoo! is pleased to announce that this year’s Hadoop Summit is scheduled for June 29th at the Santa Clara Convention Center. Registration for the event is now open and offers an early bird special of $125, a savings of nearly 30% on the full ticket [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Hadoop Summit 2011 – Registration Now Open!</strong></p>

<p><strong>Calling all Hadoopers</strong></p>

<p>Yahoo! is pleased to announce that this year’s <a href="http://www.hadoopsummit.org">Hadoop Summit</a> is scheduled for June 29th at the Santa Clara Convention Center. Registration for the event is now open and offers an early bird special of $125, a savings of nearly 30% on the full ticket price of $175. <u>This ends on May 1st</u>, so <strong><a href="http://www.hadoopsummit.org">register now</a></strong> to take advantage of this great offer.</p>

<p>Whether you are already running and managing a Hadoop installation, developing Hadoop-based applications or exploring how to adopt <a href="http://hadoop.apache.org">Apache Hadoop</a> for your business, the summit provides a unique opportunity to gain deep insights into the world of Hadoop from the company that pioneered it. Learn about interesting and relevant real-world applications and find out about the latest Big Data research.</p>

<p>The summit brings together some of the most influential speakers in the Hadoop space. Our full agenda provides many informative tracks for developers, administrators, managers and researchers. A session focused on new and innovative applications using Hadoop, which was standing-room only last year, will again be a highlight of the event. Stay tuned: <a href="http://www.hadoopsummit.org">the agenda</a> will be posted shortly, and we will continue to update it.</p>

<p><a href="http://developer.yahoo.com/events/hadoopsummit2010/">Last year</a>, we had to close registration early because we reached capacity extremely quickly, with over 1200 attendees. This year we expect to attract an even larger audience as Hadoop continues to go mainstream, so please make sure you register early because space is limited. Yahoo! is the world’s largest contributor to Apache Hadoop and is joined by a number of technology leaders like Amazon, China Mobile, Facebook, LinkedIn and HP, all using Hadoop for their businesses. Come and find out how your company could benefit from using Hadoop.</p>

<p>To ensure you get a ticket, take advantage of our early bird special which costs $125, a savings of nearly 30% on the full ticket price of $175 – <strong><a href="http://www.hadoopsummit.org">Register here</a></strong>. The Early Bird ticket price <u>ends May 1st</u>.</p>

<p>If you are working on Hadoop projects or using Hadoop, this is also a terrific opportunity to share your experience with thousands of fellow Hadoop users. To submit a session abstract for one of the tracks below, see the presentation guidelines and detailed <strong><a href="http://www.hadoopsummit.org">instructions here</a></strong>. The submission deadline is May 6th.</p>

<p>Based on community feedback and interest, here are the tracks for this year:</p>

<p><strong>Community</strong>: Presentations and discussions about Hadoop roadmap, contribution and best practices.</p>

<p><strong>Operations and Management</strong>: Presentations and discussions about operations and management of Hadoop clusters and best practices.</p>

<p><strong>Applications and Research</strong>: Case studies about innovative applications and research studies based on Hadoop technologies.</p>

<p>We’re excited to be bringing you the Hadoop Summit 2011, and look forward to seeing you <strong><a href="http://www.hadoopsummit.org">there</a></strong>.</p>
]]></content:encoded>
			<wfw:commentRss>http://developer.yahoo.com/blogs/hadoop/posts/2011/04/hadoop-summit-2011-june-29th-santa-clara-convention-center/feed/rss2/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Apache Hadoop project wins MediaGuardian Innovation award</title>
		<link>http://developer.yahoo.com/blogs/hadoop/posts/2011/03/hadoop-project-wins-innovation-award/</link>
		<comments>http://developer.yahoo.com/blogs/hadoop/posts/2011/03/hadoop-project-wins-innovation-award/#comments</comments>
		<pubDate>Fri, 25 Mar 2011 00:39:08 +0000</pubDate>
		<dc:creator>Owen OMalley</dc:creator>
				<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Top Features]]></category>
		<category><![CDATA[award]]></category>
		<category><![CDATA[Guardian]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[innovation]]></category>

		<guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/?p=4831</guid>
		<description><![CDATA[The Hadoop project won the top MediaGuardian Innovation award. A groundbreaking open source project has won the top prize at the 2011 MediaGuardian Innovation Awards. The judging panel described the Apache Hadoop project as the Swiss army knife of the 21st Century, and having the potential to completely change the face of media innovations across the [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://ydn.zenfs.com/blogs/22/DSC_4161.jpg"><img src="http://ydn.zenfs.com/blogs/22/DSC_4161.jpg" alt="" title="Apache Hadoop Innovation Award" width="320" height="214" class="alignright size-full wp-image-4811"/></a>The Hadoop project won the top MediaGuardian Innovation award. </p>

<blockquote><a href="http://www.guardian.co.uk/gnm-press-office/mediaguardian-innovation-awards-2011">A groundbreaking open source project has won the top prize at the 2011 MediaGuardian Innovation Awards.</a><p>The judging panel described the Apache Hadoop project as the Swiss army knife of the 21st Century, and having the potential to completely change the face of media innovations across the globe. Overall, the project was seen as a greater catalyst for innovation than WikiLeaks, the iPad and a host of other suggested nominees.</p></blockquote>

<p>All of the Hadoop contributors should be very proud of this award. Sanjay Radia, Jakob Homan, and I attended in person as members of the Hadoop Project Management Committee to receive the award on behalf of the project.</p>

<p>I've been working on Hadoop full time since the beginning and it has been a pleasure working with such bright and dedicated engineers. It takes a village to raise an elephant from a prototype that runs on a few nodes to the project that is disrupting the big data industry.</p>

<p>Congratulations Hadoop team!
<a href="http://ydn.zenfs.com/blogs/22/DSC_4160.jpg"><img src="http://ydn.zenfs.com/blogs/22/DSC_4160.jpg" alt="" title="Owen, Sanjay, and Jakob at innovation award presentation" width="320" height="213" class="alignnone size-full wp-image-4821"/></a></p>

<p><strong>Update:</strong> Here is a picture from the Guardian:</p>

<p><a href="http://static.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/3/28/1301305545572/innovator-of-year-apache-hadoop.jpg"><img src="http://static.guim.co.uk/sys-images/Guardian/Pix/pictures/2011/3/28/1301305545572/innovator-of-year-apache-hadoop.jpg"></a></p>

<p>Iain Lee presenting the award to Sanjay, Jakob, and Owen on behalf of the project.</p>
]]></content:encoded>
			<wfw:commentRss>http://developer.yahoo.com/blogs/hadoop/posts/2011/03/hadoop-project-wins-innovation-award/feed/rss2/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	<media:content url="http://ydn.zenfs.com/blogs/22/DSC_4161.jpg" type="image/jpeg" width="1280" height="857" />	</item>
		<item>
		<title>Next Generation of Apache Hadoop MapReduce &#8211; The Scheduler</title>
		<link>http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/</link>
		<comments>http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/#comments</comments>
		<pubDate>Thu, 17 Mar 2011 01:22:26 +0000</pubDate>
		<dc:creator>Arun C Murthy</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://developer.yahoo.com/blogs/hadoop/?p=4141</guid>
		<description><![CDATA[Introduction The previous post in this series covered the next generation of Apache Hadoop MapReduce in a broad sense, particularly its motivation, high-level architecture, goals, requirements, and aspects of its implementation. In the second post in a series unpacking details of the implementation, we’d like to present the protocol for resource allocation and scheduling that [...]]]></description>
			<content:encoded><![CDATA[<h2>Introduction</h2>

<p>The <a href="http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/">previous post</a> in this series covered the next generation of Apache Hadoop MapReduce in a broad sense, particularly its motivation, high-level architecture, goals, requirements, and aspects of its implementation. </p>

<p>In this second post in the series, which unpacks details of the implementation, we present the protocol for resource allocation and scheduling that drives application execution on a Next Generation Apache Hadoop MapReduce cluster.</p>

<h2>Background</h2>

<p>Apache Hadoop must scale reliably and transparently to handle the load of a modern production cluster on commodity hardware. One of the most painful bottlenecks in the MapReduce framework has been the JobTracker, the daemon responsible not only for tracking and managing machine resources across the cluster, but also for enforcing the execution semantics for all the queued and running MapReduce jobs. The fundamental shift we hope to effect takes these two complex and interrelated concerns and refactors them into separate entities.</p>

<p>In the proposed new architecture, a global ResourceManager (RM) tracks machine availability and scheduling invariants, while a per-application ApplicationMaster (AM) runs inside the cluster and tracks the program semantics for a given job. An application can be a single MapReduce job, as the JobTracker supports today; a directed acyclic graph (DAG) of MapReduce jobs; or a new framework. Each machine in the cluster runs a per-node daemon, the NodeManager (NM), responsible for enforcing and reporting the resource allocations made by the RM and monitoring the lifecycle of processes spawned on behalf of an application. Each process started by the NM is conceptually a container, or a bundle of resources allocated by the RM.</p>

<p><img src="http://ydn.zenfs.com/blogs/22/MapReduce_NextGen.jpg" alt="Next Generation Apache Hadoop MapReduce Architecture" title="Next Generation Apache Hadoop MapReduce Architecture" width="624" height="386" class="aligncenter size-full wp-image-4481"/></p>

<h2>Scheduler</h2>

<p>The ResourceManager is the central authority that arbitrates resources among various competing applications in the cluster. </p>

<p>The ResourceManager has two main components:</p>

<ul>
<li>Scheduler </li>
<li>ApplicationsManager</li>
</ul>

<p>The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. It also offers no guarantees on restarting failed containers, whether they fail because of application failure or hardware failure.</p>

<p>The Scheduler performs its scheduling function based on the resource requirements of the applications. Applications express their resource requirements using the abstract notion of a <strong>resource request</strong>, which incorporates elements such as memory, CPU, disk, network etc.</p>

<p>The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.</p>

<p>This post focuses on providing details about the Scheduler, the resource-model, the scheduling algorithm etc.</p>

<h3>Resource Model</h3>

<p>MapReduce NextGen supports a very generic resource model that allows cluster resources such as CPU, memory, disk and network bandwidth, local storage etc. to be efficiently scheduled among applications.</p>

<p>The result of a resource request is a <strong>container</strong>. A container is a conceptual entity that grants an application the privilege to use a certain amount of resources on a given machine to run a component task. </p>

<p>The Scheduler, in v1, models every machine in the system as composed of multiple containers (with a lower-limit on the size of memory, say 512MB or 1 GB). </p>

<p>The ApplicationMaster requests a container along with the resources (amount of memory, CPU capability, etc.) it needs. It can request multiple types of containers, each with different resource needs. In addition, it can optionally specify location preferences such as a specific machine or rack; this allows for first-class support of the MapReduce model of performing the computation close to the data being processed.</p>

<p>Thus, unlike the current implementation of Hadoop MapReduce, the cluster is not artificially segregated into map and reduce slots. Every container is fungible, yielding huge benefits for cluster utilization. Currently, a well-known problem with Hadoop MapReduce is that jobs are bottlenecked on reduce slots, and the lack of fungible resources is a severe limiting factor.</p>

<h3>Resource Request</h3>

<p>As described in the previous section, the ApplicationMaster can ask for resources with specific capabilities (memory, CPU etc.) on specific machines or racks. The ApplicationMaster can also ask for different kinds of resources for differing needs (e.g. maps and reduces have different resource needs in the MapReduce paradigm). Among its various resource requests, the ApplicationMaster can use request priorities to influence the ordering of containers allocated by the Scheduler (e.g. run maps before reduces in MapReduce).
All outstanding resource requests are serviced strictly in priority order, as defined by the application.</p>

<p>The ApplicationMaster can ask for any number of containers of a given request type (<em>&lt;priority, capability></em>).</p>

<p>Thus, the resource request understood by the Scheduler is of the form:</p>

<p><code>&lt;priority, (hostname/rackname/*), capability, #containers></code></p>
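
<p>As a rough illustration only, the tuple above could be modeled as a small record. The sketch below is Python rather than Hadoop code; the field names (<code>location</code>, <code>memory_mb</code>, <code>num_containers</code>) and the priority values are assumptions made for this example, and capability is reduced to memory for brevity.</p>

```python
from dataclasses import dataclass

ANY = "*"  # the special "any" location from the request form


@dataclass(frozen=True)
class ResourceRequest:
    """Hypothetical model of one (priority, location, capability, #containers) tuple."""
    priority: int        # ordering among an application's requests
    location: str        # a hostname, a rack name, or ANY
    memory_mb: int       # capability, reduced to memory in this sketch
    num_containers: int  # how many containers are wanted at this location


# Two 1 GB map containers anywhere in the cluster, and one 2 GB reduce container.
# The priority values are illustrative; the post does not prescribe a numbering.
map_any = ResourceRequest(priority=1, location=ANY, memory_mb=1024, num_containers=2)
reduce_any = ResourceRequest(priority=2, location=ANY, memory_mb=2048, num_containers=1)
```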

<h3>Resource Negotiation</h3>

<p>The ApplicationMaster is responsible for computing the resource requirements of the application. In an MR application, the input-splits express its locality preferences. The MR ApplicationMaster translates them into resource requests for the Scheduler.</p>

<p>The expected mode of operation is for applications to present all of their resource requests upfront, on start-up, to the extent feasible. Of course, the protocol allows for applications to present updates to the resource requests via deltas.</p>

<p>As described above the resource request is of the form: <em>&lt;priority, (hostname/rackname/*), capability, #containers></em>.</p>

<p>Thus, for MapReduce applications, the MapReduce ApplicationMaster takes the input-splits and generates resource requests by inverting the split-to-host mappings, then annotating each request with a priority (map vs. reduce containers) and the number of containers required on each host. The resource requests are also aggregated by rack and then by the special any (*) for all containers. All resource requests are subject to change via the delta protocol.</p>

<p>An illustrative example:</p>

<p>Consider a MapReduce job with 2 maps, each requiring 1G of RAM and 1 reduce requiring 2G of RAM.</p>

<p>The input-splits are:
<a href="http://ydn.zenfs.com/blogs/22/input_splits.jpg"><img src="http://ydn.zenfs.com/blogs/22/input_splits.jpg" alt="Input Splits" title="Input Splits" width="397" height="115" class="aligncenter size-full wp-image-4541"/></a></p>

<p>The corresponding resource requests from the ApplicationMaster to the Scheduler:</p>

<p><a href="http://ydn.zenfs.com/blogs/22/initial.jpg"><img src="http://ydn.zenfs.com/blogs/22/initial.jpg" alt="Initial ResourceRequests" title="Initial ResourceRequests" width="344" height="319" class="aligncenter size-full wp-image-4571"/></a></p>

<p>Notice how the containers field is aggregated across the different hosts, racks and <strong>any (*)</strong> for both maps and reduces.</p>
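
<p>The aggregation for the map requests can be sketched in a few lines of Python. This is not the MapReduce ApplicationMaster's code: the function name <code>build_map_requests</code> is hypothetical, and the split locations are reconstructed from the host and rack names used in the text of this post rather than read from the figures, so treat them as assumptions.</p>

```python
from collections import Counter

# Assumed replica locations for each map's input split, as (rack, host) pairs.
SPLITS = {
    "map0": [("r11", "h1001"), ("r22", "h2121"), ("r31", "h3118")],
    "map1": [("r11", "h1010"), ("r22", "h2121"), ("r45", "h4123")],
}


def build_map_requests(splits):
    """Invert the split-to-host mapping into per-host, per-rack and '*' counts."""
    hosts, racks = Counter(), Counter()
    for locations in splits.values():
        # Count each task once per distinct host and once per distinct rack.
        for host in {host for _, host in locations}:
            hosts[host] += 1
        for rack in {rack for rack, _ in locations}:
            racks[rack] += 1
    # Every task can run anywhere, so '*' carries the total number of tasks.
    return {"hosts": dict(hosts), "racks": dict(racks), "*": len(splits)}


requests = build_map_requests(SPLITS)
print(requests["hosts"]["h2121"], requests["racks"]["r11"], requests["*"])  # prints: 2 2 2
```

<p>The size of the result depends on the number of distinct hosts and racks that appear in the splits, not on the number of tasks.</p>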

<p>The main advantage of the proposed model is that it is extremely compact in terms of the amount of state necessary, per application, on the ResourceManager for scheduling and the amount of information passed between the ApplicationMaster &amp; ResourceManager. This is crucial for scaling the ResourceManager. The amount of information, per application, in this model is always <code>O(cluster size)</code>, whereas in the current Hadoop MapReduce JobTracker it is <code>O(number of tasks)</code>, which could run into hundreds of thousands of tasks. For large jobs it is sufficient to ask for containers only on racks rather than on specific hosts: each rack holds many appropriate resources (i.e. input splits for MapReduce applications), so the ApplicationMaster can put any container on that rack to good use.</p>

<h3>Scheduling</h3>

<p>The Scheduler will try to match appropriate machines for the application; it can also provide resources on the same rack or a different rack if the specific machine is unavailable. At times, depending on how busy the cluster is, the ApplicationMaster might receive resources that are not the most appropriate, or receive them too late; it can then reject them by returning them to the Scheduler without being charged for them.</p>

<p>The Scheduler is aware of the network topology of the cluster and uses the rack and any (*) fields as gating factors to ensure that ApplicationMasters do not get inappropriate resources, without needing very detailed task-level information.</p>

<p>For instance, in the example above, the Scheduler doesn’t need to know that <em>map0</em> can run on <code>&lt;r11/h1001, r22/h2121, r31/h3118></code>.</p>

<p>If a container is available on <code>r22/h2121</code>, the Scheduler allocates it to the application and invalidates requests on <code>h2121</code>, <code>r22</code> and <code>*</code> by updating the resource request list as follows:</p>

<p><a href="http://ydn.zenfs.com/blogs/22/map0_inputsplit.jpg"><img src="http://ydn.zenfs.com/blogs/22/map0_inputsplit.jpg" alt="map0 container" title="map0 container" width="397" height="115" class="aligncenter size-full wp-image-4601"/></a></p>

<p><a href="http://ydn.zenfs.com/blogs/22/map0.jpg"><img src="http://ydn.zenfs.com/blogs/22/map0.jpg" alt="map0" title="map0" width="344" height="319" class="aligncenter size-full wp-image-4611"/></a></p>

<p>The ApplicationMaster can then, optionally, invalidate requests for <code>h1001</code>, <code>h3118</code>, <code>r11</code> and <code>r31</code> after it decides to run <em>map0</em> on <code>h2121</code>.</p>

<p>Now, if a container is available on <code>r11/h1010</code>, the Scheduler allocates it to the application and invalidates requests on <code>h1010</code>, <code>r11</code> and <code>*</code> as follows:</p>

<p><a href="http://ydn.zenfs.com/blogs/22/map1_inputsplit.jpg"><img src="http://ydn.zenfs.com/blogs/22/map1_inputsplit.jpg" alt="map1 inputsplit" title="map1 inputsplit" width="397" height="115" class="aligncenter size-full wp-image-4621"/></a></p>

<p><a href="http://ydn.zenfs.com/blogs/22/map1.jpg"><img src="http://ydn.zenfs.com/blogs/22/map1.jpg" alt="map1" title="map1" width="344" height="319" class="aligncenter size-full wp-image-4631"/></a></p>

<p>The ApplicationMaster can then, optionally, invalidate requests for <code>h2121</code>, <code>h4123</code>, <code>r22</code> and <code>r45</code> after it decides to run <em>map1</em> on <code>h1010</code>.</p>

<p>The important details of the scheduling policy are:</p>

<ul>
<li>When the number of required containers reaches 0 on a <em>host</em>, the Scheduler will stop granting further containers on that node to the application at that priority/capability. </li>
<li>When the number of required containers reaches 0 on a <em>rack</em>, the Scheduler will stop granting further containers on any node in that rack to the application at that priority/capability. </li>
<li>When <code>*</code> reaches 0 for an application, the Scheduler stops granting further containers on any node for the application at that priority/capability.</li>
</ul>
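
<p>A minimal sketch of this bookkeeping, in Python and for a single priority/capability, might look like the following. The names (<code>outstanding</code>, <code>can_grant</code>, <code>grant</code>) and the starting counts are assumptions based on the two-map example; the Scheduler's actual data structures are not described in this post. The sketch also handles only grants on a requested host and leaves out the same-rack and different-rack fallbacks mentioned earlier.</p>

```python
# Outstanding container counts for one application at one priority/capability.
# Values are assumed from the two-map example in this post.
outstanding = {
    "h1001": 1, "h1010": 1, "h2121": 2, "h3118": 1, "h4123": 1,  # hosts
    "r11": 2, "r22": 2, "r31": 1, "r45": 1,  # racks
    "*": 2,  # any
}


def can_grant(requests, rack, host):
    """Gate on host, rack and '*': a zero (or missing) count blocks the grant."""
    return all(requests.get(key, 0) > 0 for key in (host, rack, "*"))


def grant(requests, rack, host):
    """Record one allocated container by decrementing host, rack and '*'."""
    if not can_grant(requests, rack, host):
        return False
    for key in (host, rack, "*"):
        requests[key] -= 1
    return True


first = grant(outstanding, "r22", "h2121")   # container for map0
second = grant(outstanding, "r11", "h1010")  # container for map1
third = grant(outstanding, "r31", "h3118")   # refused: '*' has reached 0
print(first, second, third)  # prints: True True False
```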

<p>The Scheduler aggregates resource requests from all the running applications to build a global plan to allocate resources. The Scheduler then allocates resources based on application-specific constraints such as appropriate machines and global constraints such as capacities of the application, queue, user etc.</p>

<p>The CapacityScheduler plug-in uses familiar notions of Capacity and Capacity Guarantees as the policy for arbitrating resources among competing applications. </p>

<p>The scheduling algorithm is straightforward: </p>

<ul>
<li>Pick the most under-served queue in the system.</li>
<li>Pick the highest priority application in that queue.</li>
<li>Serve the application's resource asks appropriately.</li>
</ul>
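
<p>The three steps can be sketched as a toy selection function. This Python sketch is not the CapacityScheduler plug-in: the data layout, the function name <code>pick_next_application</code>, the reading of "most under-served" as the lowest ratio of used to guaranteed capacity, and the convention that a smaller number means higher priority are all assumptions made for illustration.</p>

```python
def pick_next_application(queues):
    """Return (queue name, application name) to serve next, or None if idle."""
    # Only queues with waiting applications compete for resources.
    busy = [q for q in queues if q["apps"]]
    if not busy:
        return None
    # Step 1: most under-served queue, read here as lowest used/guaranteed ratio.
    queue = min(busy, key=lambda q: q["used"] / q["guaranteed"])
    # Step 2: highest-priority application (assumed: smaller number wins).
    app = min(queue["apps"], key=lambda a: a["priority"])
    # Step 3, serving the application's asks, is left to the caller.
    return queue["name"], app["name"]


# Hypothetical queues and applications, invented for this example.
queues = [
    {"name": "research", "used": 30, "guaranteed": 40,
     "apps": [{"name": "etl", "priority": 2}]},
    {"name": "production", "used": 20, "guaranteed": 60,
     "apps": [{"name": "report", "priority": 3}, {"name": "index", "priority": 1}]},
]
print(pick_next_application(queues))  # prints: ('production', 'index')
```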

<p>The FairScheduler plug-in uses fairness as the main policy for arbitrating resources among active applications.</p>

<p>Note: A rogue ApplicationMaster could, in any scenario, request more containers than necessary - the Scheduler uses application-limits, user-limits, queue-limits etc. to protect the cluster from abuse.</p>

<h3>Scheduler API</h3>

<p>There is a single API between the Scheduler and the ApplicationMaster:</p>

<p><code>(List&lt;Container> newContainers, List&lt;ContainerStatus> containerStatuses)
allocate(List&lt;ResourceRequest> ask, List&lt;Container> release)</code></p>

<p>The AM asks for specific resources via a list of ResourceRequests (ask) and releases unnecessary Containers that were allocated by the Scheduler (release).</p>

<p>The response contains a list of newly allocated Containers and the statuses of application-specific Containers that completed since the previous interaction between the AM and the RM.</p>
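
<p>To make the shape of this exchange concrete, here is a toy, in-memory stand-in written in Python. It is a sketch under heavy assumptions, not the ResourceManager interface: the class <code>ToyScheduler</code>, the tuple layouts used for asks and containers, and the free-memory bookkeeping are all invented for this example, and the list of completed-container statuses is always returned empty.</p>

```python
class ToyScheduler:
    """Hypothetical stand-in for the allocate(ask, release) exchange."""

    def __init__(self, free_mb_by_host):
        self.free_mb = dict(free_mb_by_host)  # host name to free memory in MB
        self.next_id = 0

    def allocate(self, ask, release):
        """ask: list of (host, memory_mb, count); release: list of containers.

        Returns (new_containers, container_statuses), mirroring the two lists
        in the response described in this post.
        """
        # Released containers hand their memory back to the host.
        for _container_id, host, memory_mb in release:
            self.free_mb[host] += memory_mb
        new_containers = []
        for host, memory_mb, count in ask:
            for _ in range(count):
                if self.free_mb.get(host, 0) >= memory_mb:
                    self.free_mb[host] -= memory_mb
                    new_containers.append((self.next_id, host, memory_mb))
                    self.next_id += 1
        return new_containers, []  # statuses are not modeled in this sketch


scheduler = ToyScheduler({"h2121": 2048})
# Ask for three 1 GB containers on a host with only 2 GB free: two are granted.
granted, statuses = scheduler.allocate(ask=[("h2121", 1024, 3)], release=[])
# Hand one container back; the same call can then reuse its memory.
regranted, _ = scheduler.allocate(ask=[("h2121", 1024, 1)], release=[granted[0]])
```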

<h3>Resource Monitoring</h3>

<p>The Scheduler receives periodic information from the NodeManagers about the resource usage of allocated resources. The Scheduler also makes the status of completed Containers available to the appropriate ApplicationMaster.</p>

<p>In the future, obvious enhancements to the Scheduler include using the per-container resource usage information provided by the NodeManagers to over-commit resources.</p>

<h2>Conclusions</h2>

<p>We believe the proposed ideas for the Scheduler in the Next Generation of Hadoop MapReduce are both novel and crucial for the fine-grained scheduling required to meet our stated goals for Apache Hadoop MapReduce. They allow the Scheduler to scale to very large clusters and still provide very fine-grained and efficient scheduling for exploiting data-locality, performance and reliability.</p>

<p>We expect and encourage other cluster schedulers/managers to incorporate our ideas and we look forward to learning from their experiences too.</p>
]]></content:encoded>
			<wfw:commentRss>http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/feed/rss2/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>