• Hadoop at Yahoo! Sets New Gray Sort Record – The Yellow Elephant is Getting Faster

    We are proud to announce that we used Apache Hadoop to set a new record in the Gray sort category of Jim Gray's sort benchmark. Sorting at a rate of 1.42 terabytes per minute, we nearly doubled the previous Gray sort record of 0.725 terabytes per minute.

    Jim Gray's sort benchmark is a set of related benchmarks, each with its own rules. All of the sort benchmarks measure the time to sort different numbers of 100-byte records. The first 10 bytes of each record are the key and the rest is the value. The Gray sort measures the sort rate achieved while sorting at least 100 terabytes of data. The Minute sort measures the amount of data that can be sorted in less than a minute. There are two benchmark categories. The Daytona category requires the sort code to be a general-purpose sort. The Indy category only needs to sort 100-byte records with 10-byte keys. We used Hadoop Terasort with slightly different configurations in both categories.
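
    To make the record layout concrete, here is a minimal single-machine sketch in Python. It is not the Hadoop Terasort code used for the record; the helper names (make_records, sort_records, sort_rate_tb_per_min) are hypothetical and exist only for illustration. It assumes keys are compared as raw bytes and, like the Indy category, handles only 100-byte records with 10-byte keys.

```python
import random

RECORD_SIZE = 100  # bytes per record, as described in the benchmark rules above
KEY_SIZE = 10      # the first 10 bytes of each record are the key


def make_records(count, seed=0):
    """Generate `count` pseudo-random 100-byte records (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        bytes(rng.getrandbits(8) for _ in range(RECORD_SIZE))
        for _ in range(count)
    ]


def sort_records(records):
    """Sort records by their 10-byte key, comparing keys as raw bytes (assumption)."""
    return sorted(records, key=lambda record: record[:KEY_SIZE])


def sort_rate_tb_per_min(terabytes, minutes):
    """Sort rate in terabytes per minute, the unit quoted for the Gray sort."""
    return terabytes / minutes


if __name__ == "__main__":
    ordered = sort_records(make_records(1000))
    # Every record keeps its fixed size and keys come out in non-decreasing order.
    assert all(len(record) == RECORD_SIZE for record in ordered)
    assert all(a[:KEY_SIZE] <= b[:KEY_SIZE] for a, b in zip(ordered, ordered[1:]))
    # Ratio of the two rates quoted in the post: 1.42 vs 0.725 TB per minute.
    print(round(1.42 / 0.725, 2))  # prints 1.96
```

    The ratio of roughly 1.96 is what "nearly doubled" refers to. A Daytona entry would need a general-purpose sort rather than one hard-coded to this record layout.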

    There were

    Read More »from Hadoop at Yahoo! Sets New Gray Sort Record – The Yellow Elephant is Getting Faster
  • Storm-YARN Released as Open Source

    By Bobby Evans and Andy Feng (@afeng76)

    At Yahoo! we have worked on the convergence of Storm with Hadoop, as mentioned in our earlier post. We are pleased to announce that Storm-YARN has been released as open source. Storm-YARN enables Storm applications to utilize the computational resources in a Hadoop cluster along with accessing Hadoop storage resources such as HBase and HDFS.


    Collocating real-time processing with batch processing offers a number of advantages over segregated clusters.

    • It provides a huge potential for elasticity. Real-time processing will rarely produce a constant and predictable load. As such, Storm needs more resources to keep up with spikes in demand. Collocating Storm with batch processing allows Storm to steal resources from batch jobs when needed and give them back when demand subsides. The Storm-YARN effort lays the groundwork to make this possible.
    • Many applications use Storm for low-latency processing and Map/Reduce for batch processing while
  • Apache HBase at Yahoo! – Multi-tenancy at the Helm Again

    By Francis Liu and Sumeet Singh

    In 2009-2010, Yahoo! saw unprecedented growth in the number of users coming onboard to its Apache Hadoop platform for their data processing and analytics needs. We attribute much of that success and the increase in user base to the introduction of multi-tenancy, security, and partitioned namespaces in Hadoop.

    With Hadoop and its ecosystem components like Apache Pig and Apache Oozie getting popular at Yahoo!, we needed a solution to store mutable data and support random access to the stored data to complement the Apache Hadoop platform. Yahoo! had been using Apache HBase in isolated instances, most notably for the CORE personalization platform and for the web crawl cache at the time. However, the use of Apache HBase was limited to large projects that had the resources to operate dedicated HBase clusters.

    In 2012, Yahoo! developed multi-tenancy in Apache HBase to cater to a growing number of use cases where HBase was an excellent fit as part of its

  • Join Us for the 6th Annual Hadoop Summit in San Jose, CA


    Hortonworks and Yahoo! are pleased to host the 6th Annual Hadoop Summit, the leading conference for the Apache Hadoop community, on June 26-27, 2013 at the San Jose Convention Center. The two-day event will feature many of the Apache Hadoop thought leaders, who will showcase successful Hadoop use cases, share development and administration tips and tricks, and educate organizations about how best to leverage Apache Hadoop as a key component in their enterprise data architecture. It will also be an excellent networking opportunity for developers, architects, administrators, data analysts and data scientists interested in advancing and extending Apache Hadoop.

    Popular sessions include:

    • Applied Hadoop
    • Scaling Big Data Mining Infrastructure: the Twitter Experience
    • HDFS, What's New and Future
    • Past, Present and Future of Data Processing in Apache Hadoop
    • Analysing 1.4 Trillion Events with Hadoop
    • Hadoop Operations at LinkedIn
    • Enterprise Integration of Disruptive
  • Hadoop at Yahoo!: More Than Ever Before

    A lot has changed at Yahoo! last year. We have new leaders, we gained millions in new audience*, we saw engagement gains from Social Bar, and we released several successful mobile apps such as Flickr and Yahoo! Mail. But with all that change, there is one thing that has remained constant, and that is our commitment to pioneering new ground for Hadoop.

    I was well aware of the rich legacy behind Hadoop at Yahoo! when I started in the Cloud Engineering Group about eight months ago. What I was perhaps not fully aware of was the talent and energy of our engineering team (watch the 2 min Hadoop Summit 2012 video), a team eager to push the scale and efficiency boundaries of Hadoop to deliver tangible business results for Yahoo!. We have really come together as a customer-focused group with tight alignment on our strategy, vision, and roadmap, and with a continued commitment to stay true to the Apache Software Foundation and contribute 100% of our development work back to the community.

    Hadoop at

  • Hadoop Summit 2011 – A Different Approach

    Hadoop Summit 2011 is over. If you saw this tweet, “#hadoopsummit planned for 1,500. upped on demand to 1,600. finally accommodated 1,700. ran out of space, good problem to have. :-)”, then you probably got an idea of how exciting and mobbed the conference was this year. With folks dropping by from coast to coast, and quite a few from around the world, Hadoop Summit 2011 will quite likely be the year’s largest Hadoop gathering. But even more so, because of the passion of everyone who participated, it was also the best Hadoop gathering of the year, raising the bar yet again for Hadoop technical content and networking.

    At the Summit and since it ended, I have received questions from folks who attended the show and some who couldn’t make it. In general, a lot of people were curious about what went into developing the Summit and the approach we took to the Summit. I thought I’d take some time today and summarize my thoughts on this topic.

    Obviously, in conference planning, a lot of the

  • Fourth Annual Hadoop Summit: The Countdown Begins!

    On June 29, Yahoo! will host the 4th annual Hadoop Summit at the Santa Clara Convention Center. Hadoop Summit 2011 brings together some of the most influential thought leaders in the space - from Yahoo!, Facebook, IBM, NetApp, and others.

    Jay Rossiter, Senior Vice President of the Yahoo! Cloud Platform Group, will open the show with a keynote on how Yahoo! is developing the next generation of Hadoop applications to handle big data, the important role that Hadoop plays in Yahoo!’s integrated technology ecosystem, and how wide industry adoption of Hadoop is benefiting the entire community.

    Also on the main stage, Facebook will discuss its use of Hadoop to power the Facebook Messages infrastructure, and IBM will discuss how it used Hadoop to power its supercomputer, Watson.

    Additional conference highlights include some key sessions:

    * Next Generation Apache Hadoop MapReduce: Arun Murthy, Yahoo!’s lead architect on the Hadoop Map-Reduce development team, will lead a discussion on the next

  • Slides from eric14 talks @ #IbmBigData

    Hi Folks,

    Here are my slides from the IBM big data symposium. This was a good event. IBM announced a new release of their Apache Hadoop-based BigInsights platform. It is great to hear their commitment to Apache. Yahoo! was there talking about our experiences and uses of Hadoop. I got a lot of questions about why we invest in Hadoop, so let me point you back to my posts on that and on our commitment to Apache Hadoop (http://yhoo.it/e8p3Dd and http://yhoo.it/i9Ww8W).


  • Hadoop Summit CFP closing tomorrow!

    Stack and I are the track organizers for the community track at the Hadoop Summit this year. The community track is for presentations on roadmap, developments and features in Apache Hadoop. So if you've added a new feature to Hadoop and want to publicize it to the world's largest and most important Hadoop conference, please submit it!


    The deadline is 6 May, which is tomorrow!

  • Call for participation in the Hadoop Summit Research Track

    Hadoop Summit is a great annual gathering of developers to talk about all things Hadoop. The attendance is great (we are expecting 2,000 this year); the presentations are excellent; and the hallway conversations are a great way to meet new people and come up with new ideas.

    This environment is especially great if you have a great idea that you would like to share with the community. You will have a great audience of knowledgeable developers that you can try to convince to help you to take your work to the next level. Doesn't it sound ... great!?!

    Milind and I are organizing the research and application track. If you have built some new framework on top of Hadoop or made Hadoop better, let us know. We will be selecting the most interesting results for the research and application track.

    General information for the Hadoop Summit is at http://hadoopsummit.org. You can submit an abstract for your presentation at https://developer.yahoo.com/events/hadoopsummit2011/presentationguidelines.html


