Celebrate a Decade of Excellence with Apache Hadoop and Save 20% on Registration at Hadoop Summit 2016 San Jose
<hr><figure class="tmblr-full" data-orig-height="640" data-orig-width="640"><img src="https://66.media.tumblr.com/ddf0791df6e8b48f70c51c673f64dd0e/tumblr_inline_o8vvekNaIs1t17fny_540.png" data-orig-height="640" data-orig-width="640"/></figure><p>We are excited to co-host the 9th Annual <a href="http://2016.hadoopsummit.org/san-jose" target="_blank">Hadoop Summit</a>, the leading conference for the <a href="http://hadoop.apache.org/" target="_blank">Apache Hadoop</a> community, taking place June 28-30 at the McEnery Convention Center in San Jose, California. This year’s Hadoop Summit features more than 200 speakers across 9 tracks and over 170 breakout sessions where attendees will learn about innovative use cases, development and administration tips and tricks, the cutting edge in project developments from the committers, and how the community is driving and accelerating Hadoop’s global adoption.</p><p>The Summit is expected to bring together more than 5,500 community members, presenting excellent opportunities for software developers, architects, administrators, data analysts, and data scientists to learn from each other in advancing, extending, or implementing Hadoop.</p><p>As in prior years, we continue Yahoo’s decade-long tradition of thought leadership with Apache Hadoop at the 2016 Summit. If you are in attendance, come encourage fellow Yahoos as they showcase their work on the latest in Hadoop and related Big Data technologies such as Apache Storm, Tez, HBase, Hive, Oozie, and Distributed Deep Learning.</p><hr><p><b>DAY 1. TUESDAY June 28, 2016</b><br/></p><blockquote><p><b>12:20 - 1:00 P.M. Faster, Faster, Faster!: The True Story of a Mobile Analytics Data Mart on Hive</b></p><p><i>Mithun Radhakrishnan – Principal Engineer, Apache Hive Committer, </i></p><p><i>Josh Walters – Sr. Software Engineer</i></p><p>As Yahoo continues to grow and diversify its mobile products, its platform team faces an ever-expanding, sophisticated group of analysts and product managers who demand faster answers: to measure key engagement metrics like Daily Active Users, app installs, and user retention; to run longitudinal analyses; and to re-calculate metrics over rolling time windows. This talk will examine the efficacy of using Hive for large-scale mobile analytics.</p></blockquote><p><br/></p><blockquote><p><b>3:00 – 4:00 P.M. Investigating the Effects of Over Committing YARN Resources</b></p><p><i>Jason Lowe – Distinguished Engineer, Apache Hadoop and Tez PMC and Committer</i></p><p>YARN requires applications to specify the size (in MB and VCores) of the containers they wish to utilize. Applications need to request sufficient resources so that their containers never run out, which leads to significant amounts of unutilized resources. On clusters with thousands of nodes, this can result in millions of dollars of unused capacity. The YARN community is actively working to address this problem. In the shorter term, Yahoo has developed a simple approach that quickly provides useful insights into both the efficacy of over-committing resources and some of the key issues that may be encountered. This talk will describe the dynamic over-commit implementation that Yahoo is running at scale, along with results and pitfalls encountered.</p></blockquote><p><b><br/></b></p><p><b>DAY 2. WEDNESDAY June 29, 2016</b><br/></p><blockquote><p><b>9:00 - 11:00 A.M. 
Yahoo Keynote</b></p><p><i>Peter Monaco – VP, Engineering, Communications Products</i></p><p>This keynote will address how Yahoo’s Mail and Communications applications have successfully used Hadoop and its ecosystem components.</p></blockquote><p><br/></p><blockquote><p><b>12:20 - 1:00 P.M. Performance Comparison of Streaming Big Data Platforms</b></p><p><i>Reza Farivar – Capital One, Apache Storm Contributor, </i></p><p><i>Kyle Nusbaum – Software Engineer, Apache Storm PMC</i></p><p>Yahoo has been using Storm extensively, and the number of nodes running Storm has reached about 2,300 (and is still growing). However, several noteworthy competitors, including Apache Flink and Apache Spark Streaming, are gaining attention. To choose the best streaming tools for our needs, we decided to write a benchmark as close to real-world use cases as possible. In this session, we will examine how these streaming platforms performed against our benchmark tests and discuss which is most appropriate for your big data real-time streaming needs.</p></blockquote><p><b><br/></b></p><blockquote><p><b>2:10 - 2:50 P.M. Yahoo’s Next-Generation User Profile Platform</b></p><p><i>Kai Liu – Sr. Engineering Manager, </i></p><p><i>Lu Niu – Software Engineer</i></p><p>User profiles are crucial to the success of any targeting and personalization system. Yahoo receives hundreds of billions of user events every day, covering a variety of user activities including app usage, page views, search queries, ad views, and ad clicks. In this presentation, we’ll talk about how we designed a modern user profile system using a hybrid architecture that supports fast data ingestion, random access, and interactive ad-hoc queries. We’ll show you how we built the system with Spark, HBase, and Impala to achieve these goals.</p></blockquote><p><br/></p><blockquote><p><b>3:00 - 3:40 P.M. Omid: A Transactional Framework for HBase</b></p><p><i>Francisco Perez-Sorrosal – Research Engineer, Omid Committer, </i></p><p><i>Ohad Shacham – Research Scientist, Omid Committer</i></p><p>Omid is a high-performance ACID transactional framework with Snapshot Isolation for HBase, and it requires no modification to HBase itself. Most NoSQL databases do not provide OLTP support, giving up transactions in exchange for greater agility and scalability. However, fault-tolerant transactions are essential in many applications in the Hadoop ecosystem, especially in incremental content processing systems. Omid enables these applications to benefit from both the scalability provided by NoSQL datastores and the concurrency and atomicity provided by transaction processing. Omid is now open source. It provides a reliable, high-performance, and easy-to-program platform capable of serving transactional web-scale applications based on HBase.</p></blockquote><p><br/></p><blockquote><p><b>3:00 - 3:40 P.M. Building and Managing Data Pipelines with Complex Dependencies Using Apache Oozie</b></p><p><i>Purushotam Shah – Senior Software Engineer, Apache Oozie PMC and Committer</i></p><p>At Yahoo, Apache Oozie is the standard for building and operating large-scale data pipelines and is responsible for over 80% of the 34 million monthly jobs processed on the Hadoop platform. In this talk, we will describe the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. 
In particular, we will focus on recent advancements in Oozie for dependency management among pipeline stages; incremental and partial processing; combinatorial, conditional, and optional processing; priority processing; late processing and reprocessing; SLA monitoring; administration; and BCP management. We will conclude the talk with enhancement ideas for future releases.</p></blockquote><p><br/></p><blockquote><p><b>5:50 – 6:30 P.M. Resource Aware Scheduling in Storm</b></p><p><i>Boyang (Jerry) Peng – Software Engineer, Apache Storm PMC and Committer</i></p><p>Apache Storm is one of the most popular stream processing systems in industry today and is the primary platform used for stream processing at Yahoo. However, Storm, like many other stream processing systems, lacks an intelligent scheduling mechanism, so we designed and implemented resource-aware scheduling in Storm. The Resource-Aware Scheduler (RAS) uses specialized scheduling algorithms to maximize resource utilization while minimizing network latency when scheduling applications, and multi-tenant support has already been added. In this presentation, we will introduce Resource-Aware Scheduling (RAS) in Storm and discuss how it has improved Storm’s performance and enabled Yahoo to overcome key challenges in operating stream processing systems in multi-tenant and heterogeneous environments.</p></blockquote><p><b><br/></b></p><p><b>DAY 3. THURSDAY June 30, 2016</b></p><blockquote><p><b>9:00 – 11:00 A.M. Yahoo Keynote</b></p><p><i>Mark Holderbaugh – Sr. Director, Engineering, Hadoop</i></p><p>This keynote will cover features introduced by the Hadoop team at Yahoo, such as dynamic over-commit for better resource utilization on clusters, Pig on Tez, and the Resource-Aware Scheduler (RAS) in Storm.</p></blockquote><p><br/></p><blockquote><p><b>11:30 A.M. - 12:10 P.M. Yahoo’s Experience Running Pig on Tez at Scale</b></p><p><i>Rohini Palaniswamy – Sr. Principal, Apache Pig, Oozie, Tez PMC, </i></p><p><i>Jon Eagles – Principal, Apache Hadoop, Tez PMC</i></p><p>Yahoo has always been among the first to adopt emerging Hadoop technologies, stabilizing them and running them at web scale in production well ahead of mainstream adoption: first Apache YARN, and now Apache Tez. Yahoo has the largest footprint of Apache Pig, with tens of thousands of scripts that power ETL and machine learning for major Yahoo properties. We have been migrating our scripts to run on Tez to capitalize on order-of-magnitude performance improvements and huge savings in resource consumption. In this session, we will present how the effort paid off, with actual performance and SLA numbers from production jobs, and analyze aggregate cluster utilization graphs from before and after the migration. We will share what we learned from running Tez successfully at scale and our experience making this paradigm shift from MapReduce to Tez.</p></blockquote><p><br/></p><blockquote><p><b>12:20 - 1:00 P.M. Distributed Deep Learning on Hadoop Clusters</b></p><p><i>Andy Feng – VP, Architecture, Apache Storm PMC, </i></p><p><i>Jun Shi – Principal Engineer</i></p><p>At Yahoo, we recently introduced distributed deep learning as a new capability of Hadoop clusters. These new clusters augment our existing CPU nodes and Ethernet connectivity with GPU nodes and InfiniBand connectivity. We developed a distributed deep learning solution, CaffeOnSpark, that enables deep learning tasks to be launched via the spark-submit command, as in any Spark application. 
In this talk, we will provide a technical overview of CaffeOnSpark and explain how it conducts deep learning in a private or public cloud (such as AWS EC2). We will share our experience at Yahoo through use cases (including photo auto-tagging), and discuss areas of collaboration with open source communities for Hadoop-based deep learning.</p></blockquote><p><br/></p><blockquote><p><b>2:10 – 2:50 P.M. Managing Hadoop, HBase, and Storm Clusters at Yahoo Scale</b></p><p><i>Dheeraj Kapur, Principal Engineer, </i></p><p><i>Savitha Ravikrishnan, Operations Engineer</i></p><p>Hadoop at Yahoo is a massive infrastructure and a challenging platform to manage. We have come a long way from taking full downtime for upgrades to requiring no downtime at all, while catering to massive workloads across the 40+ clusters in our ecosystem, spread over multiple data centers. Things get even more complex with multi-tenancy, differing workload characteristics, and strict SLAs on Hadoop, HBase, Storm, and other support services. We will talk about rolling upgrades and the automation and tools we have built to manage a massive grid infrastructure with support for multi-tenancy and full CI/CD.</p></blockquote><p><br/></p><blockquote><p><b>3:00 P.M. - 3:40 P.M. A Performance and Scalability Review of Apache Hadoop on Commodity HW Configurations</b></p><p><i>Sumeet Singh – Sr. Director, Cloud and Big Data Platforms, </i></p><p><i>Rajiv Chittajallu – Sr. Principal Engineer</i></p><p>Since its humble beginnings in 2006, Hadoop has come a long way in the last 10 years in its evolution as an open platform. In this talk, we will present a comprehensive review of Hadoop’s performance and scalability to validate how well the original design goals hold true. We intend to present performance and scale metrics from a representative cluster environment of 120 modern servers, using standard benchmark tests. Our focus will be on HDFS, YARN, MapReduce, Tez, Pig-on-Tez, Hive-on-Tez, and HBase. We will showcase Hadoop’s performance and throughput numbers and how Hadoop fares at utilizing system resources such as CPU, memory, disk, and network to make the best use of what is available. We will also provide similar metrics from a 40,000-server footprint running production workloads, so that our audience walks out with a solid baseline for Hadoop performance metrics.</p></blockquote><hr><p><b>Register Now with a Yahoo Discount</b></p><p>As a co-host of this event, Yahoo is pleased to offer a <b>20% discount</b> on the registration price. Enter promotional code <b>16SJspO20</b> to receive your discount during the registration process.</p><p>You, or your department, are responsible for the discounted registration fee and any travel expenses involved with attending the Hadoop Summit.</p><p>Register here for <a href="http://2016.hadoopsummit.org/san-jose/register/" target="_blank">Hadoop Summit, San Jose, California!</a></p>