Join Us at the 10th Annual Hadoop Summit / DataWorks Summit, San Jose (Jun 13-15)
<figure data-orig-width="360" data-orig-height="356" class="tmblr-full"><img src="https://66.media.tumblr.com/a9c7bb70f0a58fafa9f90032db85a609/tumblr_inline_oqdlyagIkK1t17fny_540.png" alt="image" data-orig-width="360" data-orig-height="356"/></figure><p>We’re excited to co-host the 10th Annual <a href="https://dataworkssummit.com/san-jose-2017/" target="_blank">Hadoop Summit</a>, the leading conference for the <a href="http://hadoop.apache.org/" target="_blank">Apache Hadoop</a> community, taking place June 13–15 at the <a href="http://sanjosemeetings.com/" target="_blank">San Jose Convention Center</a>. In the last few years, the Hadoop Summit has expanded to cover all things data beyond just Apache Hadoop – such as data science, cloud and operations, IoT and applications – and has been aptly renamed the <a href="https://dataworkssummit.com/" target="_blank">DataWorks Summit</a>. The three-day program is bursting at the seams! Here are just a few of the reasons why you cannot miss this must-attend event:</p><ul><li>Familiarize yourself with the cutting edge of Apache project development, straight from the committers</li><li>Learn from your peers and industry experts about innovative, real-world use cases; development and administration tips and tricks; and success stories and best practices for leveraging all your data – on-premises and in the cloud – to drive predictive analytics, distributed deep learning, and artificial intelligence initiatives</li><li>Attend one of our more than 170 technical deep-dive <a href="https://dataworkssummit.com/san-jose-2017/agenda/" target="_blank">breakout sessions</a> from nearly 200 speakers across eight tracks</li><li>Check out our keynotes, meetups, trainings, technical crash courses, birds-of-a-feather sessions, Women in Big Data and more</li><li>Attend the <a href="https://dataworkssummit.com/san-jose-2017/sponsors/" target="_blank">community showcase</a>, where you can network with sponsors and
industry experts, including a host of <a href="https://dataworkssummit.com/san-jose-2017/sponsors/" target="_blank">startups and large companies</a> like Microsoft, IBM, Oracle, HP, Dell EMC and Teradata</li></ul><p>As in previous years, we look forward to continuing Yahoo’s decade-long tradition of thought leadership at this year’s summit. Join us for an in-depth look at Yahoo’s Hadoop culture and for the latest in technologies such as Apache Tez, HBase, Hive, Data Highway Rainbow, Mail Data Warehouse and Distributed Deep Learning at the <a href="https://dataworkssummit.com/san-jose-2017/agenda/" target="_blank">breakout sessions</a> below. Or, stop by <b>Yahoo kiosk #700</b> at the <a href="https://dataworkssummit.com/san-jose-2017/sponsors/" target="_blank">community showcase</a>.</p><p>Also, as a co-host of the event, Yahoo is pleased to offer a <b>20% discount for the summit with the code YAHOO20</b>. Register <a href="https://dataworkssummit.com/san-jose-2017/attend/" target="_blank">here</a> for Hadoop Summit, San Jose, California!</p><hr><p><b>DAY 1. TUESDAY June 13, 2017</b></p><p><br/></p><blockquote><p><b>12:20 - 1:00 P.M. TensorFlowOnSpark - Scalable TensorFlow Learning on Spark Clusters</b></p><p><i>Andy Feng - VP Architecture, Big Data and Machine Learning</i></p><p><i>Lee Yang - Sr. Principal Engineer</i></p><p>In this talk, we will introduce TensorFlowOnSpark, a new framework for scalable TensorFlow learning that was open-sourced in Q1 2017. The framework enables easy experimentation with algorithm designs, and supports scalable training and inference on Spark clusters. It supports all TensorFlow functionality, including synchronous and asynchronous learning, model and data parallelism, and TensorBoard. It provides architectural flexibility for data ingestion into TensorFlow and in the network protocols used for server-to-server communication.
With a few lines of code changes, an existing TensorFlow algorithm can be transformed into a scalable application.</p></blockquote><p><br/></p><blockquote><p><b>2:10 - 2:50 P.M. Handling Kernel Upgrades at Scale - The Dirty COW Story</b></p><p><i>Samy Gawande - Sr. Operations Engineer</i></p><p><i>Savitha Ravikrishnan - Site Reliability Engineer</i></p><p>Apache Hadoop at Yahoo is a massive platform with 36 different clusters spread across YARN, Apache HBase, and Apache Storm deployments, totaling 60,000 servers made up of hundreds of different hardware configurations accumulated over generations, presenting unique operational challenges and a variety of unforeseen corner cases. In this talk, we will share methods, tips, and tricks for handling large-scale kernel upgrades on heterogeneous platforms within tight timeframes, with 100% uptime and no service or data loss, through the Dirty COW use case (a privilege-escalation vulnerability found in the Linux kernel in late 2016).</p></blockquote><p><br/></p><blockquote><p><b>5:00 – 5:40 P.M. Data Highway Rainbow - Petabyte-Scale Event Collection, Transport, and Delivery at Yahoo</b></p><p><i>Nilam Sharma - Sr. Software Engineer</i></p><p><i>Huibing Yin - Sr. Software Engineer</i></p><p>This talk presents the architecture and features of Data Highway Rainbow, Yahoo’s hosted multi-tenant infrastructure, which offers event collection, transport, and aggregated delivery as a service. Data Highway supports collection from multiple data centers and aggregated delivery into the primary Yahoo data centers that host our big data computing clusters. On the delivery side, Data Highway supports endpoints/sinks such as HDFS, Storm, and Kafka, with the Storm and Kafka endpoints tailored toward latency-sensitive consumers.</p></blockquote><p><br/></p><p><b>DAY 2. WEDNESDAY June 14, 2017</b></p><p><br/></p><blockquote><p><b>9:05 - 9:15 A.M. Yahoo General Session - Shaping Data Platform for Lasting Value</b></p><p><i>Sumeet Singh – Sr.
Director, Products</i></p><p>With a long history of open innovation with Hadoop, Yahoo continues to invest in and expand the platform’s capabilities, pushing the boundaries of what it can accomplish for the entire organization. In the last 11 years (yes, it is that old!), the Hadoop platform has shown no signs of giving up or giving in. In this talk, we explore what makes the shared multi-tenant Hadoop platform at Yahoo so special.</p></blockquote><p><br/></p><blockquote><p><b>12:20 - 1:00 P.M. CaffeOnSpark Update - Recent Enhancements and Use Cases</b></p><p><i>Mridul Jain - Sr. Principal Engineer</i></p><p><i>Jun Shi - Principal Engineer</i></p><p>By combining salient features of the deep learning framework Caffe with the big data frameworks Apache Spark and Apache Hadoop, CaffeOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. We released CaffeOnSpark as an open source project in early 2016, and shared its architecture design and basic usage at Hadoop Summit 2016. In this talk, we will update the audience on recent developments in CaffeOnSpark, highlighting its new features and capabilities: a unified data layer that supports multi-label datasets, distributed LSTM training, interleaved testing with training, a monitoring/profiling framework, and Docker deployment.</p></blockquote><p><br/></p><blockquote><p><b>12:20 - 1:00 P.M. Tez Shuffle Handler - Shuffling at Scale with Apache Hadoop</b></p><p><i>Jon Eagles - Principal Engineer</i></p><p><i>Kuhu Shukla - Software Engineer</i></p><p>In this talk we introduce a new shuffle handler for Tez, a YARN auxiliary service, that addresses the shortcomings and performance bottlenecks of the legacy MapReduce Shuffle Handler, the default shuffle service in Apache Tez. The Apache Tez Shuffle Handler adds composite fetch, with support for multi-partition fetch, to mitigate performance slowdowns, and provides deletion APIs to reduce disk usage for long-running Tez sessions.
As this is an emerging technology, we will outline the future roadmap for the Apache Tez Shuffle Handler and present performance evaluation results from real-world jobs at scale.</p></blockquote><p><br/></p><blockquote><p><b>2:10 - 2:50 P.M. Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes</b></p><p><i>Thiruvel Thirumoolan – Principal Engineer</i></p><p><i>Francis Liu – Sr. Principal Engineer</i></p><p>At Yahoo, HBase has been running as a hosted multi-tenant service since 2013. In a single HBase cluster we have around 30 tenants running various types of workloads (i.e., batch, near real-time, ad hoc, etc.). We will walk through the multi-tenancy features, explaining our motivation and how they work, as well as our experiences running these multi-tenant clusters. These features will be available in Apache HBase 2.0.</p></blockquote><p><br/></p><blockquote><p><b>2:10 - 2:50 P.M. Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse</b></p><p><i>Nick Huang – Director, Data Engineering, Yahoo Mail</i></p><p><i>Saurabh Dixit – Sr. Principal Engineer, Yahoo Mail</i></p><p>Since 2014, the Yahoo Mail Data Engineering team has been revamping the Mail data warehouse and analytics infrastructure to drive the continued growth and evolution of Yahoo Mail. Along the way we have built a 50 PB Hadoop warehouse, along with surrounding analytics and machine learning programs that have transformed the role data plays in Yahoo Mail. In this session we will share our experience from this three-year journey, from the system architecture and the analytics systems we built to the lessons learned during development and the drive for adoption.</p></blockquote><p><br/></p><p><b>DAY 3. THURSDAY June 15, 2017</b></p><p><br/></p><blockquote><p><b>2:10 – 2:50 P.M. OracleStore - A Highly Performant RawStore Implementation for Hive Metastore</b></p><p><i>Chris Drome - Sr.
Principal Engineer</i></p><p><i>Jin Sun - Principal Engineer</i></p><p>Today, Yahoo uses Hive in many different spaces, from ETL pipelines to ad hoc user queries. Increasingly, we are investigating the practicality of applying Hive to real-time queries, such as those generated by interactive BI reporting systems. For Hive to succeed in this space, it must be performant in all aspects of query execution, from query compilation to job execution. One such component is the interaction with the underlying database at the core of the Metastore. As an alternative to ObjectStore, we created OracleStore as a proof of concept. Freed of the restrictions imposed by DataNucleus, we were able to design a more performant database schema that better met our needs. We then implemented OracleStore with specific goals built in from the start, such as ensuring the deduplication of data. In this talk we will discuss the details behind OracleStore and the gains realized with this alternative implementation, including a 97%+ reduction in the storage footprint of multiple tables, as well as query performance that is 13x faster than ObjectStore with DirectSQL and 46x faster than ObjectStore without DirectSQL.</p></blockquote><p><br/></p><blockquote><p><b>3:00 P.M. - 3:40 P.M. Bullet - A Real Time Data Query Engine</b></p><p><i>Akshai Sarma - Sr. Software Engineer</i></p><p><i>Michael Natkovich - Director, Engineering</i></p><p>Bullet is an open-sourced, lightweight, pluggable querying system for streaming data, implemented on top of Storm, with no persistence layer. It allows you to filter, project, and aggregate on data in transit, and it includes a UI and web service. Instead of running queries on a finite set of data that arrived and was persisted, or running a static query defined at the startup of the stream, Bullet queries execute against an arbitrary set of data arriving after the query is submitted. In other words, it is a look-forward system.
Bullet is a multi-tenant system that scales independently of the data consumed and the number of simultaneous queries. Bullet is pluggable into any streaming data source and can be configured to read from systems such as Storm, Kafka, Spark, Flume, etc. Bullet leverages Sketches to perform its aggregate operations, such as distinct, count distinct, sum, count, min, max, and average.</p></blockquote><p><br/></p><blockquote><p><b>3:00 P.M. - 3:40 P.M. Yahoo - Moving Beyond Running 100% of Apache Pig Jobs on Apache Tez</b></p><p><i>Rohini Palaniswamy - Sr. Principal Engineer</i></p><p>Last year at Yahoo, we invested significant effort in scaling and stabilizing Pig on Tez and making it production-ready, and by the end of the year we retired running Pig jobs on MapReduce. This talk will detail the performance and resource utilization improvements Yahoo achieved after migrating all Pig jobs to run on Tez. After the successful migration, we shifted our focus to addressing the bottlenecks we had identified and to new optimization ideas for making it go even faster. We will go over the new features and work done in Tez to make that happen, such as the custom YARN shuffle handler, reworked DAG scheduling order, and serialization changes. We will also cover exciting new features added to Pig for performance, such as bloom join and bytecode generation.</p></blockquote><p><br/></p><blockquote><p><b>4:10 P.M. - 4:50 P.M. Leveraging Docker for Hadoop Build Automation and Big Data Stack Provisioning</b></p><p><i>Evans Ye - Software Engineer</i></p><p>Apache Bigtop, an open source Hadoop distribution, focuses on developing packaging, testing, and deployment solutions that help infrastructure engineers build their own customized big data platforms as easily as possible. However, packages deployed in production require a solid CI testing framework to ensure their quality, and the many Hadoop components must be verified to work together as well.
In this presentation, we’ll talk about how Bigtop delivers its containerized CI framework, which can be directly replicated by Bigtop users. At its core are the newly developed Docker Provisioner, which leverages Docker for Hadoop deployment, and the Docker Sandbox, which lets developers quickly start a big data stack. This talk covers the containerized CI framework, the technical details of the Docker Provisioner and Docker Sandbox, the hierarchy of Docker images we designed, and several components we developed, such as the Bigtop Toolchain, to achieve build automation.</p></blockquote><p><b>Register <a href="https://dataworkssummit.com/san-jose-2017/attend/" target="_blank">here</a> for Hadoop Summit, San Jose, California with the 20% discount code YAHOO20.</b></p><p>Questions? Feel free to reach out to us at bigdata@yahoo-inc.com. Hope to see you there!</p>