Agenda
8:00 - 9:00 Registration
8:00 - 9:00 Continental Breakfast: Sponsored by Cloudera
All Day Wireless Internet Access: Sponsored by DataStax
9:00 - 9:15 Hadoop at Yahoo!
Jay Rossiter, SVP, Cloud Platform Group, Yahoo!
9:15 - 9:45 5 Years of Hadoop - Lessons learned and plans forward
Eric Baldeschwieler, CEO, Hortonworks
9:45 - 10:15 IBM Watson, Big Data, Hadoop and Ecosystem: An Enterprise Perspective
Anant Jhingran, CTO, Information Management, IBM
10:15 - 10:30 Coffee Break: Sponsored by Datameer
10:30 - 11:00 The Facebook Messages infrastructure - built on HBase, HDFS and MapReduce
Karthik Ranganathan, Facebook
11:00 - 11:30 Crossing the Chasm: Hadoop for the Enterprise
Sanjay Radia, Hortonworks
11:30 - 11:45 The Hadoopler – Powered by NetApp’s long track record of Open Source innovation
Val Bercovici, NetApp (Sponsored Talk)
11:45 - 12:00 The Next Generation of MapReduce Analytics Powered by Apache Hadoop
John Schroeder, CEO and Co-founder, MapR Technologies (Sponsored Talk)
12:00 - 1:15 Lunch Break: Sponsored by NetApp
Afternoon sessions run in three parallel tracks: the Community Track, the Operations and Experience Track, and the Applications and Research Track.

1:15 - 1:45
Community Track: New HBase Features: Coprocessors and Security
Gary Helmling, Trend Micro
Operations and Experience Track: Web Crawl Cache - Using HBase to Manage a Copy of the Web
Yoram Arnon, Yahoo!
Applications and Research Track: Hadoop on a Personal Supercomputer
Paul Dingman, Pervasive Software

1:45 - 2:15
Community Track: Next Generation Apache Hadoop MapReduce
Arun Murthy, Hortonworks
Operations and Experience Track: Data Freeway : Scaling Out Realtime Data
Sam Rash and Eric Hwang, Facebook
Applications and Research Track: Building Kafka and LinkedIn's Data Pipeline
Jay Kreps, LinkedIn

2:15 - 2:45
Community Track: Introducing HCatalog (Hadoop Table Manager)
Alan Gates, Hortonworks
Operations and Experience Track: Automated Rolling OS Upgrades for Yahoo! Hadoop Grids
Dan Romike, Yahoo!
Applications and Research Track: RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows
Hyunjung Park, Robert Ikeda and Jennifer Widom, Stanford University

2:45 - 3:15
Community Track: Spark: In-Memory Cluster Computing for Iterative and Interactive Applications
Matei Zaharia, Mosharaf Chowdhury, and Others, UC Berkeley
Operations and Experience Track: Data Management on Hadoop Clusters
Seetharam Venkatesh, Yahoo!
Applications and Research Track: Brisk: Truly peer-to-peer Hadoop
Jonathan Ellis, DataStax

3:15 - 3:30 Coffee Break: Sponsored by AsterData

3:30 - 4:00
Community Track: Join Strategies in Hive
Liyin Tang and Yongqiang He, Facebook
Operations and Experience Track: Playing Well With Others: Managing a Multitenant Hadoop Cluster
Eric Sammer, Cloudera
Applications and Research Track: Query Suggestion at scale with Hadoop
Gyanit Singh, Nish Parikh and Neel Sundaresan, eBay

4:00 - 4:30
Community Track: Hadoop Scale Pub/Sub with Hedwig
Utkarsh Srivastava, Twitter
Operations and Experience Track: Data Integrity and Availability of HDFS
Rob Chansler, Yahoo!
Applications and Research Track: Large Scale Math with Hadoop MapReduce
Tsz-Wo Sze, Hortonworks

4:30 - 5:00
Community Track: HDFS Federation and Other Features
Suresh Srinivas, Hortonworks
Operations and Experience Track: Using a Hadoop data pipeline to build a graph of users and content at CBSi
Bill Graham, CBS Interactive
Applications and Research Track: Petabyte Scale Device Support with Hadoop, Hive, and HBase
Ron Bodkin and Scott Fleming, Think Big Analytics and Kumar Palaniappan, NetApp

5:00 - 5:30
Community Track: Design, Scale and Performance of MapR's Distribution for Apache Hadoop
M. C. Srivas, MapR Technologies
Operations and Experience Track: Case Studies of Hadoop Operations at Yahoo!
Charles Wimmer, Yahoo!
Applications and Research Track: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data
Ed Kohlwey, Jesse Yates and Jason Trost, Booz Allen Hamilton

5:30 - 6:00
Community Track: Oozie : Scheduling Workflows on the Grid
Mohammad Islam, Yahoo!
Operations and Experience Track: Tying the threads together: Hadoop performance monitoring and diagnosis
Henry Robinson, Cloudera
Applications and Research Track: Giraph : Large-scale graph processing on Hadoop
Avery Ching, Yahoo! and Christian Kunz, Jybe
6:00 - 8:00 Evening Reception: Sponsored by MapR Technologies


Abstracts

Hadoop at Yahoo!

Jay Rossiter, SVP, Cloud Platform Group, Yahoo!

Behind every click across Yahoo!, there is a global cloud computing infrastructure, heavily reliant on Hadoop. We see the magic in crunching unimaginable volumes of data to create increasingly relevant Internet experiences for people around the world – and Hadoop is the technology behind that magic.

Jay Rossiter is Senior Vice President of the Cloud Platform Group at Yahoo!. In this role, Jay leads the team that has built one of the world’s largest private cloud systems that powers every click on the Yahoo! network. He has successfully led the creation of critical software platforms and is delivering the next generation of cloud services that redefines how Yahoo! leverages technology and science at scale to enable rapid innovation, sustainable product differentiation and new monetization opportunities in the marketplace.

Jay is a veteran of the computer industry having worked at organizations such as Bell Laboratories, 3COM Corporation and Oracle Corporation. Prior to joining Yahoo!, Jay was Vice President of the System Management Products Division at Oracle where he built Oracle’s $400M application and system management business from scratch – nearly quadrupling revenues over 4 years.

At Oracle, he was a member of the CEO’s weekly Product Planning and Strategy Committee, and was instrumental in the development of Oracle’s Grid and Fusion strategies. He was also responsible for the overall delivery of the Oracle9i database and for many of Oracle's core networking and security solutions.

Jay holds a Bachelor’s degree in Mathematics from SUNY Binghamton and a Master’s degree in Computer Science from the University of Michigan.

Top

5 Years of Hadoop - Lessons learned and plans forward

Eric Baldeschwieler, CEO, Hortonworks

A review of the last 5 years of Hadoop development and the current state of Hadoop at Yahoo and in the wider community, followed by a discussion of lessons learned and our thinking about future Hadoop improvements.

Eric Baldeschwieler is Vice President of Hadoop at Yahoo!. Eric leads the development team responsible for designing and deploying Hadoop across Yahoo!’s global infrastructure, the world’s largest implementation of the technology. A team at Yahoo!, including Eric and Doug Cutting, initiated the Apache Hadoop project in 2005 and has taken Hadoop from a 20-node prototype to a 40,000-node production deployment.

Top

IBM Watson, Big Data, Hadoop and Ecosystem: An Enterprise Perspective

Anant Jhingran, CTO, Information Management, IBM

Anant Jhingran will talk about the role of Big Data in large enterprises and IBM's support of the ecosystem. To set the context, he will begin by talking about Watson -- IBM's Jeopardy Q&A system -- how it works, and the role of Big Data/Hadoop as the knowledge processing platform at its heart. Beyond Watson and its applicability within enterprises to problems different from Jeopardy's, he will also give examples of how Big Data without Watson is being leveraged by the clients IBM works with. These problems are sometimes Big, but bigness is not the only dimension, and Anant will describe some of the other dimensions that are critical for enterprise clients. Finally, Anant will stress the importance of avoiding a fork of the open source Hadoop communities, or of code and distros, and the vigilance that is needed at this stage of the technology's evolution.

Dr. Anant Jhingran is an IBM Fellow, VP and CTO for IBM's Information Management Division. He is deeply involved in IBM's middleware in the cloud (including co-chairmanship of IBM's Cloud Computing Architecture Board) and in the technical strategy of products and solutions in databases, content management, business intelligence and information integration. Prior to this role, Dr. Jhingran led the IBM team designing and building solutions to meet the requirements of business analytics on structured and unstructured data. He has also been the director of Computer Science at the IBM Almaden Research Center and, before that, senior manager for e-Commerce and data management at IBM's T. J. Watson Research Center. He received his Ph.D. from UC Berkeley in 1990.

Top

The Facebook Messages infrastructure - built on HBase, HDFS and MapReduce

Karthik Ranganathan, Facebook

The new version of Facebook Messages combines messages, email, chat & SMS into one real-time conversation. Messages are by nature a write-dominated workload, and combining the different streams of messages only makes it more so. At Facebook, we picked HBase as the storage technology to power the new version of Facebook Messages. We use HBase in a real-time fashion to store and serve data for the Facebook messaging product. HBase is built on top of HDFS, and we have had to put many enhancements into HDFS to serve this real-time workload. We also used MapReduce extensively to migrate all the existing data from the existing databases into HBase and transform it into the desired format. This talk is about our experiences with and usage of HBase, HDFS and MapReduce at Facebook for Messages. The topics I am going to cover include:
- why we picked Apache HBase as the underlying storage technology
- a high level architecture of Facebook messages
- our work with and contributions to HBase and HDFS
- operational experience & lessons learned running HBase and MapReduce in production at large scale
This talk should be of interest to people planning to use HBase in some capacity, those already running HBase in production, those interested in large-scale data migrations using MapReduce, and, more generally, anyone interested in the challenges of operating at very large scale in production.
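As a rough illustration of the kind of write path described above, here is a minimal sketch of storing one message in HBase using the 0.90-era Java client. The table name, column family, and row-key layout are hypothetical, not Facebook's actual schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class MessageWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "messages");  // hypothetical table name
    // Hypothetical row key: user id plus a reversed timestamp so the newest message sorts first.
    byte[] rowKey = Bytes.add(Bytes.toBytes("user42:"),
        Bytes.toBytes(Long.MAX_VALUE - System.currentTimeMillis()));
    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("msg"), Bytes.toBytes("body"), Bytes.toBytes("hello from HBase"));
    table.put(put);   // the write goes to the WAL and MemStore; HDFS files are written on flush
    table.close();
  }
}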

Top

Crossing the Chasm: Hadoop for the Enterprise

Sanjay Radia, Hortonworks

Since its inception, Hadoop has grown rapidly at Yahoo and elsewhere. In the early days we were charting new territory on several fronts, in terms of abstractions, APIs, scale, etc. The Hadoop project made a number of choices that helped it gain acceptance, evolve and build a reasonable customer base. Many of those choices and tradeoffs were appropriate at a time when Hadoop was a small project, with a very small team, used by forward-looking customers who needed a new platform to solve their problems. Over the last couple of years, Hadoop’s customer base has grown dramatically, and its expectations are dramatically different from those of the early adopters.

This talk examines the approaches the Hadoop team chose for compatibility, scalability, availability, data integrity, and quality/testing in terms of how well they served us during the early days, some of the problems we face currently, and the changes we have made and are making to cross the chasm we are facing.

Top

The Hadoopler – Powered by NetApp’s long track record of Open Source innovation

Val Bercovici, NetApp

Big Data is not a vendor term. Based on our distinguished experience contributing to mainstream open source projects, NetApp believes Big Data originated in open source. After hardening the existing NFS (v3) subsystem of the Linux Kernel, driving the innovation of pNFS 4.1 on Linux and contributing a production-quality hypervisor to the BSD kernel, NetApp looks forward to our relationship with the Hadoop Community via the Apache Software Foundation. Val will introduce the NetApp project affectionately known as the Hadoopler, which is our first solution for the Hadoop market. Come see Val describe how NetApp makes Big Data as simple as (A)nalytics, (B)andwidth and (C)ontent.

Top

The Next Generation of MapReduce Analytics Powered by Apache Hadoop

John Schroeder, CEO and Co-founder, MapR Technologies

As Hadoop adoption continues to explode, there is a growing number of commercial vendors joining the Apache community and innovating around Hadoop. Organizations are looking for the next set of innovations to expand the depth and breadth of Hadoop applications. This lightning talk provides a brief overview of innovations around ease of use, dependability, and performance, and includes specific MapR examples of NFS support for streaming analytics, high availability, and snapshots for data protection.

Top

New HBase Features: Coprocessors and Security

Gary Helmling, Trend Micro

Among the latest HBase features are two major changes in how you can interact with stored data: coprocessors and security. First we will explore the coprocessor framework, which allows user code to run directly in the HBase processes, providing efficient distributed processing over local data and the ability to extend HBase functionality in new ways.

Next we will look at HBase security, built using coprocessors, both as an example of using the new framework and as a feature in its own right. HBase security extends the Hadoop security framework to provide strong authentication and access control over HBase tabular data. Combined with secure HDFS and MapReduce, this enables a complete solution for multi-tenancy and data isolation in HBase applications.
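For a sense of what running user code inside the region server looks like, here is a minimal sketch of a region observer, assuming the 0.92-era coprocessor API (method signatures changed in later releases). It is illustrative only, not the security coprocessor itself.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

// Runs inside each RegionServer and is invoked on the write path, before a Put is applied.
public class AuditingObserver extends BaseRegionObserver {
  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    // A real security coprocessor would check the caller's permissions here and
    // reject the operation if the check fails; this sketch just logs the row being written.
    System.out.println("prePut on row " + new String(put.getRow()));
  }
}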

Top

Web Crawl Cache - Using HBase to Manage a Copy of the Web

Yoram Arnon, Yahoo!

Understanding the value of a repository of the entire world wide web, Yahoo and Microsoft agreed that as part of the Search transition deal, Bing will provide Yahoo with a copy of all the web documents that it crawls, with associated metadata, to allow Yahoo to retain this value once it stops running its own Search crawling operations.

This data can serve to augment the Search experience, identify entities on the web that are important or trending, build applications that rely on a comprehensive view of the web, and support research in linguistics, social networks and many other areas. To support this business case, Web Crawl Cache fetches a stream of data from Microsoft consisting of all of Bing's crawled documents and associated metadata; it processes it, cleaning it up, removing duplicates and enriching it, then serves it up to applications throughout Yahoo on demand.

WCC consists of 1700 nodes, each with two quad-core CPUs, 6x2TB disks and 24GB of RAM. They're set up as two clusters, each running HDFS, MapReduce and HBase. One is used for fetching and basic processing, the other for lower-frequency, higher-intensity processing. It's the largest HBase cluster in the world today. The size of our HBase table is 2PB. The high data rate going through the system prompted us to make some non-standard choices in how we write, process and access the data.

The presentation will describe our setup, the reasoning behind our choice of Hadoop and HBase, and the design choices we made.

Top

Hadoop on a Personal Supercomputer

Paul Dingman, Pervasive Software

Hadoop is proven as a scalable and distributed execution engine for running data intensive applications on clusters of commodity servers. Scalability is based on large numbers of nodes each with a small number of processor cores and a small number of hard disk spindles.

We are interested in the potential advantages of task parallelism with Hadoop on single many-core machines with large disk arrays. Can Hadoop running on one, or a few, fat nodes perform as well or better than Hadoop running on a cluster with a comparable number of cores and disks?

In this presentation we compare the performance of 48-core AMD (Magny-Cours) systems with an Amazon EC2 cluster using some simple benchmarks. We also look at the different aspects of the system hardware and the tuning of both Linux and Hadoop.

Our results show that many-core systems can deliver faster results, be more cost effective, and still leverage the reliability/resiliency provided by Hadoop. Additionally, there are opportunities to exploit intra-server parallelism to reduce task communication and coordination overhead.

Top

Next Generation Apache Hadoop MapReduce

Arun Murthy, Hortonworks

The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines. We are developing the next generation of Apache Hadoop MapReduce, which factors the framework into a generic resource scheduler and a per-job, user-defined component that manages the application's execution. Since downtime is more expensive at scale, high availability is built in from the beginning, as are security and multi-tenancy to support many users on the larger clusters. The new architecture will also increase innovation, agility and hardware utilization.

Top

Data Freeway : Scaling Out Realtime Data

Sam Rash and Eric Hwang, Facebook

Data Freeway is Facebook's system for data transport. It delivers hundreds of terabytes of data per day to our warehouse with peak transfer rates of up to 5 GB/s. At the core of this system is HDFS. We have built tools on top of HDFS for efficiently and conveniently getting data in and out of HDFS for batch and near-realtime applications. Our distributed logging system, consisting of Scribe/Calligraphus, enables us to quickly and reliably import data into this HDFS framework. ZooKeeper integration with Calligraphus has allowed us to scale out our system for automatic failover and dynamic load balancing. ZooKeeper serves the role of a real-time source of truth for task allocation via leader elections and administrative policy.

We have made various modifications to HDFS to accommodate our real-time use case. These comprise two key features: sync and concurrent reader support. The former allows a writer to guarantee that all data written before a sync call has been persisted to HDFS. The latter enables readers to read data from a block that is simultaneously being written.
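A minimal sketch (not Facebook's code) of the sync pattern just described, using the Hadoop 0.20-era FileSystem API; the path is hypothetical, and later releases expose the same guarantee as hflush/hsync.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncingLogWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/logs/category/current"));  // hypothetical path
    out.writeBytes("event-1\n");
    // After sync() returns, the bytes written so far are durable and visible
    // to a concurrent reader tailing the same, still-open file.
    out.sync();
    out.close();
  }
}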

On the client side, we have the PTail utility that enables users to view our entire set of HDFS clusters as a single set of category streams. Users are able to specify a category name and obtain a single data stream for that category, hiding away the numerous HDFS clusters and files used to store that data. Additionally, it supports a ‘follow’ mode in the spirit of the unix command: “tail -f”. For larger data streams, PTail has the ability to shard the incoming data into multiple streams for parallel processing. PTail also provides check-pointing capability while streaming data. This allows clients to resume from a fixed point in the event of failures.

The final leg of our data pipeline is the Continuous Copier. We use MapReduce on our warehouse cluster to ensure a specified number of copier tasks exist at any given moment. Each task continuously scans the source HDFS systems for new files and copies them to the warehouse cluster as data streams in. This functions very similarly to how PTail provides direct access to streaming data.

We have recently delved into opening our infrastructure to provide real-time metrics. This part of Data Freeway leverages PTail and performs in-memory computation of counts and uniques. HBase is used in these cases to provide the underlying persistence.

The proposed presentation will give an overview of Data Freeway at Facebook, and go into the details of our HDFS modifications, as well as our Zookeeper usage. It will also cover PTail and our Continuous Copier system.

Top

Building Kafka and LinkedIn's Data Pipeline

Jay Kreps, LinkedIn

Modern web sites are incorporating feeds of user activity data directly into their site features. These features vary from "likes", to "news feeds", to consumer-facing reporting on activity data, to security limitations, to targeting and relevance capabilities. The key ingredient common across all of these features is that they directly incorporate high-throughput activity data. These uses co-exist with traditional business reporting and analytics use cases for the same data. Since the volume of this kind of data tends to be orders of magnitude higher than that of traditional data sources, this creates interesting opportunities for data infrastructure to help unify these uses.

LinkedIn has built a unified pipeline to cover data loads in Hadoop as well as real-time data and stream processing for applications including monitoring, relevance, security, and user-facing features.

The key technology behind this is a newly open-sourced system, Kafka, which handles distributed stream management. It supports the normal parallel ingestion into Hadoop that would be handled by any log aggregation system. But in addition, it can partition streams by key, allowing more complex real-time stream aggregation and processing in a map/reduce-like style for applications that demand very low latency.
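A generic sketch of the partition-by-key idea; this is not Kafka's API, just an illustration of how hashing the message key routes every event for that key to the same partition, so a single downstream consumer sees the full per-key stream.

public class KeyPartitioner {
  private final int numPartitions;

  public KeyPartitioner(int numPartitions) {
    this.numPartitions = numPartitions;
  }

  // The same key always maps to the same partition, which is what makes
  // low-latency per-key aggregation possible downstream.
  public int partitionFor(String key) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    KeyPartitioner p = new KeyPartitioner(8);
    System.out.println(p.partitionFor("member:12345"));  // hypothetical key
  }
}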

This talk will cover the architecture of Kafka and give a practical overview of building a Kafka-based data pipeline at LinkedIn. This will include details on all aspects of building such a pipeline, ranging from Kafka's consumer balancing algorithms, to Avro serialization and message schema management and evolution, to the key elements of a high-performance Hadoop load process, as well as comparisons to other solutions we have tried.

Top

Introducing HCatalog (Hadoop Table Manager)

Alan Gates, Hortonworks

HCatalog, currently an Apache Incubator project, is a table management and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, Hive, Streaming – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, sequence files.

This talk will give an overview of HCatalog and cover the release plan and roadmap.

Top

Automated Rolling OS Upgrades for Yahoo! Hadoop Grids

Dan Romike, Yahoo!

Yahoo! first deployed Hadoop on a small set of commodity systems. Today that number has exploded while the support team remains relatively unchanged. We are keeping pace with new Hadoop releases, additional data centers, new grids, and new products; however, we also need to address hardware failures, OS upgrades, security patches, infrastructure changes, and ongoing maintenance. Thus we have an immediate and urgent need to automate standard maintenance. This automation must run quietly, without intervention, performing tedious critical tasks in a secured environment without impacting Hadoop jobs, data, or stability. Yahoo! Grid Ops has taken the necessary steps to bridge the workload gap by using swimlane workflows with code generation. These workflows are expressed logically, then converted to scripts and deployed in a Web-based harness or command-line interface. The first project launched was a ‘rolling OS and security upgrade’ on ten thousand nodes.

This presentation will detail how we are upgrading thousands of servers, the problems of system state management, and the operational workflows specific to a Hadoop grid environment.

Top

RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows

Hyunjung Park, Robert Ikeda and Jennifer Widom, Stanford University

Debugging MapReduce workflows can be a difficult task: their execution is batch-oriented and, once completed, leaves only the data sets themselves to help in the debugging process. Data provenance, which captures how data elements are processed through the workflow, can aid in debugging by enabling backward tracing: finding the input subsets that contributed to a given output element. For example, erroneous input elements or processing functions may be discovered by backward-tracing suspicious output elements. Provenance and backward tracing also can be useful for drilling-down to learn more about interesting or unusual output elements.

RAMP (Reduce And Map Provenance) is an extension to Hadoop that supports provenance capture and tracing for MapReduce workflows. RAMP captures fine-grained provenance by wrapping the RecordReader, Mapper, Combiner, Reducer, and RecordWriter. This wrapper-based approach is transparent to Hadoop, retaining Hadoop’s parallel execution and fault tolerance. Furthermore, in many cases users need not be aware of provenance capture while writing MapReduce jobs--wrapping is automatic, and RAMP stores provenance separately from the input and output data. Our performance experiments show that RAMP imposes reasonable time and space overhead during provenance capture and enables efficient backward tracing without requiring special indexing of provenance information.
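A toy word-count mapper (not RAMP's implementation) illustrating the wrapping idea: every output record carries the byte offset of the input record that produced it, so outputs can later be traced backward to their inputs.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProvenanceTaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String token : line.toString().split("\\s+")) {
      if (token.isEmpty()) {
        continue;
      }
      // The value carries the "real" output (a count of 1) plus a provenance
      // annotation: the offset of the input record within its split.
      context.write(new Text(token), new Text("1\tprov=" + offset.get()));
    }
  }
}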

[1] H. Park, R. Ikeda, and J. Widom. RAMP: A System for Capturing and Tracing Provenance in MapReduce Workflows. Technical Report, March 2011.
[2] R. Ikeda, H. Park, and J. Widom. Provenance for Generalized Map and Reduce Workflows. Proceedings of the Fifth Biennial Conference on Innovative Data Systems Research (CIDR), January 2011.

Top

Spark: In-Memory Cluster Computing for Iterative and Interactive Applications

Matei Zaharia, Mosharaf Chowdhury, and Others, UC Berkeley

MapReduce and its variants have been highly successful in supporting large-scale data-intensive applications. However, these systems are based on an acyclic data flow model that does not capture other important use cases. We present Spark, a new cluster computing framework motivated by one such class of use cases: applications that reuse a working set of data in multiple parallel operations. This includes many iterative machine learning and graph algorithms, as well as interactive data mining tools. Spark augments the data flow model with fault-tolerant, in-memory distributed datasets to efficiently support applications with working sets, leading to up to 20x better performance than Hadoop. Spark also makes developing jobs easy by integrating into the Scala programming language. Finally, Spark's ability to load a dataset into memory and query it repeatedly makes it especially suitable for interactive analysis of big datasets. We have modified the Scala interpreter to make it possible to use Spark interactively, allowing users to query multi-gigabyte datasets with sub-second latency.

Spark is being used in production at Conviva to run analytics reports on Hive data 40x faster than the previous Hive implementation, and at UC Berkeley to run large-scale spam filtering and traffic prediction algorithms. The system is open source under a BSD license. Spark is compatible with all Hadoop input and output formats (HDFS, HBase, etc), and can be run alongside Hadoop through the Mesos resource manager to complement MapReduce as a data analysis tool.

Top

Data Management on Hadoop Clusters

Seetharam Venkatesh, Yahoo!

Yahoo services in all areas – advertising, Web search, content – depend for their quality on analysis of data fed back from the serving systems: for example, logs from Web servers and other front-end as well as back-end systems. This is joined with dimensional data along with the catalogs and indexes used to serve ads, search results, and customized content.

The Hadoop Cluster makes orders of magnitude more of this data available in one place, together with extensive computation resources. This enables teams to continuously optimize and improve serving quality. As the number of data sources and data sets to be mirrored on the Hadoop Clusters has increased, the need for a common platform offering robust automation has become apparent. This data is collected and brought to the Hadoop Cluster through a multi-step pipeline. Our goal is to make sure this data reaches the Hadoop Cluster in accordance with committed SLAs for latency and fidelity.

The Data Management solution will automate the movement (Data In, Out, & Copy) and lifecycle management of data (Retention, Anonymization, Compliance Archival, etc.) on the Yahoo Hadoop Clusters. The solution addresses the problem of loading thousands of distinct data sets onto a growing number of clusters in multiple data centers. It meets latency and data quality SLAs while requiring minimal operational staff, and it scales with Hadoop. It gives the vast majority of Hadoop Cluster users, who depend on regular data availability, increased reliability.

Top

Brisk: Truly peer-to-peer Hadoop

Srisatish Ambati, Jonathan Ellis and Ben Werther, DataStax

Brisk is an open-source Hadoop & Hive distribution that uses Apache Cassandra for its core services and storage. Brisk makes it possible to run Hadoop MapReduce on top of CassandraFS, an HDFS-compatible storage layer. By replacing HDFS with CassandraFS, users leverage MapReduce jobs on Cassandra’s peer-to-peer, fault-tolerant and scalable architecture.

With CassandraFS all nodes are peers. Data files can be loaded through any node in the cluster and any node can serve as the JobTracker for MapReduce jobs. Hive MetaStore is stored & accessed as just another column family (table) on the distributed data store. Brisk makes Hadoop truly peer-to-peer.

We demonstrate visualisation & monitoring of Brisk using OpsCenter. The operational simplicity of Cassandra's multi-datacenter & multi-region aware replication makes Brisk well-suited to a rich set of applications and use cases. And by being able to store and isolate HDFS & online data within the same cluster, Brisk makes analytics possible without ETL!

Top

Join Strategies in Hive

Liyin Tang and Yongqiang He, Facebook

Hive’s join operations consume the biggest chunk of the Facebook data warehouse cluster’s resources. In many cases, big join jobs make the cluster unstable because they occupy too many resources. This presentation examines the join optimizations that have been added to Hive and proven practical. The first is the stream join, which automatically identifies the most efficient join order. The second is the map join technique, which avoids reducers completely. The basic idea is that if there is only one big table and all other tables are small enough to fit into memory, Hive can launch a map-only job in which each mapper copies over all the small tables. After deployment, 20% of all join queries running in Facebook’s warehouse have been converted to map joins, and we saw a 57% to 163% performance improvement from the map join technique. In order to choose a map join automatically, Hive added a number of techniques to make this automatic conversion possible. This presentation will go through these techniques and the lessons we have learned from them. To convert more cases to map joins, Hive added some advanced join techniques, such as the bucket map join and the bucket sort merge join. The bucket sort merge join is the essential technique used for incrementally scraping data at Facebook. As another example, the execution time of one large critical join job has been reduced from more than 6 hours to 1 hour by applying these optimizations. This presentation will go through the pros and cons of these techniques. It will also introduce the skew join technique, which helps with cases of extremely uneven data distributions. We believe all of these techniques and experiences can benefit many other Hadoop applications.
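To make the map join idea concrete, here is a hand-rolled sketch of a map-side join in plain MapReduce (not Hive's implementation): the small table is loaded into memory in each mapper, so no reduce phase is needed. The file name and tab-separated layout are assumptions for the example.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private final Map<String, String> smallTable = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Hypothetical local copy of the small table, e.g. shipped to each node via the DistributedCache.
    BufferedReader in = new BufferedReader(new FileReader("small_table.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] kv = line.split("\t", 2);
      if (kv.length == 2) {
        smallTable.put(kv[0], kv[1]);
      }
    }
    in.close();
  }

  @Override
  protected void map(LongWritable key, Text bigRow, Context context)
      throws IOException, InterruptedException {
    // Join each row of the big table against the in-memory copy of the small table.
    String[] cols = bigRow.toString().split("\t", 2);
    if (cols.length < 2) {
      return;
    }
    String match = smallTable.get(cols[0]);  // join key is the first column
    if (match != null) {
      context.write(new Text(cols[0] + "\t" + cols[1] + "\t" + match), NullWritable.get());
    }
  }
}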

Top

Playing Well With Others: Managing a Multitenant Hadoop Cluster

Eric Sammer, Cloudera

As Hadoop moves from niche uses within companies to a mainstream, company wide resource serving disparate groups with varying SLAs, many users struggle to understand how to effectively share a single large cluster. In this talk, we address the key concerns and best practices around data organization and sharing, identity management and access control, cluster usage monitoring and attribution, and scheduling.

This presentation highlights both the advantages and the common stumbling blocks of transitioning from a single-user or single-group model to a shared, multitenant cluster environment. Questions about how to effectively integrate with existing identity management and authentication infrastructure, and how to properly use HDFS permissions to enforce data access policies, are extremely common for users new to Hadoop as an IT resource. Many users also do not see how task scheduling can affect SLA compliance, or the importance of schedulers like the Capacity Scheduler and Fair Scheduler.

Cluster- and job-level monitoring can identify improper utilization of the available resources and aid in capacity planning. Proactive monitoring can also identify trends that would become problems in the future if left unchecked. For many of the users we see in the wild, this can reduce downtime or intradepartmental resource starvation.

Top

Query Suggestion at scale with Hadoop

Gyanit Singh, Nish Parikh and Neel Sundaresan, eBay

Many internet companies collect several terabytes of data as clickstream and query logs. At these companies, the quality of various products depends on the analysis and processing of these large data sets. Hadoop is one of the leading technologies that has made it possible to unlock this potential at scale. Query suggestion, one such problem, is presented in this talk. Query suggestion is an integral part of every search engine; it helps users narrow or broaden their searches. In the context of e-commerce search engines, dynamic inventory combined with a long and sparse tail of the query distribution poses unique challenges for building a query suggestion method. Noisy and biased data also affect the performance of the algorithm.

In this talk we present the mechanism used to generate our query suggestion module (Hasan et al., WSDM '11). We describe how Hadoop enabled us to tackle the above problems quickly and at scale. We also describe the performance of, and the lessons learned from, exposing our query suggestion module to a vibrant community of millions of users.

Top

Hadoop Scale Pub/Sub with Hedwig

Utkarsh Srivastava, Twitter

Topic-based publish/subscribe is a common primitive in large distributed systems. Messages are published to a topic by a number of publishers and are received in a defined order by subscribers of that topic. Hedwig follows the Hadoop philosophy of building scalable, fault-tolerant systems on commodity hardware. By building on ZooKeeper, a reliable coordination service, and BookKeeper, a distributed write-ahead log, we were able to remove any single point of failure in Hedwig. Hedwig and BookKeeper are part of the new BookKeeper subproject of ZooKeeper. Hedwig was originally designed in research to address the problem of cross-datacenter maintenance of the replicated tablets of PNUTS, a key/value store similar to HBase.

In this presentation we will motivate the design of Hedwig. For example, PNUTS runs across datacenters, so we need to enable low-latency message publication even when not all datacenters are available or some datacenters are slow. PNUTS also has a large number of tablets, so we need to scale to hundreds of thousands of topics. Not only does this mean that our Hedwig message brokers need good throughput, but we also need to be able to scale horizontally using tens to hundreds of brokers.

We will explain how we meet these requirements by walking through the design of Hedwig and its API. In particular, we will review the durability and ordering guarantees of Hedwig and show how it deals with failures. Finally, we will show some performance numbers to give an idea of Hedwig's scalability. We hope this presentation will serve as an introduction to one of the newest subprojects in the Hadoop ecosystem.

Top

Data Integrity and Availability of HDFS

Rob Chansler, Yahoo!

Is the Hadoop Distributed File System a good custodian for our data? HDFS at Yahoo! manages 140 Petabytes of company assets using 40,000 hosts across two dozen clusters. Here we review how well HDFS has provided durability and availability for the fifty million new files created each day.

HDFS uses replication as a basic strategy for ensuring the durability of data. It is probable that Yahoo! has not lost any data replicated three times due to the uncorrelated failure of cluster nodes. But these carefully chosen words hide interesting behavior that is important when trying to understand the durability of data. We’ll look at how HDFS has been imperfect when caring for our data.

Availability is another dimension for judging the success of HDFS at managing enterprise-scale data stores. In 500 days, there were 60 incidents when some HDFS name space was unexpectedly unavailable. We’ll categorize those incidents, and consider strategies for reducing interruptions of service for users.

Top

Large Scale Math with Hadoop MapReduce

Tsz-Wo Sze, Hortonworks

The MapReduce model excels at many common big-data problems, particularly scans, text processing, and machine learning. However, distributed applications outside its original design space have not enjoyed comparable enthusiasm despite the wide deployment of Apache Hadoop clusters. Scientific and mathematical calculation, in particular, boast a deep treatment in distributed computing literature, but algorithms for performing large-scale calculations in MapReduce are neither widely available as libraries, nor even well studied. In this presentation, we discuss MapReduce algorithms for large-scale mathematical calculation, their implementation, and our experience in running and tuning these computations in Hadoop clusters.

Efficiently multiplying two large (terabit!) integers is one of the most elementary problems in mathematics or computer science. Beyond its direct applications, other common integer and floating point operations such as division, square root, and logarithm can be implemented using an efficient integer multiplication algorithm. We have designed MapReduce-SSA, an integer multiplication algorithm using the ideas from the Schönhage-Strassen algorithm (SSA) on MapReduce. As parts of MapReduce-SSA, two algorithms, MapReduce-FFT and MapReduce-Sum, are created for computing discrete Fourier transforms and summations, respectively. Our prototype implementation, DistMpMult, is able to efficiently multiply terabit integers on Hadoop clusters.
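The convolution identity behind FFT-based multiplication, sketched below as a reminder of the standard technique rather than the paper's exact formulation, is what MapReduce-FFT and MapReduce-Sum parallelize: the digit products form a convolution, and a DFT turns convolution into pointwise multiplication.

% Split each N-bit integer into D digits of w bits (so N = Dw):
%   x = \sum_i x_i 2^{iw},  y = \sum_j y_j 2^{jw}.
% With the digit vectors zero-padded so the cyclic convolution equals the linear one,
\[
  xy \;=\; \sum_{k} z_k \, 2^{kw},
  \qquad
  z_k \;=\; \sum_{i+j=k} x_i \, y_j
  \;=\; \mathrm{DFT}^{-1}\!\bigl(\mathrm{DFT}(x)\cdot\mathrm{DFT}(y)\bigr)_k ,
\]
% where DFT(x) denotes the transform of the digit vector (x_0, ..., x_{D-1}).
% Per the abstract, MapReduce-FFT computes the transforms and MapReduce-Sum the summations.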

Using the developing version of our MapReduce Math Library, we computed the two quadrillionth bits of pi (the mathematical constant) using Hadoop in July 2010, establishing a new world record. The news was reported extensively in the media, including BBC News, New Scientist magazine, CNN Money Tech, Communications of the ACM and Computing Now (IEEE). The world record computation was composed of 35,000 MapReduce jobs, requiring 23 days of real time and 503 years of CPU time in Yahoo! clusters. In addition to the partitioning typical of any distributed computation, we will discuss more idiosyncratic challenges, such as scheduling these jobs over a long period of time in a shared cluster of commodity hardware and assembling the result from these calculations.

* The presentation is based on the following published (or accepted for publication) papers.

[1] Tsz-Wo Sze. Schönhage-Strassen Algorithm with MapReduce for Multiplying Terabit Integers. Symbolic-Numeric Computation 2011, to appear. Preprint available at http://people.apache.org/~szetszwo/ssmr20110430.pdf

[2] Tsz-Wo Sze. The Two Quadrillionth Bit of Pi is 0! Distributed Computation of Pi with Apache Hadoop. In IEEE 2nd International Conference on Cloud Computing Technology and Science (CloudCom), pages 727-732, 2010. (Earlier versions available at http://arxiv.org/abs/1008.3171)

Top

HDFS Federation and Other Features

Suresh Srinivas, Hortonworks

Scalability of the NameNode has been a key struggle. An HDFS cluster uses a single NameNode to store the entire file system metadata in memory and to process metadata operations. Growth in cluster storage is currently handled by increasing the NameNode process size, which limits the cluster size to what can be accommodated in memory on a single NameNode. Further, with the better scalability offered by the next-generation Hadoop MapReduce, metadata operations on the single NameNode will see increased demand, making the nameservice a bottleneck for the MapReduce framework. HDFS Federation horizontally scales the nameservice using multiple federated NameNodes/namespaces. The federated NameNodes share the DataNodes in the cluster as a common storage layer. HDFS Federation also adds client-side namespaces to provide a unified view of the file system.
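As a rough sketch of how a client-side namespace could look, the snippet below sets viewfs mount points programmatically. The property names follow the ViewFs mount-table convention found in later Hadoop releases, and the NameNode URIs are hypothetical; treat this as an assumption-laden illustration rather than the feature's final configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederatedClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Client-side mount table: one unified namespace stitched from two federated namespaces.
    conf.set("fs.defaultFS", "viewfs:///");
    conf.set("fs.viewfs.mounttable.default.link./user", "hdfs://nn1.example.com:8020/user");  // assumed keys
    conf.set("fs.viewfs.mounttable.default.link./data", "hdfs://nn2.example.com:8020/data");

    FileSystem fs = FileSystem.get(conf);
    // Paths under /user resolve to nn1, paths under /data resolve to nn2.
    System.out.println(fs.exists(new Path("/user")));
  }
}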

Top

Using a Hadoop data pipeline to build a graph of users and content at CBSi

Bill Graham, CBS Interactive

Opportunity:
CBS Interactive (formerly CNET Networks) is the 9th most-visited web property on the Internet (1). We have hundreds of terabytes of user and content data from dozens of web properties, stored in various systems across the organization. By combining these datasets we can better understand how our users interact with our content.

Solution:
We will present an approach currently being developed at CBSi that aggregates activity, content, and user data from disparate systems into a single Hadoop cluster running HBase. Once the data is aggregated, Pig is used to create RDF triples representing the explicit and implicit relationships between user and content entities. The RDF triples are then loaded into a graph database where the graph can be effectively explored using SPARQL, an RDF query language.
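As an illustration of the Pig step described above, here is a hypothetical Pig UDF (not CBSi's code) that renders a (subject, predicate, object) tuple as one N-Triples line ready to load into an RDF store; the URI scheme in the comments is made up for the example.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ToNTriple extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() < 3) {
      return null;
    }
    String subject = (String) input.get(0);    // e.g. a user URI
    String predicate = (String) input.get(1);  // e.g. http://example.com/ns#viewed (hypothetical)
    String object = (String) input.get(2);     // e.g. a content-item URI
    // One triple per line, terminated by " ." as in the N-Triples format.
    return "<" + subject + "> <" + predicate + "> <" + object + "> .";
  }
}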

Benefits:
This approach combines scalable and maintainable data processing with powerful and flexible analytics. The inherent scalability characteristics of Hadoop and HBase enable us to efficiently process both high-volume bulk loads and incremental updates. In addition to enabling a wide range of more traditional Hadoop analytics using Pig, Python and Hive, we can now support RDF analytical languages like SPARQL and complex graph algorithms. Using Pig UDFs makes interacting with HBase and generating new RDF triples easy, enabling rapid experimentation and agile implementation.

Take-aways:
Hadoop MapReduce, HBase and graph databases are each optimized to support different types of use cases and access patterns. By combining these three technologies to form a data pipeline the strengths of each technology can be harnessed to build and explore graphs of relationships.

1 - Comscore, March 2011 (http://www.freshnews.com/news/484168/comscore-media-metrix-ranks-top-50-u-s-web-properties-march-2011)

Top

Petabyte Scale Device Support with Hadoop, Hive, and HBase

Ron Bodkin and Scott Fleming, Think Big Analytics and Kumar Palaniappan, NetApp

NetApp is a fast growing leader in storage technology. In order to provide timely support for customers and to allow product planning, its devices send auto-support log data to be collected and analyzed. This data volume has grown, reaching 5 TB of compressed data per week. Traditionally NetApp has been storing flat files on disk volumes and keeping summary data in relational databases. NetApp worked with Think Big Analytics to pilot the use of Hadoop for managing auto-support data. Key requirements include:
* Query data in seconds within 5 minutes of event occurrence
* Do complex ad hoc queries to investigate issues and plan
* Build models to predict support and capacity limits, to react before issues arise

In this session we look at the design to
* Collect 1000 messages of 20MB compressed per minute. This uses a fan-out configuration for Flume, reusing Perl parsers, writing large data sets into HDFS, updating HBase tables for current status, and creating cases for high priority issues. It also uses Java MapReduce jobs that process data downstream.
* Store 2 PB of incoming support events by 2015.
* Provide low-latency access to support information and configuration changes in HBase, at scale, within 5 minutes of event arrival.
* Support complex ad hoc queries that join multiple data sets, using custom UDFs to correlate JSON data. These queries benefit from partitioning and indexing in Hive and can query tens of terabytes of data.
* Operate efficiently at scale

Top

Design, Scale and Performance of MapR's Distribution for Apache Hadoop

M. C. Srivas, MapR Technologies

The talk discusses the design, scale and performance of MapR's Distribution of Apache Hadoop. We cover the following topics:
- Motivation: why build yet another Hadoop distribution?
- Distributed name-node architecture
- Transactions, consistency and reliability
- Programming model, how it affects performance
- Scalability factors and map/reduce shuffle optimizations
- Performance across a variety of loads like DFSIO read/write, Terasort, NN-Bench, YCSB-HBase, and other benchmarks

Top

Case Studies of Hadoop Operations at Yahoo!

Charles Wimmer, Yahoo!

The Grid Operations team at Yahoo! operates about 40,000 servers running Hadoop in clusters of up to 4,200 servers. Operating at this scale, we have encountered issues. These issues may be specific to large installations, but will likely apply to many Hadoop installations.

This presentation will be a series of case studies that exemplify these issues.

Each case study will include:
* How the situation was encountered
* Root cause
* Solution implemented
* Lessons learned

Possible cases for inclusion:
* LDAP as a directory service for Hadoop
* BIOS updates
* Multitenancy Hadoop clusters
* Hard disk upgrades
* Facilities issues in new datacenters

Top

Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Ed Kohlwey, Jesse Yates and Jason Trost, Booz Allen Hamilton

Secondary indexing is a common design pattern in BigTable-like databases that allows users to index one or more columns in a table. This technique enables fast search of records in a database based on a particular column instead of the row id, thus enabling relational-style semantics in a NoSQL environment. This is accomplished by representing the index either in a reserved namespace in the table or another index table. Despite the fact that this is a common design pattern in BigTable-based applications, most implementations of this practice to date have been tightly coupled with a particular application. As a result, few general-purpose frameworks for secondary indexing on BigTable-like databases exist, and those that do are tied to a particular implementation of the BigTable model.
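A minimal sketch (not Culvert's implementation) of the secondary-index pattern just described: each write to the data table is mirrored by a write to an index table whose row key begins with the indexed column value, so lookups by that value avoid a full table scan. Table and column names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexedWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable data = new HTable(conf, "users");          // hypothetical data table
    HTable index = new HTable(conf, "users_by_email"); // hypothetical index table

    byte[] rowId = Bytes.toBytes("user42");
    byte[] email = Bytes.toBytes("a@example.com");

    // Primary write: the record itself.
    Put dataPut = new Put(rowId);
    dataPut.add(Bytes.toBytes("info"), Bytes.toBytes("email"), email);
    data.put(dataPut);

    // Index write: row key = indexed value + row id, so one value can map to many rows.
    Put indexPut = new Put(Bytes.add(email, rowId));
    indexPut.add(Bytes.toBytes("ref"), Bytes.toBytes("row"), rowId);
    index.put(indexPut);

    data.close();
    index.close();
  }
}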

We developed a solution to this problem called Culvert that supports online index updates as well as a variation of the HIVE query language. In designing Culvert, we sought to make the solution pluggable so that it can be used on any of the many BigTable-like databases (HBase, Cassandra, etc.). We will discuss our experiences implementing secondary indexing solutions over multiple underlying data stores, and how these experiences drove design decisions in creating the Culvert framework. We will also discuss our efforts to integrate HIVE on top of multiple indexing solutions and databases, and how we implemented a subset of HIVE's query language on Culvert.

Top

Oozie : Scheduling Workflows on the Grid

Mohammad Islam, Yahoo!

Oozie is an open-source workflow scheduling system that manages data processing jobs for Apache Hadoop. It currently supports two levels of abstraction for application development. The first level, called Oozie workflow management, enables users to specify job dependencies as a directed acyclic graph (DAG). The second level, called the Oozie coordinator, enables users to schedule any workflow based on time frequency and/or data dependency.
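For a sense of how a workflow application is submitted, here is a minimal sketch using the Oozie Java client; the server URL, application path, and the workflow's parameter names are assumptions for the example.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
  public static void main(String[] args) throws Exception {
    // Hypothetical Oozie server URL.
    OozieClient client = new OozieClient("http://oozie.example.com:11000/oozie");

    Properties conf = client.createConfiguration();
    // HDFS directory containing workflow.xml (the DAG definition); path is hypothetical.
    conf.setProperty(OozieClient.APP_PATH, "hdfs://nn.example.com:8020/user/demo/app");
    conf.setProperty("nameNode", "hdfs://nn.example.com:8020");   // parameters referenced
    conf.setProperty("jobTracker", "jt.example.com:8021");        // inside workflow.xml

    String jobId = client.run(conf);  // submits and starts the workflow
    System.out.println("Workflow job id: " + jobId);
  }
}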

Oozie 3.0 introduces a new abstraction called the Oozie bundle. A bundle enables users to batch a set of coordinator applications; more specifically, it lets users define and execute a group of coordinator applications that together are often called a data pipeline. Users are also able to start/stop/suspend/resume/rerun at the bundle level. In addition, Oozie 3.0 includes enhancements to the operability, stability and scalability of Oozie servers that will benefit all users. In particular, the redesign of coordinator job status handling and internal queue processing significantly reduces support overhead.

Oozie has improved a lot in the last few years; however, the diverse use cases originating from various large data pipelines indicate a need for further improvement. Oozie currently polls the namenode to verify data dependencies, and this polling is the main obstacle to Oozie’s scalability. In a future release, Oozie will address this issue by integrating with a metadata system (e.g. HCatalog) to receive data availability notifications without polling. Oozie will also support automatic failover without human intervention.

The main objective of this presentation is to inform the Hadoop community about the state of the Oozie project, to review the challenges and achievements from the previous year, and to highlight the new features and future development plans. Another objective is to solicit feedback directly from users about Oozie’s current features and Oozie’s future direction.

Top

Tying the threads together: Hadoop performance monitoring and diagnosis

Henry Robinson, Cloudera

A Hadoop deployment is a complex thing. The interaction of a hierarchy of different systems - from threads, JVMs, tasktrackers and datanodes, through the jobtracker and the namenode, to individual machines, racks and even multiple clusters - gives rise to a very complex set of behaviours that makes it hard to understand the performance and problems of your cluster. How does a set of slow hard disks affect cluster utilisation? How does one correlate the exceptions thrown by a tasktracker with the failure of a particular job? How do we identify which signal is cause, and which is effect?

We're building a system (Cloudera Activity Monitor) designed to help with some of these operational questions. We'll discuss the challenges of making sense of input from thousands of different sources, and the solutions we have found for finding and visualising meaningful signal amongst the noise. We'll compare our approach to existing monitoring solutions (Ganglia, Nagios, et al.), give some examples of use from the real world, and discuss some promising areas of future development.

Top

Giraph : Large-scale graph processing on Hadoop

Avery Ching, Yahoo! and Christian Kunz, Jybe

Web and online social graphs have been rapidly growing in size and scale during the past decade. In 2008, Google estimated that the number of web pages reached over a trillion. Online social networking and email sites, including Yahoo!, Google, Microsoft, Facebook, LinkedIn, and Twitter, have hundreds of millions of users and are expected to grow much more in the future. Processing these graphs plays a big role in relevant and personalized information for users, such as results from a search engine or news in an online social networking site.

Graph processing platforms that run large-scale algorithms (such as PageRank, shared connections, personalization-based popularity, etc.) have become quite popular; recent examples include Pregel and HaLoop. For general-purpose big data computation, the MapReduce computing model has been widely adopted, and the most deployed MapReduce infrastructure is Apache Hadoop. We have implemented a graph-processing framework that is launched as a typical Hadoop job, so it can leverage existing Hadoop infrastructure such as Amazon’s EC2. Giraph builds upon the graph-oriented nature of Pregel but additionally adds fault tolerance to the coordinator process, using ZooKeeper as its centralized coordination service, and is in the process of being open-sourced.

In this talk we describe the Giraph architecture and infrastructure. Like Pregel, Giraph follows the bulk-synchronous parallel model applied to graphs, in which vertices can send messages to other vertices during a given superstep. Checkpoints are initiated by the Giraph infrastructure at user-defined intervals and are used to restart the application automatically when any worker fails. Any worker in the application can act as the application coordinator, and one will automatically take over if the current coordinator fails. In addition to discussing the Giraph infrastructure, we describe our early experiences using Giraph to do feature generation on web-of-objects data at Yahoo!.
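To make the vertex-centric model concrete, below is a generic Pregel-style sketch (an illustration, not Giraph's actual API): in each superstep a vertex consumes the messages sent to it in the previous superstep, updates its value, sends messages along its out-edges, and votes to halt. The single-source shortest-paths example assumes unit edge weights and vertex 0 as the source.

import java.util.List;

public abstract class SimpleVertex {
  protected long id;
  protected double value = Double.POSITIVE_INFINITY;
  protected List<Long> outEdges;

  /** Called once per superstep with the messages delivered to this vertex. */
  public abstract void compute(Iterable<Double> messages, int superstep);

  /** Queue a message for delivery to the target vertex in the next superstep (framework-provided). */
  protected void sendMessage(long target, double message) { /* handled by the framework */ }

  /** Stop participating until new messages arrive (framework-provided). */
  protected void voteToHalt() { /* handled by the framework */ }
}

// Example use of the model: single-source shortest paths.
class ShortestPathVertex extends SimpleVertex {
  @Override
  public void compute(Iterable<Double> messages, int superstep) {
    double minDist = (superstep == 0 && id == 0) ? 0.0 : Double.POSITIVE_INFINITY;
    for (double m : messages) {
      minDist = Math.min(minDist, m);
    }
    if (minDist < value) {
      value = minDist;
      for (long neighbor : outEdges) {
        sendMessage(neighbor, minDist + 1.0);  // unit edge weights for the sketch
      }
    }
    voteToHalt();
  }
}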

Top