Flickr: hadoopsummit
Twitter: #hadoopsummit
Hyatt Regency, Santa Clara, CA
June 29, 2010
Time          Agenda

08:30-09:00   Registration and Coffee (sponsored by IBM)

09:00-10:15   Big Data and the Power of Hadoop  [View Video]
              Blake Irving, Executive Vice President and Chief Products Officer, Yahoo!

              Hadoop and The Future of Internet Scale Cloud Computing  [View Video]
              Shelton Shugar, Senior Vice President, Cloud Computing, Yahoo!

              Scaling Hadoop  [View Video]
              Eric Baldeschwieler, Vice President, Hadoop Software Development, Yahoo!

10:15-10:30   Coffee Break (sponsored by IBM)

10:30-11:00   Making Hadoop Enterprise Ready with Amazon Elastic MapReduce  [View Video]
              Peter Sirota, General Manager, Elastic MapReduce

11:00-11:30   Hadoop Grows Up  [View Video]
              Doug Cutting, Cloudera

11:30-12:00   Inside Large-Scale Analytics at Facebook  [View Video]
              Mike Schroepfer, VP of Engineering, Facebook

12:00-1:30    Lunch Break (co-sponsored by Amazon Web Services, Cloudera, Datameer, Karmasphere)

Afternoon sessions run in three parallel tracks: the Developers Track, the Applications Track, and the Research Track.

1:30-2:00     Developers:    Hadoop Security in Detail  [View Slides]
                             Owen O'Malley, Yahoo!
              Applications:  Disruptive Applications with Hadoop  [View Slides]
                             Rod Smith, VP, IBM Emerging Internet Technologies
              Research:      Design Patterns for Efficient Graph Algorithms in MapReduce  [View Slides]
                             Jimmy Lin, Michael Schatz, University of Maryland

2:00-2:30     Developers:    Hive Integration: HBase and RCFile  [View Slides]
                             John Sichi and Yongqiang He, Facebook
              Applications:  ZettaVox: Content Mining and Analysis Across Heterogeneous Compute Clouds  [View Slides]
                             Mark Davis, Kitenga
              Research:      Mining Billion-node Graphs: Patterns, Generators and Tools  [View Slides]
                             Christos Faloutsos, Carnegie Mellon University

2:30-3:00     Developers:    Hadoop and Pig at Twitter  [View Slides]
                             Kevin Weil, Twitter
              Applications:  Biometric Databases and Hadoop  [View Slides]
                             Jason Trost, Abel Sussman and Lalit Kapoor, Booz Allen Hamilton
              Research:      XXL Graph Algorithms  [View Slides]
                             Sergei Vassilvitskii, Yahoo! Labs

3:00-3:30     Developers:    Developer's Most Frequent Hadoop Headaches & How to Address Them  [View Slides]
                             Shevek Mankin, Karmasphere
              Applications:  Hadoop - Integration Patterns and Practices  [View Slides]
                             Eric Sammer, Cloudera
              Research:      Efficient Parallel Set-Similarity Joins Using Hadoop  [View Slides]
                             Chen Li, University of California, Irvine

3:30-4:00     Coffee Break (sponsored by IBM)

4:00-4:30     Developers:    Workflow on Hadoop Using Oozie  [View Slides]
                             Alejandro Abdelnur, Yahoo!
              Applications:  Winning the Big Data SPAM Challenge  [View Slides]
                             Stefan Groschupf, Datameer; Florian Leibert, Erich Nachbar
              Research:      Exact Inference in Bayesian Networks using MapReduce  [View Slides]
                             Alex Kozlov, Cloudera

4:30-5:00     Developers:    Cascalog: an Interactive Query Language for Hadoop  [View Slides]
                             Nathan Marz, BackType
              Applications:  Data Applications and Infrastructure at LinkedIn  [View Slides]
                             Jay Kreps, LinkedIn
              Research:      Hadoop for Scientific Workloads  [View Slides]
                             Lavanya Ramakrishnan, Lawrence Berkeley National Lab

5:00-5:30     Developers:    Honu - A Large Scale Streaming Data Collection and Processing Pipeline  [View Slides]
                             Jerome Boulon, Netflix
              Applications:  Online Content Optimization with Hadoop  [View Slides]
                             Amit Phadke, Yahoo!
              Research:      Hadoop for Genomics  [View Slides]
                             Jeremy Bruestle, Spiral Genetics

5:30-6:15     Developers:    Hadoop Frameworks Panel: Pig, Hive, Cascading, Cloudera Desktop, LinkedIn Voldemort, Twitter ElephantBird  [View Slides]
                             Moderator: Sanjay Radia, Yahoo!
              Applications:  Hadoop Customer Panel: Amazon Elastic MapReduce
                             Moderator: Deepak Singh, Amazon Web Services
              Research:      Parallel Distributed Image Stacking and Mosaicing with Hadoop  [View Slides]
                             Keith Wiley, University of Washington

6:30-8:30     Evening Reception (sponsored by LightSpeed)


Abstracts

Big Data and the Power of Hadoop

Blake Irving, Executive Vice President and Chief Products Officer, Yahoo!

Behind every click across Yahoo!, there is a global cloud computing infrastructure, heavily reliant on Hadoop. We see the magic in crunching unimaginable volumes of data to create increasingly relevant Internet experiences for people around the world – and Hadoop is the technology behind that magic.

Blake Irving is Executive Vice President and Chief Products Officer at Yahoo!. Irving leads Yahoo!'s Products organization, which is responsible for the vision, strategy, design and development of Yahoo!'s global consumer and advertiser product portfolio. Irving is focused on building unique and highly personal experiences for Yahoo!'s consumers, delivering on Yahoo!'s promise of Science, Art and Scale to its advertisers, and continuing to deliver more and faster innovations to the market.

Hadoop and The Future of Internet Scale Cloud Computing

Shelton Shugar, Senior Vice President, Cloud Computing, Yahoo!

Looking to the future, Yahoo!’s vision for the cloud is closely intertwined with open source computing. As implementer of the world’s largest Hadoop deployment, Yahoo! serves as the canary in the coal mine, signaling emerging trends and challenges that the industry will face as Hadoop adoption continues its sharp rise.

Shelton Shugar is Senior Vice President of Cloud Computing at Yahoo!. Today nearly all of Yahoo!’s consumer and advertiser experiences rely on the cloud. Yahoo!’s cloud stores and processes Yahoo!’s massive amounts of data and content and delivers it in a reliable, scalable way, making it accessible to users around the world on a variety of platforms and form factors.

Scaling Hadoop

Eric Baldeschwieler, Vice President, Hadoop Software Development, Yahoo!

At Yahoo!, we’re engineering for global scale, and constantly expanding the frontiers of Hadoop. Hear how we’re scaling Hadoop to handle larger clusters, growing application complexity and evolving business needs.

Eric Baldeschwieler is Vice President of Hadoop at Yahoo!. Eric leads the development team responsible for designing and deploying Hadoop across Yahoo!’s global infrastructure, the world’s largest implementation of the technology. A team at Yahoo!, including Eric and Doug Cutting, initiated the Apache Hadoop project in 2005 and has taken Hadoop from a 20-node prototype to a 35,000-node production deployment.

Making Hadoop Enterprise Ready with Amazon Elastic MapReduce

Peter Sirota, General Manager, Elastic MapReduce

As the demand for cloud-based analysis of large data sets explodes, customers of Amazon Web Services have wanted to leverage Hadoop in the cloud. Last year we introduced Amazon Elastic MapReduce, which manages the complex and time-consuming set-up, management and tuning of both Hadoop clusters and the compute capacity upon which they sit. A user can instantly spin up large Hadoop job flows and begin processing within minutes.
Over the last year AWS has worked with current users to develop new features that make it even easier to run Hadoop applications in the cloud. In this session, we will review lessons learned from those users, discuss recent improvements to Amazon Elastic MapReduce, and look at key features coming in the near future. In addition, we will discuss developments in the ecosystem, where offerings built on top of Amazon Elastic MapReduce have made it an even more compelling solution for enterprise Big Data analytics.

Peter Sirota is the General Manager of Amazon Elastic MapReduce. Before starting Amazon Elastic MapReduce, Peter served as Sr. Manager of Software Development at Amazon Web Services, leading the billing, authentication, and portal teams, and was responsible for launching the Amazon DevPay service. Prior to Amazon.com, he managed various development organizations at RealNetworks. Prior to RealNetworks, Peter worked at Microsoft. Peter holds a bachelor's degree in computer science from Northeastern University.

Inside Large-Scale Analytics at Facebook

Mike Schroepfer, VP of Engineering, Facebook

Facebook has one of the largest Hadoop clusters in the world. Come hear from Mike Schroepfer, Facebook's VP of Engineering, about how the social networking service uses Hadoop and Hive to analyze data at scale. The talk will cover how the company deploys the technology and what products utilize the power of Hadoop and Hive.

Mike Schroepfer is the Vice President of Engineering at Facebook. Mike is responsible for harnessing the engineering organization's culture of speed, creativity and exploration to build products, services and infrastructure that support the company's users, developers and partners around the world. Before coming to Facebook, Mike was the Vice President of Engineering at Mozilla Corporation, where he led the global, collaborative, open and participatory product development process behind Mozilla's popular software, such as the Firefox web browser. Mike was formerly a distinguished engineer at Sun Microsystems where he was the Chief Technology Officer for the data center automation division ("N1"). He was also the founder, Chief Architect and Director of Engineering at CenterRun, which was acquired by Sun. Mike worked with several startups at the outset of his career, including a digital effects software startup where he built software that has been used in several major motion pictures. Mike holds bachelor's and master's degrees in computer science from Stanford University and has filed two U.S. patents.

Hadoop Grows Up

Doug Cutting, Cloudera

Hadoop project founder Doug Cutting will give us his unique insight and perspective into how the project has evolved over the last four years as new sub-projects have grown up and out. He will then give us a glimpse of how he sees things changing going forward as the data management industry more broadly embraces Hadoop for information and analysis.
Doug is the creator of numerous successful open source projects, including Lucene, Nutch and Hadoop. Doug joined Cloudera in 2009 from Yahoo!, where he was a key member of the team that built and deployed a production Hadoop storage and analysis cluster for mission-critical business analytics. Doug holds a Bachelor's degree from Stanford University and sits on the Board of the Apache Software Foundation.

Disruptive Applications with Hadoop

Rod Smith, VP, IBM Emerging Internet Technologies

Finding the insights you need from data is like finding the proverbial needle in the haystack, but today the haystack is bigger than ever. Whether it's on the Web or in their datacenters, businesses know that to keep pace with their competition, they need to be able to analyze massive amounts of information quickly and in a cost effective way.
That's where Hadoop comes in. As the world is learning to cope with the explosive growth of data, Hadoop is emerging as a disruptive technology that can power new classes of big data applications while integrating with existing middleware infrastructures; applications like monitoring massive datasets, computationally intensive jobs for evidence-based medicine, analysis for fraud detection, and more.
It's Hadoop that's making these next-generation solutions feasible, many of which will provide a disruptive advantage for those that implement them. For example, your competition might wait for weeks to get insights from 40 terabytes of data, while you get your results in a day. Or imagine seeing a simple tag cloud that represents the data from millions of PDFs without manually sifting through them. What about seeing how people on Twitter feel about your latest press release - is it good or bad? Getting information like this, efficiently and at a low cost, can put you in the lead.
In this presentation you'll hear how IBM is using Hadoop as a platform for applications and see a demonstration of how you can harness the power of this disruptive technology to help leverage the value of big data and big data analytics.

Rod Smith is an IBM Fellow and Vice President of the IBM Emerging Internet Technologies organization, where he leads a group of highly technical innovators who are developing solutions to help businesses realize the value of Web 2.0 to the enterprise. In his many years in the industry, Rod has moved IBM and the IT community toward rapid adoption of technologies such as Web services, enterprise mashups, XML, Linux, J2EE, rich internet applications, and various wireless standards. As an IBM Fellow, Rod is helping lead IBM's strategic planning around Web 2.0 technologies and practices, with a focus on how these technologies can bring real business value to IBM's customers.

Hadoop Security in Detail

Owen O'Malley, Yahoo!

Interested in knowing what the Hadoop security release comprises and how to leverage it to solve business challenges? This talk will primarily cover the user and developer impact, with a peek at the implementation.
Owen O'Malley is a software architect on Hadoop working for Yahoo's Grid team, which is part of the Cloud Computing & Data Infrastructure group. He has been contributing patches to Hadoop since before it was separated from Nutch, and is the chair of the Hadoop Project Management Committee. Before working on Hadoop, he worked on Yahoo Search's WebMap project, which builds a graph of the known web and applies many heuristics to the entire graph that control search. Prior to Yahoo, he wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), and software model checking (NASA). He received his PhD in Software Engineering from the University of California, Irvine.

Hive Integration: HBase and RCFile

John Sichi and Yongqiang He, Facebook

John Sichi will discuss Facebook's recent integration of two related projects in the Hadoop ecosystem: HBase and Hive. This integration gives powerful SQL query capabilities to HBase and brings the potential for low-latency incremental data refresh to Hive. The talk will go over performance results from initial testing of the integration. Yongqiang will discuss RCFile, a columnar storage format for Hive. It is already deployed within Facebook, and we are in the process of converting old partitions to RCFile. Depending on the data layout, it has resulted in ~20% space savings.
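
As a toy illustration of the idea behind a columnar layout such as RCFile (this is not the RCFile format itself): grouping each column's values together lets a query read only the columns it needs, and homogeneous columns also tend to compress better, which is where savings of the kind reported above come from. The sample data below is purely illustrative.

    import json

    # A few thousand clickstream-style rows with repeated values per column.
    rows = [{'query': 'hadoop summit', 'country': 'US', 'clicks': i % 7}
            for i in range(5000)]

    # Row-oriented layout: a query must deserialize whole records.
    row_blob = json.dumps(rows).encode()

    # Column-oriented layout: all values of a column are stored together, so a
    # query that only touches 'clicks' can skip the other columns entirely.
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    clicks_blob = json.dumps(columns['clicks']).encode()

    print('bytes scanned, row layout:          ', len(row_blob))
    print("bytes scanned, 'clicks' column only:", len(clicks_blob))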

ZettaVox: Content Mining and Analysis Across Heterogeneous Compute Clouds

Mark Davis, Kitenga

ZettaVox is an enterprise content mining application that combines crawling, extraction, monitoring, and analysis in a unified solution. Working from a sophisticated Flex-based user interface, ZettaVox users can visually author complex Hadoop workflows that end in the visualization of result sets using out-of-the-box tools for biological informatics, natural language analysis, automatic classification, and related tasks. In this talk, Kitenga CTO and ZettaVox designer Mark Davis will demonstrate ZettaVox capabilities on clusters and using cloud-based data resources. He will further show how ZettaVox can extend the reach of cluster-based computing solutions built on Hadoop to include commodity supercomputers based on graphics processing units (GPUs) that implement MapReduce formalisms.
Kitenga is a growing start-up focused on content mining solutions for big data problems. Kitenga is currently supported by the Office of the Secretary of Defense, Office of Naval Research, and US Army, with the goal of creating massive-scale information exploitation capabilities for government and commercial customers.

Mining Billion-node Graphs: Patterns, Generators and Tools

Christos Faloutsos, Carnegie Mellon University

What do graphs look like? How do they evolve over time? How do we handle a graph with a billion nodes? We present a comprehensive list of static and temporal laws, and some recent observations on real graphs (like, e.g., "eigenSpokes"). For generators, we describe some recent ones, which naturally match all of the known properties of real graphs. Finally, for tools, we present "oddBall" for discovering anomalies and patterns, as well as an overview of the PEGASUS system, which is designed for handling billion-node graphs and runs on top of Hadoop.

Hadoop and Pig at Twitter

Kevin Weil, Twitter

Massive growth in the size of business datasets leads many companies to Hadoop, an emerging architecture for parallel data processing. However, the migration path can be challenging, in part because MapReduce analyses use programming languages like Java and Python rather than SQL. Apache Pig is a high-level framework built on top of Hadoop that offers a powerful yet vastly simplified way to analyze data in Hadoop. It allows businesses to leverage the power of Hadoop in a simple language readily learnable by anyone who understands SQL. In this presentation, I will introduce Pig and show how it has been used at Twitter to solve numerous analytics challenges that had become intractable with our former MySQL-based architecture.

Design Patterns for Efficient Graph Algorithms in MapReduce

Jimmy Lin, Michael Schatz, University of Maryland

Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the world wide web, module detection of protein-protein interaction networks, and privacy analysis of social networks. MapReduce provides an enabling technology for large-scale graph processing. However, there appears to be a paucity of knowledge on designing scalable graph algorithms. Existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serializing, and distributing the graph. We present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
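
For readers new to the message-passing pattern mentioned above, here is a minimal Python sketch of one PageRank iteration expressed as map and reduce functions, with the shuffle simulated in memory. The toy graph, the damping factor of 0.85, and the omission of dangling-node handling are illustrative simplifications; this is not the authors' implementation.

    from collections import defaultdict

    def pagerank_map(node_id, node):
        """Emit the node structure plus a rank 'message' to each neighbor."""
        yield node_id, ('STRUCTURE', node['adj'])
        out_degree = len(node['adj']) or 1
        for neighbor in node['adj']:
            yield neighbor, ('MASS', node['rank'] / out_degree)

    def pagerank_reduce(node_id, values, num_nodes, damping=0.85):
        """Recover the node structure and sum the incoming rank messages."""
        adj, mass = [], 0.0
        for kind, payload in values:
            if kind == 'STRUCTURE':
                adj = payload
            else:
                mass += payload
        rank = (1 - damping) / num_nodes + damping * mass
        return node_id, {'adj': adj, 'rank': rank}

    def run_iteration(graph):
        """Simulate the shuffle Hadoop would perform between map and reduce.
        Simplified: dangling-node mass and convergence checks are omitted."""
        grouped = defaultdict(list)
        for node_id, node in graph.items():
            for key, value in pagerank_map(node_id, node):
                grouped[key].append(value)
        return dict(pagerank_reduce(k, v, len(graph)) for k, v in grouped.items())

    if __name__ == '__main__':
        # Toy three-node graph; every node starts with rank 1/N.
        graph = {'a': {'adj': ['b', 'c'], 'rank': 1 / 3},
                 'b': {'adj': ['c'], 'rank': 1 / 3},
                 'c': {'adj': ['a'], 'rank': 1 / 3}}
        for _ in range(10):
            graph = run_iteration(graph)
        print({k: round(v['rank'], 3) for k, v in graph.items()})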

Biometric Databases and Hadoop

Jason Trost, Abel Sussman and Lalit Kapoor, Booz Allen Hamilton

Over the next few years, biometric databases for the Federal Bureau of Investigation (FBI), Department of State (DoS), Department of Defense (DoD) and the Department of Homeland Security (DHS) are expected to grow to accommodate hundreds of millions, if not billions, of identities. As an example, the DHS IDENT database currently holds 110 million identities and enrolls or verifies over 125,000 per day. Associated with each identity in these databases are several biometric modalities, including fingerprint scans, palm prints, face images, and iris scans. This corresponds to tens of petabytes of biometric material in relational database storage [1]. In the near future, the systems supported by these databases are expected to perform more accurate matches in ever-shortening amounts of time.
Biometric databases are typically used for two main applications: identification and verification. Identification is determining the identity of an individual given a biometric measurement (one-to-many search), and verification is determining whether an individual is who they claim to be (one-to-one matching). With traditional vertical scaling models, the identification problem can become prohibitively expensive or infeasible as the number of modality records grows into the billions and the amount of data grows into the petabyte scale. Investments by Relational Database Management System (RDBMS) vendors in providing scalable multi-petabyte solutions have not yet produced concrete results to meet these needs.
As these biometric systems grow, distributed computing platforms such as Hadoop/MapReduce may be a feasible solution to this problem. In this presentation we outline the magnitude of the problem and evaluate algorithms and solutions using Hadoop and MapReduce. We also discuss open source biometric algorithms, including BOZORTH3 (fingerprint matching) and IrisCode (iris scan matching), and how they can be optimized for deployment over Hadoop/MapReduce. Techniques for reducing the search space and indexing biometric databases already exist in academia, and we discuss how these techniques benefit from the parallel algorithms included in Mahout, such as k-means clustering.
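
As a rough sketch of how the one-to-many identification search parallelizes over a partitioned gallery (an illustration, not the presenters' system): each map task scores the probe against the records in its split and a reduce step keeps the best match. The feature vectors and the cosine-style scorer below are stand-ins for a real matcher such as BOZORTH3.

    import math

    def match_score(probe, candidate):
        """Stand-in similarity score (cosine similarity); a real system would
        call a dedicated matcher such as BOZORTH3 or IrisCode here."""
        dot = sum(p * c for p, c in zip(probe, candidate))
        norm = (math.sqrt(sum(p * p for p in probe)) *
                math.sqrt(sum(c * c for c in candidate)))
        return dot / norm if norm else 0.0

    def mapper(probe, gallery_split):
        """Map phase: score the probe against every record in this split."""
        for identity, template in gallery_split:
            yield identity, match_score(probe, template)

    def reducer(scored):
        """Reduce phase: keep the highest-scoring identity (one-to-many search)."""
        return max(scored, key=lambda pair: pair[1])

    if __name__ == '__main__':
        probe = [0.9, 0.1, 0.4]
        # Two 'splits' standing in for gallery partitions stored in HDFS.
        splits = [[('id-001', [0.8, 0.2, 0.5]), ('id-002', [0.1, 0.9, 0.3])],
                  [('id-003', [0.9, 0.1, 0.35])]]
        scored = [pair for split in splits for pair in mapper(probe, split)]
        print(reducer(scored))    # prints the best-matching identity and score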

XXL Graph Algorithms

Sergei Vassilvitskii, Yahoo! Labs

The MapReduce framework is now a de facto standard for massive dataset computations. However, many elementary graph algorithms are inherently sequential and appear hard to parallelize (often requiring a number of rounds proportional to the diameter of the graph). In this talk we describe a different approach, called filtering, to implementing fundamental graph algorithms like computing connected components and minimum spanning trees. We note that filtering can also be applied to speed up general clustering algorithms like k-means. Finally, we describe how to apply the technique to find tight-knit friend groups in a social network.
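
A minimal Python sketch of the filtering idea as applied to minimum spanning trees, based on the description above rather than the speaker's code: split the edge list into memory-sized chunks, keep only each chunk's local MST edges (edges discarded this way cannot belong to the global MST, by the cycle property), and finish in memory once the surviving edges fit.

    def kruskal(edges, num_nodes):
        """Classic in-memory Kruskal: return the edges of a minimum spanning forest.
        Edges are (weight, u, v) with node ids 0..num_nodes-1."""
        parent = list(range(num_nodes))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        kept = []
        for w, u, v in sorted(edges):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                kept.append((w, u, v))
        return kept

    def filtering_mst(edges, num_nodes, memory_limit):
        """Filtering: while the edge set is too big, partition it into chunks
        (one 'reducer' per chunk), keep only each chunk's local MST edges, and
        repeat. Assumes memory_limit is comfortably larger than num_nodes."""
        while len(edges) > memory_limit:
            chunks = [edges[i:i + memory_limit]
                      for i in range(0, len(edges), memory_limit)]
            filtered = [e for chunk in chunks for e in kruskal(chunk, num_nodes)]
            if len(filtered) == len(edges):   # chunks too small to filter anything
                break
            edges = filtered
        return kruskal(edges, num_nodes)

    if __name__ == '__main__':
        # Small weighted graph: (weight, u, v).
        edges = [(4, 0, 1), (8, 0, 2), (1, 1, 2), (2, 1, 3),
                 (7, 2, 3), (3, 3, 4), (9, 2, 4)]
        print(filtering_mst(edges, num_nodes=5, memory_limit=5))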

Developer's Most Frequent Hadoop Headaches & How to Address Them

Shevek Mankin, Karmasphere

In this interactive session, we will discuss and present solutions to the challenges most frequently encountered during development and deployment of MapReduce applications. The session content reflects the result of scouring customer mailing lists, forums and user comments for the most common application development problems and questions. We will prepare you for things like debugging a combiner or reducer in the middle of a workflow, diagnosing performance bottlenecks, integrating third party libraries into a Hadoop job, or developing on a Windows workstation. From addressing common prototyping challenges to coping effectively with Hadoop updates and upgrades, this session will appeal to those recently starting with Hadoop as well as seasoned experts. Bring your development questions and be ready to walk away with practical answers to the most frequently encountered problems with developing MapReduce and Hadoop applications -- and some uncommon ones you may hit as well.

Hadoop - Integration Patterns and Practices

Eric Sammer, Cloudera

Hadoop is a powerful platform for data analysis and processing, but many struggle to understand how it fits in with regard to existing infrastructure and systems. A series of common integration points, technologies, and patterns are defined and illustrated in this presentation.
We take a look at job initiation, sequencing and scheduling, data input from various sources (e.g. DBMS, messaging systems), and data output to various sinks (DBMS, messaging systems, caching systems). Attendees will see how integration patterns and best practices can be applied to Hadoop and its related projects.
This talk focuses on the suitability and architecture of these integration patterns. Care is taken not to duplicate material on specific tools that is likely to be covered in other talks.

Workflow on Hadoop Using Oozie

Alejandro Abdelnur, Yahoo!

Oozie v1 is a PDL workflow server engine for Hadoop that enables the creation of workflow jobs composed of several MapReduce jobs, Pig jobs, HDFS operations, and Java processes. Workflow jobs are monitored as a single unit via web services, a Java API and/or a web console. Oozie v1 is in production at Yahoo!, and it has become the standard way to run computations that require the coordination of multiple Hadoop/Pig jobs.
Oozie v2 is currently entering production. Among its key new features are time-based activation of workflow jobs (crontab-like) and data activation of workflow jobs (starting a workflow job when its input data becomes available).
Oozie v1 has already been open sourced and Oozie v2 is being open sourced, both at GitHub. During this talk we'll go over the steps to get Oozie up and running.
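
As a toy illustration of the workflow ideas described above (not Oozie itself, which defines workflows in an XML process definition language and runs real Hadoop actions), the Python sketch below runs a small DAG of named actions in dependency order and only starts once its input data is available, mimicking data activation. The action names and input path are illustrative.

    import os
    from graphlib import TopologicalSorter   # Python 3.9+

    def make_workflow(input_path):
        # Hypothetical actions standing in for map-reduce/Pig/HDFS steps.
        actions = {
            'import':   lambda: print('hdfs op: staging', input_path),
            'mr-step':  lambda: print('map-reduce job over staged data'),
            'pig-step': lambda: print('pig job over map-reduce output'),
            'publish':  lambda: print('hdfs op: publishing results'),
        }
        dependencies = {'mr-step': {'import'},
                        'pig-step': {'mr-step'},
                        'publish': {'pig-step'}}
        return actions, dependencies

    def run_when_data_available(input_path):
        """Data activation, crudely: only start the workflow when the input exists."""
        if not os.path.exists(input_path):
            print('input not yet available; a coordinator would retry later')
            return
        actions, dependencies = make_workflow(input_path)
        for name in TopologicalSorter(dependencies).static_order():
            actions[name]()          # each action runs after its dependencies

    if __name__ == '__main__':
        run_when_data_available('/tmp')
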
Alejandro has been working at Yahoo since 2006. Since joining Yahoo he has been involved in enterprise systems built on top of Hadoop. Alejandro has designed and partially implemented Oozie. He is also a contributor to Apache Hadoop. Before joining Yahoo, Alejandro worked at Google, Sun Microsystems and Sybase Argentina.

Efficient Parallel Set-Similarity Joins Using Hadoop

Chen Li, University of California, Irvine

A set-similarity join (SSJ) finds pairs of set-based records such that each pair is similar enough based on a similarity function and a threshold. Many applications require efficient SSJ solutions, such as record linkage and plagiarism detection. We study how to efficiently perform SSJs on large data sets using Hadoop. We propose a 3-stage approach to the problem. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop. The results of the research, including its publication and source code, are available at http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/
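
For readers unfamiliar with the problem, here is a compact Python sketch of a set-similarity self-join using Jaccard similarity, with records grouped by token to generate candidate pairs. It is a simplified stand-in for the 3-stage Hadoop algorithm described above: no prefix filtering, no R-S join, and everything runs in memory.

    from collections import defaultdict
    from itertools import combinations

    def jaccard(a, b):
        """Jaccard similarity of two token sets."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    def set_similarity_self_join(records, threshold):
        """Group record ids by token (map side), generate candidate pairs within
        each group, then verify each candidate against the threshold."""
        by_token = defaultdict(list)
        for rid, tokens in records.items():
            for token in set(tokens):
                by_token[token].append(rid)

        candidates = set()
        for rids in by_token.values():
            for r1, r2 in combinations(sorted(rids), 2):
                candidates.add((r1, r2))

        results = []
        for r1, r2 in sorted(candidates):
            sim = jaccard(records[r1], records[r2])
            if sim >= threshold:
                results.append((r1, r2, sim))
        return results

    if __name__ == '__main__':
        records = {1: ['big', 'data', 'hadoop'],
                   2: ['big', 'data', 'hive'],
                   3: ['graph', 'mining']}
        # Records 1 and 2 share 2 of 4 distinct tokens: Jaccard = 0.5.
        print(set_similarity_self_join(records, threshold=0.5))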

Winning the Big Data SPAM Challenge

Stefan Groschupf, Datameer; Florian Leibert, Erich Nachbar

Worldwide spam volumes this year are forecast to rise by 30 to 40% compared with 2009. Spam recently reached a record 92% of total email. Spammers have turned their attention to social media sites as well. While in 2008 there were few Facebook phishing messages, Facebook is now the second most phished organization online. Even though Twitter has managed to recently bring its spam rate down to as low as 1%, the absolute volume of spam is still massive given its tens of millions of users.
Dealing with spam introduces a number of Big Data challenges. The sheer size and scale of the data is enormous. In addition, spam in social media involves the need to understand very complex patterns of behavior as well as to identify new types of spam. This presentation will provide an in-depth understanding of these issues as well as discuss and demonstrate how data analytics built on Hadoop can help businesses keep spam from spiraling out of control.

Exact Inference in Bayesian Networks using MapReduce

Alex Kozlov, Cloudera

Probabilistic inference is a way of obtaining values of unobservable variables from incomplete data. It is used in robotics, medical diagnostics, image recognition, finance and other fields. One of the tools for inference, and a way to represent knowledge, is the Bayesian network (BN), where nodes represent variables and edges represent probabilistic dependencies between variables. The advantage of exact probabilistic inference using BNs is that it does not involve the traditional Gaussian distribution assumptions, and the results are immune to Taleb's distributions, or distributions with a high probability of outliers.
A typical application of probabilistic inference is to infer the probability of one or several dependent variables, like the probability that a person has a certain disease, given other observations, like the presence of abdominal pain. In exact probabilistic inference, variables are clustered into groups, called cliques, and inference is carried out by manipulating more or less complex data structures on top of the cliques. This leads to high computational and space complexity: the data structures can become very complex and large. The advantage: one can encode arbitrarily complex distributions and dependencies.
While a lot of research has been devoted to devising schemes that approximate the solution, Hadoop allows performing exact inference on the whole network. We present an approach for performing large-scale probabilistic inference in probabilistic networks on a Hadoop cluster. Probabilistic inference is reduced to a number of MapReduce jobs over the data structures representing clique potentials. One of the applications is the CPCS BN, one of the biggest models, created at the Stanford Medical Informatics Center (now the Stanford Center for Biomedical Informatics Research) in 1994 and never solved exactly. In this specific network containing 422 nodes representing the states of different variables, 14 nodes describe diseases, 33 nodes describe history and risk factors, and the remaining 375 nodes describe various findings related to the diseases.
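
To make the clique-potential manipulation concrete, here is a small in-memory Python sketch of the two core operations, factor product and marginalization, on a toy two-variable disease/symptom network. The numbers are illustrative, and this is plain Python, not the MapReduce formulation used in the talk.

    from itertools import product

    class Factor:
        """A potential over a set of discrete variables, stored as a table."""
        def __init__(self, variables, table):
            self.variables = tuple(variables)    # e.g. ('D', 'S')
            self.table = dict(table)             # assignment tuple -> value

        def _lookup(self, assignment_by_var):
            return self.table[tuple(assignment_by_var[v] for v in self.variables)]

        def multiply(self, other):
            """Factor product: the basic operation for combining clique potentials."""
            joint_vars = self.variables + tuple(v for v in other.variables
                                                if v not in self.variables)
            domain = {v: sorted({a[f.variables.index(v)]
                                 for f in (self, other) if v in f.variables
                                 for a in f.table})
                      for v in joint_vars}
            table = {}
            for assignment in product(*(domain[v] for v in joint_vars)):
                value = dict(zip(joint_vars, assignment))
                table[assignment] = self._lookup(value) * other._lookup(value)
            return Factor(joint_vars, table)

        def marginalize(self, variable):
            """Sum out a variable, e.g. to pass a message between cliques."""
            keep = tuple(v for v in self.variables if v != variable)
            table = {}
            for assignment, value in self.table.items():
                reduced = tuple(a for v, a in zip(self.variables, assignment)
                                if v != variable)
                table[reduced] = table.get(reduced, 0.0) + value
            return Factor(keep, table)

    if __name__ == '__main__':
        # P(Disease) and P(Symptom | Disease) as potentials; illustrative numbers.
        p_d = Factor(['D'], {(True,): 0.01, (False,): 0.99})
        p_s_given_d = Factor(['D', 'S'], {(True, True): 0.9, (True, False): 0.1,
                                          (False, True): 0.2, (False, False): 0.8})
        joint = p_d.multiply(p_s_given_d)    # P(D, S)
        p_s = joint.marginalize('D')         # P(S): P(S=True) ~ 0.207
        print(p_s.table)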

Cascalog: an Interactive Query Language for Hadoop

Nathan Marz, BackType

Cascalog is an interactive query language for Hadoop with a focus on simplicity, expressiveness, and flexibility, intended to be used by analysts and developers alike.
Cascalog eschews SQL syntax in favor of a simpler and more expressive syntax based on Datalog. With this added expressiveness, Cascalog can query existing data stores "out of the box", with no data "importing" or "under the hood" configuration required. Because Cascalog sits on top of Clojure, a powerful JVM-based language with an interactive shell, adding new operations to a query is as simple as defining a new function. Cascalog relies on Cascading, a robust data processing API, for defining and running workflows.
In this presentation, I will introduce Cascalog and how it is used at BackType. I will further show how the Datalog syntax provides more robustness and flexibility than SQL-based languages. Finally, I'll demonstrate how the Cascalog, Clojure, and Cascading stack can be leveraged by advanced users who wish to build more complex queries and libraries in Java and Clojure for data processing, data mining, and machine learning.

Data Applications and Infrastructure at LinkedIn

Jay Kreps, LinkedIn

LinkedIn runs a number of large-scale Hadoop calculations to power its features, from computing similar profiles, jobs, and companies to predicting People You May Know recommendations that help our users find their professional connections. This talk will cover how Hadoop fits into a production data cycle for a consumer-scale social network, including some of the technology, infrastructure, and algorithms for calculating tens of billions of predictions in a social graph. This includes how we use our open source workflow scheduler Azkaban; how we use Pig and other open source technologies; how we handle daily, large, parallel data deployments; and how we do real-time serving and processing using our key-value storage system, Project Voldemort.

Hadoop for Scientific Workloads

Lavanya Ramakrishnan, Lawrence Berkeley National Lab

Scientific computing is composed of bulk-synchronous compute and data-intensive applications. Magellan, a recently funded cloud computing project, is investigating how cloud computing can serve the needs of mid-range computing and future data-intensive scientific workloads. The growth of data and the need for more resources have led the scientific community to explore the cloud as a resource platform. Scientific users are interested in using MapReduce and Hadoop to manage their scientific computations and data. However, the requirements of these scientific applications are significantly different from the Web 2.0 applications that have traditionally used Hadoop. We are evaluating the use of Hadoop and related technologies in support of scientific applications. In this presentation, we will a) outline the science requirements and discuss the use of Hadoop and related technologies such as HBase, b) present a performance comparison of a bioinformatics application using Hadoop on commercial cloud platforms such as Amazon EC2 and Yahoo! M45 against a high performance computing system, and c) present experiences and performance results from a local Hadoop and HBase installation with different file system and scheduling configurations specifically suited for scientific applications.

Honu - A Large Scale Streaming Data Collection and Processing Pipeline

Jerome Boulon, Netflix

Netflix moved a large portion of its infrastructure to the cloud in order to meet our reliability, scalability and availability requirements. As we solved those requirements on the application side, we opened up a new range of challenging problems on the log analytics side. As the number of instances running in the cloud increases, the standard approach of moving log files or loading log events into a database starts to saturate the system. As the volume explodes, the latency and throughput of the system degrade to the point where it is no longer usable for our operational/BI needs.
Honu is the new streaming log collection and processing pipeline in the cloud for Netflix, and it leverages the computational power of Hadoop/Hive to solve our log analytics requirements.
The Honu solution includes three main components: a client-side SDK; a collector, the server-side component that receives all log events; and a MapReduce job to process and parse those logs and save them in Hive format.
The client-side SDK provides all the classes you need in order to generate and collect your unstructured and structured log events in your application. The Honu collectors are a key component and are responsible for continuously saving log events to HDFS. The MapReduce job is responsible for parsing log events and storing them in their corresponding Hive table format.
The presentation will explain how we use these components to generate, collect and process all the unstructured and structured log events and make them available to our end users through Hive.
Honu will be released under the standard Apache license and will be hosted for now on GitHub. Honu has been running in production at Netflix for more than three months and has specific integrations to leverage the Amazon EC2, S3 and EMR infrastructure.
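
As a rough, Hadoop-Streaming-style illustration of the kind of work the MapReduce stage described above performs (not Honu's actual SDK or code), the sketch below parses hypothetical key=value log lines into tab-separated rows matching a Hive table's column order. The column names and sample events are illustrative assumptions.

    import sys

    # Columns of a hypothetical Hive table the parsed events are loaded into.
    HIVE_COLUMNS = ['ts', 'host', 'event', 'duration_ms']

    def parse_event(line):
        """Parse one 'key=value key=value ...' log line into a dict keyed by the
        Hive column names (missing keys become empty strings)."""
        fields = dict(part.split('=', 1) for part in line.strip().split() if '=' in part)
        return {col: fields.get(col, '') for col in HIVE_COLUMNS}

    def mapper(stream=sys.stdin, out=sys.stdout):
        """Hadoop Streaming style: read raw events and emit tab-separated rows
        in the Hive table's column order."""
        for line in stream:
            row = parse_event(line)
            if not row['ts']:
                continue                      # drop malformed events
            out.write('\t'.join(row[col] for col in HIVE_COLUMNS) + '\n')

    if __name__ == '__main__':
        import io
        sample = io.StringIO('ts=2010-06-29T10:00:00 host=web01 event=play duration_ms=1200\n'
                             'malformed line with no fields\n')
        mapper(stream=sample)                 # prints one tab-separated Hive row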

Online Content Optimization with Hadoop

Amit Phadke, Yahoo!

One of the most interesting problems we work on is providing the most relevant content to our users. This involves tracking our users' interests and mining the ever-changing content pool to see what is relevant and popular for them. There are also content normalization and de-duplication issues to address in order to avoid redundancy. To solve these problems, we make extensive use of the Hadoop technology stack in our systems. Using Hadoop, we are able to scale to build models for millions of items and users in near real time. We leverage HBase for point lookups/stores of these models. We also use Pig for expressing our workflows, so the MapReduce parallelism is abstracted out of the core processing.
Amit Phadke has been working at Yahoo! for the past 3.5 years. Since joining Yahoo!, he has been involved in various aspects of content optimization. Amit and his colleagues have built a scalable modeling infrastructure on top of Hadoop.

Hadoop Frameworks Panel: Pig, Hive, Cascading, Cloudera Desktop, LinkedIn Voldemort, Twitter ElephantBird

Moderator: Sanjay Radia, Yahoo!

A number of frameworks and tools have been built on top of Hadoop to make it easier to write applications and manage Hadoop. The panel consists of experts and developers of such frameworks and tools. The panel members will discuss:
1) The problem space and target audience their specific technology addresses
2) Their plans for future enhancements
3) What is missing in the overall space.

Hadoop for Genomics

Jeremy Bruestle, Spiral Genetics

The field of genomics is of increasing importance to research and medicine. As the physical cost of DNA sequencing continues to drop, biologists are collecting ever larger data sets, requiring more sophisticated data processing. Hadoop is an excellent platform on which to build a consistent set of tools for genomics research. In this talk, we present a general framework for working with genomic data in Hadoop, and provide details on implementations of many common operations, including a novel mechanism for de novo DNA sequence assembly. In addition, we discuss how this open source genomics platform can be leveraged by researchers to reduce repeated effort and increase collaboration.

Hadoop Customer Panel: Amazon Elastic MapReduce

Moderator: Deepak Singh, Amazon Web Services

Many Amazon Web Services customers leverage Hadoop inside Amazon Elastic MapReduce to solve problems ranging from mining clickstream data for targeted advertising to scientific applications. In this panel, Amazon Web Services customers will:
1. Discuss a diverse set of use cases where Hadoop is being applied today
2. Talk about the enterprise readiness of Hadoop
3. Talk about how Amazon Elastic MapReduce addresses some of the key challenges of running Hadoop in a production environment
4. Identify features and solutions that will lead to wider adoption

Parallel Distributed Image Stacking and Mosaicing with Hadoop

Keith Wiley, University of Washington

In the coming decade, astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. With a requirement that these images be analyzed in real time to identify moving sources such as potentially hazardous asteroids or transient objects such as supernovae, these data streams present many computational challenges. In this talk, we report on our experience implementing a scalable image-processing pipeline for the SDSS database using Hadoop. This multi-terabyte imaging dataset provides a good testbed for algorithm development, since its scope and structure approximate future surveys. Our pipeline performs two primary functions: stacking and mosaicing, in which multiple partially overlapping images are registered, integrated and stitched into a single overarching image. We will first present our initial implementation, then describe two critical optimizations that have enabled us to achieve high performance. The first optimization addresses the fact that the individual image files are quite small relative to the optimal Hadoop block size, while at the same time they incur significant job startup cost due to their large quantity. We improved Hadoop's handling of this data by converting the database to a set of Hadoop sequence files, each comprising numerous image files. This optimization decreased the proportion of total job time spent on input-path processing at startup from 42% to 2%, with an overall job-time decrease of 260%. The second optimization involves the development of a hybrid relational-database/Hadoop system. In the initial implementation, many image files were considered but discarded at the mapper stage. By preceding the Hadoop job with a SQL-based metadata query, we can eliminate the same files from consideration before running the MapReduce job. We will report on the performance gains delivered by this pre-filtering step.
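
A minimal Python sketch of the kind of SQL-based metadata pre-filter described above, using SQLite as a stand-in for the relational metadata store; the table name, columns, and bounding-box query are illustrative assumptions, not the pipeline's actual schema.

    import sqlite3

    def select_input_paths(conn, ra_lo, ra_hi, dec_lo, dec_hi):
        """Return only the HDFS paths of images whose sky footprint overlaps the
        requested bounding box, so the MapReduce job never has to open and then
        discard the rest."""
        rows = conn.execute(
            'SELECT hdfs_path FROM image_metadata '
            'WHERE ra_max >= ? AND ra_min <= ? AND dec_max >= ? AND dec_min <= ?',
            (ra_lo, ra_hi, dec_lo, dec_hi)).fetchall()
        return [path for (path,) in rows]

    if __name__ == '__main__':
        # Tiny in-memory stand-in for the image metadata database.
        conn = sqlite3.connect(':memory:')
        conn.execute('CREATE TABLE image_metadata '
                     '(hdfs_path TEXT, ra_min REAL, ra_max REAL, dec_min REAL, dec_max REAL)')
        conn.executemany('INSERT INTO image_metadata VALUES (?, ?, ?, ?, ?)',
                         [('/data/img_001.fits', 10.0, 10.5, -1.0, -0.5),
                          ('/data/img_002.fits', 200.0, 200.5, 30.0, 30.5)])
        # Only img_001 overlaps this region; it becomes the job's input list.
        print(select_input_paths(conn, 9.0, 11.0, -2.0, 0.0))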
