| Time | Agenda | ||
| 08:30-09:00 | Registration and Coffee Sponsored by IBM |
09:00-10:15 | Big Data and the Power of Hadoop [ video ]Blake Irving, Executive Vice President and Chief Products Officer, Yahoo! |
Hadoop and The Future of Internet Scale Cloud Computing [ video ]Shelton Shugar, Senior Vice President, Cloud Computing, Yahoo! |
Scaling Hadoop [ video ]Eric Baldeschwieler, Vice President, Hadoop Software Development, Yahoo! |
|||
| 10:15-10:30 | Coffee Break Sponsored by IBM |
||
| 10:30-11:00 | Making Hadoop Enterprise Ready with Amazon Elastic
MapReduce [ video ]Peter Sirota, General Manager, Elastic Map Reduce |
||
| 11:00-11:30 | Hadoop Grows Up [ video ]
Doug Cutting, Cloudera |
||
| 11:30-12:00 | Inside Large-Scale Analytics at Facebook [ video ]
Mike Schroepfer, VP of Engineering, Facebook |
||
| 12:00-1:30 | Lunch Break Co-Sponsored by Amazon Web Services, Cloudera, Datameer, Karmasphere |
||
| Developers Track | Applications Track | Research Track | |
| 1:30-2:00 | Hadoop Security in Detail ![]() Owen O'Malley, Yahoo! |
Disruptive Applications with Hadoop ![]() Rod Smith, VP, IBM Emerging Internet Technologies |
Design Patterns for Efficient Graph Algorithms
in MapReduce ![]() Jimmy Lin, Michael Schatz, University of Maryland |
| 2:00-2:30 | Hive integration: HBase and Rcfile ![]() John Sichi and Yongqiang He, Facebook |
ZettaVox: Content Mining and Analysis Across
Heterogeneous Compute Clouds ![]() Mark Davis, Kitenga |
Mining Billion-node Graphs: Patterns,
Generators and Tools ![]() Christos Faloutsos, Carnegie Mellon University |
| 2:30-3:00 | Hadoop and Pig at Twitter ![]() Kevin Weil, Twitter |
Biometric Databases and Hadoop ![]() Jason Trost, Abel Sussman and Lalit Kapoor, Booz Allen Hamilton |
XXL Graph Algorithms ![]() Sergei Vassilvitskii, Yahoo! Labs |
| 3:00-3:30 | Developer's Most Frequent Hadoop Headaches &
How to Address Them ![]() Shevek Mankin, Karmasphere |
Hadoop - Integration Patterns and Practices ![]() Eric Sammer, Cloudera |
Efficient Parallel Set-Similarity Joins Using
Hadoop ![]() Chen Li, University of California, Irvine |
| 3:30-4:00 | Coffee Break Sponsored by IBM |
||
| 4:00-4:30 | Workflow on Hadoop Using Oozie ![]() Alejandro Abdelnur, Yahoo! |
Winning the Big Data SPAM Challenge ![]() Stefan Groschupf, Datameer; Florian Leibert, Erich Nachbar |
Exact Inference in Bayesian Networks using
MapReduce ![]() Alex Kozlov, Cloudera |
| 4:30-5:00 | Cascalog: an Interactive Query Language for
Hadoop ![]() Nathan Marz, BackType |
Data Applications and Infrastructure at
LinkedIn ![]() Jay Kreps, LinkedIn |
Hadoop for Scientific Workloads ![]() Lavanya Ramakrishnan, Lawrence Berkeley National Lab |
| 5:00-5:30 | Honu - A Large Scale Streaming Data Collection
and Processing Pipeline ![]() Jerome Boulon, Netflix |
Online Content Optimization with Hadoop ![]() Amit Phadke, Yahoo! |
Hadoop for Genomics ![]() Jeremy Bruestle, Spiral Genetics |
| 5:30-6:15 | Hadoop Frameworks Panel:
Pig, Hive, Cascading, Cloudera Desktop, LinkedIn Voldemort,
Twitter ElephantBird ![]() Moderator: Sanjay Radia, Yahoo! |
Hadoop Customer Panel: Amazon Elastic
MapReduce Moderator: Deepak Singh, Amazon Web Services |
Parallel Distributed Image Stacking and
Mosaicing with Hadoop ![]() Keith Wiley, University of Washington |
| 6:30-8:30 | Evening Reception Sponsored by LightSpeed | ||
Abstracts
Big Data and the Power of Hadoop
Blake Irving, Executive Vice President and Chief Products Officer, Yahoo!
Behind every click across Yahoo!, there is a global cloud computing infrastructure, heavily reliant on Hadoop. We see the magic in crunching unimaginable volumes of data to create increasingly relevant Internet experiences for people around the world – and Hadoop is the technology behind that magic.
Blake Irving is Executive Vice President and Chief Products Officer at Yahoo!. Irving leads Yahoo!'s Products organization, which is responsible for the vision, strategy, design and development of Yahoo!'s global consumer and advertiser product portfolio. Irving is focused on building unique and highly personal experiences for Yahoo!'s consumers, delivering on Yahoo!'s promise of Science, Art and Scale to its advertisers, and continuing to deliver more and faster innovations to the market.
Hadoop and The Future of Internet Scale Cloud Computing
Shelton Shugar, Senior Vice President, Cloud Computing, Yahoo!
Looking to the future, Yahoo!’s vision for the cloud is closely intertwined with open source computing. As implementer of the world’s largest Hadoop deployment, Yahoo! serves as the canary in the coal mine signaling emerging trends and challenges that the industry will face as Hadoop adoption continues on its sharp increase.
Shelton Shugar is Senior Vice President of Cloud Computing at Yahoo!. Today nearly all of Yahoo!’s consumer and advertiser experiences rely on the cloud. Yahoo!’s cloud stores and processes Yahoo!’s massive amounts of data and content, delivers it in a reliable and scalable way, making it accessible to users around the world on a variety of platforms and form factors.
Scaling Hadoop
Eric Baldeschwieler, Vice President, Hadoop Software Development, Yahoo!
At Yahoo!, we’re engineering for global scale, and constantly expanding the frontiers of Hadoop. Hear how we’re scaling Hadoop to handle larger clusters, growing application complexity and evolving business needs.
Eric Baldeschwieler is Vice President of Hadoop at Yahoo!. Eric leads the development team responsible for designing and deploying Hadoop across Yahoo!’s global infrastructure, the world’s largest implementation of the technology. A team of Yahoo!’s, including Eric and Doug Cutting, initiated the Apache Hadoop Project in 2005, and have taken Hadoop from a 20-node prototype to a 35,000-node production deployment.
Making Hadoop Enterprise Ready with Amazon Elastic MapReduce
Peter Sirota, General Manager, Elastic Map Reduce
As the demand for cloud-based analysis of large data sets explodes, customers of Amazon Web Services have wanted to leverage Hadoop in the cloud. Last year we introduced Amazon Elastic MapReduce, which manages the complex and time-consuming set-up, management and tuning of both Hadoop clusters and the compute capacity upon which they sit. A user can instantly spin up large Hadoop job flows and begin processing within minutes.
Over the last year AWS has worked with current users to develop new features that make it even easier to execute Hadoop applications in the cloud. In this session, we will review lessons learned from those current users, discuss recent improvements to the Amazon Elastic MapReduce, and look at key features coming in the near future. In addition, we will discuss developments in the ecosystem in which exciting offerings developed on top of Amazon Elastic MapReduce have made it an even more compelling solution for enterprise Big Data analytics.
Peter Sirota is the General Manager of Amazon Elastic MapReduce. Before starting Amazon Elastic MapReduce, Peter served as Sr. Manager of Software Development at Amazon Web Services leading billing, authentication, and portal teams and was responsible for launching Amazon DevPay service. Prior to Amazon.com, he was managing various development organizations at RealNetworks. Prior to RealNetworks Peter worked at Microsoft. Peter holds a bachelor's degree in computer science from Northeastern University.
Inside Large-Scale Analytics at Facebook
Mike Schroepfer, VP of Engineering, Facebook
Facebook has one of the largest Hadoop clusters in the world. Come hear
from Mike Schroepfer, Facebook's VP of Engineering, about how the
social networking service uses Hadoop and Hive to analyze data at
scale. The talk will cover how the company deploys the technology and
what products utilize the power of Hadoop and Hive.
Mike Schroepfer is the Vice President of Engineering at Facebook. Mike
is responsible for harnessing the engineering organization's culture of
speed, creativity and exploration to build products, services and
infrastructure that support the company's users, developers and
partners around the world. Before coming to Facebook, Mike was the Vice
President of Engineering at Mozilla Corporation, where he led the
global, collaborative, open and participatory product development
process behind Mozilla's popular software, such as the Firefox web
browser. Mike was formerly a distinguished engineer at Sun Microsystems
where he was the Chief Technology Officer for the data center
automation division ("N1"). He was also the founder, Chief Architect
and Director of Engineering at CenterRun, which was acquired by Sun.
Mike worked with several startups at the outset of his career,
including a digital effects software startup where he built software
that has been used in several major motion pictures. Mike holds a
bachelor's degree and a master's degrees in computer science from
Stanford University and has filed two U.S. patents.
Hadoop Grows Up
Doug Cutting, Cloudera
Hadoop project founder Doug Cutting will give us his unique insight and
perspective into how the project has evolved over the last 4 years as
new sub-projects have grown up and out. He will then give us a glance
at how he sees things changing going forward as the data management
industry more broadly embraces Hadoop for information and analysis.
Doug is the creator of numerous successful open source projects,
including Lucene, Nutch and Hadoop. Doug joined Cloudera in 2009 from
Yahoo!, where he was a key member of the team that built and deployed a
production Hadoop storage and analysis cluster for mission-critical
business analytics. Doug holds a Bachelor's degree from Stanford
University and sits on the Board of the Apache Software Foundation.
Disruptive Applications with Hadoop
Rod Smith, VP, IBM Emerging Internet Technologies
Finding the insights you need from data is like finding the proverbial
needle in the haystack, but today the haystack is bigger than ever.
Whether it's on the Web or in their datacenters, businesses know that
to keep pace with their competition, they need to be able to analyze
massive amounts of information quickly and in a cost effective way.
That's where Hadoop comes in. As the world is learning to cope with the
explosive growth of data, Hadoop is emerging as a disruptive technology
that can power new classes of big data applications while integrating
with existing middleware infrastructures; applications like monitoring
massive datasets, computationally intensive jobs for evidence-based
medicine, analysis for fraud detection, and more.
It's Hadoop that's making these next generation solutions feasible,
many of which will provide a disruptive advantage for those that
implement them. For example; your competition might wait for weeks to
get insights from 40 terabyte of data, while you get your results in a
day. Or imagine seeing a simple tag-cloud that represents the data from
millions of PDFs without manually sifting through them. What about
seeing how people on Twitter feel about your latest press release - is
it good or bad? Getting information like this, efficiently and at a low
cost, can put you in the lead.
In this presentation you'll hear how IBM is using Hadoop as a platform
for applications and see a demonstration of how you can harness the
power of this disruptive technology to help leverage the value of big
data and big data analytics.
Rod Smith is an IBM fellow and Vice President of the IBM
Emerging Internet Technologies organization, where he leads a group of
highly technical innovators who are developing solutions to help
businesses realize the value of Web 2.0 to the enterprise. In his many
years in the industry Rod has moved IBM and the IT community to a
rapid adoption of technologies such as Web services, enterprise
mashups, XML, Linux, J2EE, rich internet applications, and various
wireless standards. As an IBM Fellow, Rod is helping lead IBM's
strategic planning around Web 2.0 technologies and practices, with a
focus on how these technologies can bring real business value to IBM's
customers
Hadoop Security in Detail
Owen O'Malley, Yahoo!
Interested in knowing what the Hadoop Security release comprises and how to leverage it to solve business challenges. The talk will primarily cover the user/developer impact and a peek at the implementation
Owen O'Malley is a software architect on Hadoop working for Yahoo's Grid team, which is part of the Cloud Computing & Data Infrastructure group. He has been contributing patches to Hadoop since before it was separated from Nutch, and is the chair of the Hadoop Project Management Committee. Before working on Hadoop, he worked on Yahoo Search's WebMap project, which builds a graph of the known web and applies many heuristics to the entire graph that control search. Prior to Yahoo, he wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), and software model checking (NASA). He received his PhD in Software Engineering from University of California, Irvine
Hive integration: HBase and Rcfile
John Sichi and Yongqiang He, Facebook
John Sichi will discuss Facebook's recent integration of two related projects in the Hadoop ecosystem: HBase and Hive. This integration gives powerful SQL query capabilities to HBase, and brings the potential for low-latency incremental data refresh to Hive. The talk will go over performance results from initial testing of the integration. Yongqiang will discuss RCFile, which is a columnar storage for Hive. It is already deployed within Facebook, and we are in the process of converting old partitions to RCFile. Depending on the data layout, it has resulted in ~20% space savings.
TopZettaVox: Content Mining and Analysis Across Heterogeneous Compute Clouds
Mark Davis, Kitenga
ZettaVox is an enterprise content mining application that combines
crawling, extraction, monitoring, and analysis in a unified solution.
Working from a sophisticated Flex-based user interface, ZettaVox users
can visually author complex Hadoop workflows that end in the
visualization of results sets using out-of-the-box tools for biological
informatics, natural language analysis, automatic classification, and
related tasks. In this talk, Kitenga CTO and ZettaVox designer Mark
Davis will demonstrate ZettaVox capabilities on clusters and using
cloud-based data resources. He will further show how ZettaVox can
extend the reach of cluster-based computing solutions built on Hadoop
to include commodity supercomputers based on graphical processing units
(GPUs) that implement MapReduce formalisms.
Kitenga is a growing start-up focused on content mining solutions for
big data problems. Kitenga is currently supported by the Office of the
Secretary of Defense, Office of Naval Research, and US Army, with the
goal of creating massive-scale information exploitation capabilities
for government and commercial customers.
Mining Billion-node Graphs: Patterns,Generators and Tools
Christos Faloutsos, Carnegie Mellon University
What do graphs look like? How do they evolve over time? How to handle a graph with a billion nodes? We present a comprehensive list of static and temporal laws, and some recent observations on real graphs (like, e.g., ``eigenSpokes''). For generators, we describe some recent ones, which naturally match all of the known properties of real graphs. Finally, for tools, we present ``oddBall'' for discovering anomalies and patterns, as well as an overview of the PEGASUS system which is designed for handling Billion-node graphs, running on top of the ``hadoop'' system.
TopHadoop and Pig at Twitter
Kevin Weil,Twitter
Massive growth in the size of business datasets leads many companies to Hadoop, an emerging architecture for parallel data processing. However, the migration path can be challenging, in part because MapReduce analyses use programming languages like Java and Python rather than SQL. Apache Pig is a high-level framework built on top of Hadoop that offers a powerful yet vastly simplified way to analyze data in Hadoop. It allows businesses to leverage the power of Hadoop in a simple language readily learnable by anyone that understands SQL. In this presentation, I will introduce Pig and show how it's been used at Twitter to solve numerous analytics challenges that became intractable with our former MySQL-based architecture.
TopDesign Patterns for Efficient Graph Algorithms in MapReduce
Jimmy Lin, Michael Schatz, University of Maryland
Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the world wide web, module detection of protein-protein interaction networks, and privacy analysis of social networks. MapReduce provides an enabling technology for large-scale graph processing. However, there appears to be a paucity of knowledge on designing scalable graph algorithms. Existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serializing, and distributing the graph. We present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
TopBiometric Databases and Hadoop
Jason Trost, Abel Sussman and Lalit Kapoor, Booz Allen Hamilton
Abstract. Over the next few years Biometric databases for the Federal
Bureau of Investigations (FBI) , Department of State (DoS) , Department
of Defense (DoD) and the Department of Homeland Security (DHS) are
expected to grow to accommodate hundreds of millions, if not billions,
of identities. As an example, the DHS IDENT database is currently at 110
million identities and enrolling or verifying over 125,000 per day.
Associated with each identity in these databases are several biometric
modalities including fingerprint scans, palm prints, face images, and
iris scans . This corresponds to tens of petabytes of biometric
material in relational database storage [1]. In the near future the
systems supported by these databases are expected to perform more
accurate matches in ever shortening amounts of time.
Biometric databases are typically used for two main applications:
Identification and Verification. Identification is determining the
identity of an individual given a biometric measurement (one-to-many
search), and Verification is determining whether an individual is who
they claim to be (one-to-one matching). With traditional vertical
scaling models, the Identification problem can become prohibitively
expensive or infeasible as the number of modality records grows into
the billions and the amount of data grows into the petabyte scale.
Investments by Relational Data Base Management System (RDBMS) vendors
in providing scalable multi-petabyte solutions have not yet produced
concrete results to meet these needs.
As these biometric systems grow, distributed computing platforms such
as Hadoop/MapReduce may be a feasible solution to this problem. In this
presentation we outline the magnitude of the problem, and evaluate
algorithms and solutions using Hadoop and MapReduce. We also discuss
open source biometric algorithms including BOZORTH3 (fingerprint
matching) and IrisCode (Iris Scan matching), and how they can be
optimized for deployment over Hadoop/MapReduce. Techniques for reducing
the search space and indexing of Biometric databases already exist in
academia and we discuss how these techniques benefit from the parallel
algorithms included in Mahout, such as K-means clustering.
XXL Graph Algorithms
Sergei Vassilvitskii, Yahoo! Labs
The MapReduce framework is now a de facto standard for massive dataset computations. However, many of the elementary graph algorithms are inherently sequential and appear to be hard to parallelize (often requiring number of rounds proportional to the diameter of the graph). In this talk we describe a different approach, called filtering, to implementing fundamental graph algorithms, like computing connected components and minimum spanning trees. We note that filtering can also be applied to speed up general clustering algorithms like k-means. Finally, we describe how to apply the technique to find tight-knit friend groups in a social network.
TopDeveloper's Most Frequent Hadoop Headaches & How to Address Them
Shevek Mankin, Karmasphere
In this interactive session, we will discuss and present solutions to the challenges most frequently encountered during development and deployment of MapReduce applications. The session content reflects the result of scouring customer mailing lists, forums and user comments for the most common application development problems and questions. We will prepare you for things like debugging a combiner or reducer in the middle of a workflow, diagnosing performance bottlenecks, integrating third party libraries into a Hadoop job, or developing on a Windows workstation. From addressing common prototyping challenges to coping effectively with Hadoop updates and upgrades, this session will appeal to those recently starting with Hadoop as well as seasoned experts. Bring your development questions and be ready to walk away with practical answers to the most frequently encountered problems with developing MapReduce and Hadoop applications -- and some uncommon ones you may hit as well.
TopHadoop - Integration Patterns and Practices
Eric Sammer, Cloudera
Hadoop is a powerful platform for data analysis and processing, but
many struggle to understand how it fits in with regard to existing
infrastructure and systems. A series of common integration points,
technologies, and patterns are defined and illustrated in this
presentation.
We take a look at job initiation, sequencing and scheduling, data input
from various sources (e.g. DBMS, messaging systems), and data output to
various sinks (DBMS, messaging systems, caching systems). Attendees
will see how integration patterns and best practices can be applied to
Hadoop and its related projects.
This talk is focused on the suitability and architecture of these
integration patterns. Care is taken to not duplicate talks on specific
tools that are likely to be covered by other talks.
Workflow on Hadoop Using Oozie
Alejandro Abdelnur, Yahoo!
Oozie v1 is a PDL workflow server engine for Hadoop that enables creating workflow jobs composed of several map-reduce jobs, pig jobs, HDFS operations, java processes. Workflow jobs are monitored as single unit via web-services, a Java API and/or a web console. Oozie v1 is in production in Yahoo, and it became the standard way for doing computations that require the coordination of multiple Hadoop/Pig jobs.
Oozie v2 is currently entering production. Among its key new features: are time based activation of workflow jobs (crontab like), and data activation of workflow jobs (start a workflow job when input data becomes available).
Oozie v1 has been already open sourced and Oozie v2 is being open sourced, (both) at GitHub. During this talk we'll go over the steps to get Oozie up and running.
Alejandro has been working at Yahoo since 2006. Since joining Yahoo he has been involved in enterprise systems built on top of Hadoop. Alejandro has designed and partially implemented Oozie. He is also a contributor to Apache Hadoop. Before joining Yahoo, Alejandro worked at Google, Sun Microsystems and Sybase Argentina.
Efficient Parallel Set-Similarity Joins Using Hadoop
Chen Li, University of California, Irvine
A set-similarity join (SSJ ) finds pairs of set-based records such that each pair is similar enough based on a similarity function and a threshold. Many applications require efficient SSJ solutions, such as record linkage and plagiarism detection. We study how to efficiently perform SSJs on large data sets using Hadoop. We propose a 3-stage approach to the problem. We efficiently partition the data across nodes in order to balance the workload and minimize the need for replication. We study both self-join and R-S join cases, and show how to carefully control the amount of data kept in main memory on each node. We report results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop. The results of the research including its publication and source code are available at http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/
TopWinning the Big Data SPAM Challenge
Stefan Groschupf, Datameer; Florian Leibert, Erich Nachbar
Worldwide spam volumes this year are forecast to rise by 30 to 40%
compared with 2009. Spam recently reached a record 92% of total email.
Spammers have turned their attention to social media sites as well.
While in 2008 there were few Facebook phishing messages, Facebook is
now the second most phished organization online. Even though Twitter
has managed to recently bring its spam rate down to as low as 1%,
the absolute volume of spam is still massive given its tens of
millions of users.
Dealing with spam introduces a number of Big Data challenges. The sheer
size and scale of the data is enormous. In addition, spam in social
media involves the need to understand very complex patterns of behavior
as well as to identify new types of spam. This presentation will
provide an in-depth understanding of these issues as well as discuss
and demonstrate how data analytics built on Hadoop can help businesses
keep spam from spiraling out of control.
Exact Inference in Bayesian Networks using MapReduce
Alex Kozlov, Cloudera
Probabilistic inference is a way of obtaining values of unobservable
variables out of incomplete data. Probabilistic inference is used in
robotics, medical diagnostic, image recognition, finance and other
fields. One of the tools for inference and a way to represent knowledge
is "Bayesian Network", where nodes represent variables and edges
represent probabilistic dependencies between variables. The advantage
of exact probabilistic inference using BN is that it does not involve
the traditional 'Gaussian distribution' assumptions and the results are
immune to Taleb's distributions, or distributions with a high
probability of outliers.
A typical application of probabilistic inference is to infer the
probability of one or several dependent variables, like the probability
that a person has a certain disease, given other observations, like
presence of abdominal pain. In exact probabilistic inference, variables
are clustered in groups, called cliques, and probabilistic inference
can be carried out by manipulating more or less complex data structures
on top of the cliques, which leads to high computational and space
complexity of the inference: the data structures can become very
complex and large. The advantage: one can encode arbitrarily complex
distributions and dependencies.
While a lot of research has been devoted to devising schemes to
approximate the solution, Hadoop allows performing exact inference on
the whole network. We present an approach for performing large-scale
probabilistic inference in probabilistic networks in a Hadoop cluster.
Probabilistic inference is reduced to a number of MR jobs over the data
structures representing clique potentials. One of the applications is
the CPCS BN, one of the biggest models created at Stanford Medical
Informatics Center (now The Stanford Center for Biomedical Informatics
Research) in 1994, never solved exactly. In this specific network
containing 422 nodes representing states of different variables, 14
nodes describe diseases, 33 nodes describe history and risk factors,
and the remaining 375 nodes describe various findings related to the
diseases.
Cascalog: an Interactive Query Language for Hadoop
Nathan Marz, BackType
Cascalog is an interactive query language for Hadoop with a focus on
simplicity, expressiveness, and flexibility intended to be used by
Analysts and Developers alike.
Cascalog eschews the SQL syntax for a simpler and more expressive
syntax based on Datalog. With this added expressiveness, Cascalog can
query existing data stores "out of the box" with no required data
"importing" or "under the hood" configuration necessary. Because
Cascalog sits on top of Clojure, a powerful JVM based language and
interactive shell, adding new operations to a query is as simple as
defining a new function. Cascalog relies on Cascading, a robust data
processing API, for defining and running workflows.
In this presentation, I will introduce Cascalog and how it's used at
BackType. I will further show how the Datalog syntax provides more
robustness and flexibility than SQL based languages. Finally, I'll
demonstrate how the Cascalog, Clojure, and Cascading stack can be
leveraged by advanced users who wish to build more complex queries and
libraries in Java and Clojure for data processing, data mining, and
machine learning.
Data Applications and Infrastructure at LinkedIn
Jay Kreps, LinkedIn
LinkedIn runs a number of large-scale Hadoop calculations to power its features--from computing similar profiles, jobs, and companies to predicting People You May Know recommendations to help our users find their professional connections. This talk will cover how Hadoop fits into a production data cycle for a consumer-scale social network--including some of the technology, infrastructure, and algorithms for calculating tens of billions of predictions in a social graph. This includes how we use our open source workflow scheduler Azkaban; how we use Pig and other open source technologies; and how we handle daily, large, parallel data deployments, and how we do real-time serving and processing using our key-value storage system, Project Voldemort
TopHadoop for Scientific Workloads
Lavanya Ramakrishnan, Lawrence Berkeley National Lab
Scientific computing is composed of bulk-synchronous compute and data-intensive applications. Magellan, a recently funded cloud computing project, is investigating how cloud computing can serve the needs of mid-range computing and future data-intensive scientific workloads. The growth of data and the need for more resources has led to the scientific community exploring cloud as resource platform. Scientific users are interested in using MapReduce and Hadoop for managing the scientific computations and data. However, the requirements of these scientific applications are significantly different from the web 2.0 applications that have traditionally used Hadoop. We are evaluating the use of Hadoop and related technologies for use in support of scientific applications. In this presentation, we will a) outline the science requirements and discuss the use of Hadoop and related technologies such as HBASE b) present a performance comparison of a bioinformatics application using Hadoop on commercial cloud platforms such as Amazon EC2, Yahoo! M45 with a high performance computing system c) present experiences and performance results from local Hadoop and HBASE installation with different file system and scheduling configurations specifically suited for scientific applications.
TopHonu - A Large Scale Streaming Data Collection and Processing Pipeline
Jerome Boulon, Netflix
Netflix moved a large portion of their infrastructure to the cloud in
order to meet our reliability, scalability and availability
requirements. As we solved our reliability, scalability and
availability requirements on the application side, we open a new range
of challenging problem on the log analytics side. As the number of
instances running in the cloud increase, standard way of moving log
files or loading log events to a database starts saturating the system.
As the volume explodes, the latency/thruput of the system becomes so
slow that it was not usable for our operational/BI needs.
Honu is the new streaming log collection and processing pipeline in the
cloud for Netflix and leverages the computational power of Hadoop/Hive
to solve our log analytics requirements.
Honu solution includes three main components: A client side SDK, A
collector, server side component to receive all logs events and A
Map/Reduce to process/parse those logs and save them in Hive format.
The client side SDK provides all the classes you need in order to
generate/collect your unstructured and structured log events on your
application. Honu collectors are a key component and are responsible
for continuously saving log Events to HDFS. The Map/Reduce is
responsible for parsing log events and storing them in their
corresponding Hive table format.
The presentation will explain how we are using those components in
order to generate, collect and process all the unstructured and
structured log events and make them available to our end-user through
Hive.
Honu will be released under the standard Apache license and will be
hosted for now in gitHub. Honu has been running in production at
Netflix for more than 3 months and has some specific integrations to
leverage the Amazon EC2, S3 and EMR infrastructure.
Online Content Optimization with Hadoop
Amit Phadke, Yahoo!
One of the most interesting problems we work on is to provide the most relevant content to our users. This involves being able to track what are the interests of our users; mining the ever-changing content pool to see what is relevant, popular for our users. There is also content normalizing and de-duping issues to avoid redundancy. To solve all these problems, we make extensive use of Hadoop technology stack in our systems. Using Hadoop, we are able to scale to build models for millions of items, and users in near-real time. We leverage HBase for point lookups/stores of these models. We also use Pig for phrasing our workflows so the map-reduce parallelism is abstracted out of core processing.
Amit Phadke has been working at Yahoo! Since past 3.5 years. Since joining Yahoo!, he has been involved in various aspects of content optimization. Amit and his colleagues have built a scalable modeling infrastructure on top of Hadoop.
Hadoop Frameworks Panel: Pig, Hive, Cascading, Cloudera Desktop, LinkedIn Voldemort, Twitter ElephantBird
Moderator: Sanjay Radia, Yahoo!
A number of frameworks and tools have been built on top of Hadoop to make it easier to write applications and manage Hadoop.
The panel members consists of experts/developers of such frameworks and tools. The panel members will discuss:
1) The problem space and target audience their specific technology addresses
2) Their plans for the future enhancements
3) What is missing in the overall space.
Hadoop for Genomics
Jeremy Bruestle, Spiral Genetics
The field of genomics is of increasing importance to research and medicine. As the physical cost of DNA sequencing continues to drop, biologists are collecting ever larger data sets, requiring more sophisticated data processing. Hadoop is an excellent platform on which to build a consistent set of tools for genomics research. In this talk, we present a general framework for working with genomic data in Hadoop, and provide details on implementations for many common operations, including a novel mechanism for de novo DNA sequence assembly. In addition, we discuss how this open source genomics platform can be leveraged by researchers to reduce repeated effort and increase collaboration
TopHadoop Customer Panel: Amazon Elastic MapReduce
Moderator: Deepak Singh, Amazon Web Services
Many Amazon Web Services customers leverage Hadoop inside Amazon
Elastic MapReduce, to solve problems ranging from mining clickstream
data for targeted advertising, to scientific applications. In this
panel, Amazon Web Services customers will:
1. Discuss a diverse set of use cases where Hadoop is being applied today
2. Talk about the enterprise readiness of Hadoop
3. Talk about how Amazon Elastic MapReduce addresses some of the key challenges of running Hadoop in a production environment
4. Identify features and solutions that will lead to wider adoption
Parallel Distributed Image Stacking and Mosaicing with Hadoop
Keith Wiley, University of Washington
In the coming decade, astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. With a requirement that these images be analyzed in real time to identify moving sources such as potentially hazardous asteroids or transient objects such as supernovae, these data streams present many computational challenges. In this talk, we report on our experience implementing a scalable image-processing pipeline for the SDSS database using Hadoop. This multi-Terabyte imaging dataset provides a good testbed for algorithm development since its scope and structure approximate future surveys. Our pipeline performs two primary functions: stacking and mosaicing, in which multiple partially overlapping images are registered, integrated and stitched into a single overarching image. We will first present our initial implementation, then describe two critical optimizations that have enabled us to achieve high performance. The first optimization addresses the fact that the individual image files are quite small relative to the optimal Hadoop block size, while at the same time they incur significant job startup cost due to their large quantity. We improved Hadoop's handling of this data by converting the database to a set of Hadoop sequences files, each file comprising numerous image files. This optimization decreased the proportion of total jobtime spent on input-path processing at startup from 42% to 2% with an overall jobtime decrease of 260%. The second optimization involves the development of a hybrid relational-database/Hadoop system. In the initial implementation, many image files were considered but discarded at the mapper stage. By prepending the Hadoop job with a SQL-based metadata query, we can eliminate the same files from consideration before running the MapReduce job. We will report on the performance gains delivered by this pre-filtering step.
Top


