Blog Posts by Yahoo! Developer Network

  • Hadoop2010: Efficient Parallel Set-Similarity Joins

    A set-similarity join (SSJ) finds pairs of set-based records such that each pair is similar enough based on a similarity function and a threshold. Many applications require efficient SSJ solutions, such as record linkage and plagiarism detection. This talk studies how to efficiently perform SSJs on large data sets using Hadoop. It proposes a three-stage approach that partitions the data across nodes to balance the workload and minimize the need for replication. It reports results from extensive experiments on real datasets, synthetically increased in size, to evaluate the speedup and scaleup properties of the proposed algorithms using Hadoop.
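    The similarity test at the heart of an SSJ can be sketched on a single machine before considering how to distribute it. The following is a minimal Python sketch using Jaccard similarity; the helper names and sample data are illustrative, not taken from the talk, and the naive all-pairs loop is exactly the cost that the talk's Hadoop-based partitioning aims to distribute:

    ```python
    def jaccard(a, b):
        """Jaccard similarity of two sets: |a & b| / |a | b|."""
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)

    def set_similarity_join(records, threshold):
        """Return all pairs of record ids whose sets are at least
        `threshold`-similar. A naive O(n^2) single-machine baseline;
        the talk is about distributing this work with Hadoop."""
        ids = list(records)
        pairs = []
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                if jaccard(records[ids[i]], records[ids[j]]) >= threshold:
                    pairs.append((ids[i], ids[j]))
        return pairs

    records = {
        "r1": {"data", "mining", "hadoop"},
        "r2": {"data", "mining", "spark"},
        "r3": {"cooking", "recipes"},
    }
    print(set_similarity_join(records, 0.5))  # [('r1', 'r2')]
    ```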

    Media Production by BAYCAT, a non-profit community media producer that educates and employs underserved youth and adults in the digital media arts.

  • Hadoop2010: Integration Patterns & Practices

    Hadoop is a powerful platform for data analysis and processing, but many struggle to understand how it fits in with existing infrastructure and systems. A series of common integration points, technologies, and patterns are defined and illustrated in this presentation. Eric Sammer looks at job initiation, sequencing and scheduling, data input from various sources (e.g., DBMS, messaging systems), and data output to various sinks (DBMS, messaging systems, caching systems). You will see how integration patterns and best practices can be applied to Hadoop and its related projects. This talk focuses on the suitability and architecture of these integration patterns; care is taken not to duplicate material on specific tools likely to be covered in other talks.

  • Hadoop2010: Online Content Optimization

    One of the most interesting problems we work on at Yahoo! is providing the most relevant content to our users. This involves tracking our users' interests and mining the ever-changing content pool to find what is relevant and popular for them. There are also content normalization and de-duplication issues to address to avoid redundancy. To solve these problems, we make extensive use of the Hadoop technology stack in our systems. Using Hadoop, we are able to scale to build models for millions of items and users in near-real time. We leverage HBase for point lookups and stores of these models, and we use Pig to phrase our workflows so that map-reduce parallelism is abstracted out of the core processing.

  • Hadoop2010: Cascalog Query Language

    Cascalog is an interactive query language for Hadoop with a focus on simplicity, expressiveness, and flexibility, intended to be used by analysts and developers alike. Cascalog eschews SQL syntax for a simpler, more expressive syntax based on Datalog. With this added expressiveness, Cascalog can query existing data stores "out of the box" with no data "importing" or "under the hood" configuration necessary. Because Cascalog sits on top of Clojure, a powerful JVM-based language and interactive shell, adding new operations to a query is as simple as defining a new function. Cascalog relies on Cascading, a robust data-processing API, for defining and running workflows.

  • Hadoop2010: Data Apps & Infrastructure at LinkedIn

    LinkedIn runs a number of large-scale Hadoop calculations to power its features — from computing similar profiles, jobs, and companies, to predicting People You May Know recommendations to help users find their professional connections. This talk covers how Hadoop fits into a production data cycle for a consumer-scale social network, including some of the technology, infrastructure, and algorithms for calculating tens of billions of predictions in a social graph.

  • Hadoop2010: Parallel Image Stacking

    Keith Wiley, University of Washington, talks about parallel distributed image stacking and mosaicing with Hadoop, and reports on his experience implementing a scalable image-processing pipeline for the SDSS database using Hadoop. This multi-terabyte imaging dataset provides a good testbed for algorithm development, since its scope and structure approximate future surveys. His pipeline performs two primary functions, stacking and mosaicing, in which multiple partially overlapping images are registered, integrated, and stitched into a single overarching image. He discusses two implementations, the latter of which prepends a SQL-based metadata query to the Hadoop job, eliminating files from consideration before the MapReduce job runs.

  • Hadoop2010: XXL Graph Algorithms

    The MapReduce framework is now a de facto standard for massive dataset computations. However, many elementary graph algorithms are inherently sequential and appear to be hard to parallelize, often requiring a number of rounds proportional to the diameter of the graph. In this talk, Sergei Vassilvitskii, Yahoo! Labs, describes a different approach, called filtering, to implementing fundamental graph algorithms, like computing connected components and minimum spanning trees. He notes that filtering can also be applied to speed up general clustering algorithms like k-means. Finally, he describes how to apply the technique to find tight-knit friend groups in a social network.
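    As a point of reference for what the filtering approach parallelizes, here is a minimal single-machine union-find baseline for connected components. This is the sequential computation, not the talk's MapReduce formulation, and the names and sample graph are illustrative:

    ```python
    def connected_components(n, edges):
        """Return a component label for each of n vertices,
        using union-find with path halving."""
        parent = list(range(n))

        def find(x):
            # Follow parent pointers to the root, halving the path as we go.
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv  # merge the two components

        return [find(x) for x in range(n)]

    # Vertices 0, 1, 2 end up in one component; 3 and 4 in another.
    labels = connected_components(5, [(0, 1), (1, 2), (3, 4)])
    ```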

  • Hadoop2010: Hadoop for Genomics

    The field of genomics is of increasing importance to research and medicine. As the physical cost of DNA sequencing continues to drop, biologists are collecting ever larger data sets, requiring more sophisticated data processing. Hadoop is an excellent platform on which to build a consistent set of tools for genomics research. In this talk, Jeremy presents a general framework for working with genomic data in Hadoop, and provides details on implementations for many common operations, including a novel mechanism for de novo DNA sequence assembly. He discusses how this open source genomics platform can be leveraged by researchers to reduce repeated effort and increase collaboration.

  • Hadoop2010: Biometric Databases and Hadoop

    Over the next few years, biometric databases for the Federal Bureau of Investigation (FBI), Department of State (DoS), Department of Defense (DoD), and Department of Homeland Security (DHS) are expected to grow to accommodate hundreds of millions, if not billions, of identities. As these biometric systems grow, distributed computing platforms such as Hadoop/MapReduce may be a feasible solution to this potential problem. This presentation outlines the magnitude of the problem and evaluates algorithms and solutions using Hadoop and MapReduce. It also discusses open source biometric algorithms, including BOZORTH3 (fingerprint matching) and IrisCode (iris-scan matching), and how they can be optimized for deployment over Hadoop/MapReduce.

  • Hadoop2010: Addressing Hadoop Headaches

    In this session, Shevek discusses and presents solutions to the challenges most frequently encountered during development and deployment of MapReduce applications. The session content reflects the result of scouring customer mailing lists, forums, and user comments for the most common application development problems and questions. Be ready to walk away with practical answers to the most frequently encountered problems with developing MapReduce and Hadoop applications -- and some uncommon ones you may hit as well.

