• This is the beginning of an ongoing series of blog posts on “Managing Big Data”. This series will focus on techniques that Yahoo uses to process large volumes of data, ranging from initial collection of data to the end usage of that data.

    Introduction

    Over the last several years, two important trends have emerged that require additional thought when putting together an architecture for a hosted service. At Yahoo!, the ability to analyze and process enormous amounts of data is increasingly important. It’s a foundational layer for improving our consumer experiences and for sharing audience insights with advertisers.

    From a technology perspective, the two trends I'd like to focus on are:

    1. Batch processing – the increasing awareness of batch processing and the recent uptick in use of the map/reduce paradigm for that purpose.

    2. NoSQL stores – the rise of so-called "NoSQL" stores and their use to serve up data to online users (typically inside of the user's request/response

    Read More »from Managing Big Data: Architectural Approaches for making batch data available online
  • Hadoop and the fight against shape-shifting spam

    At a recent Hadoop User Group meeting, I gave a presentation on how we leverage Hadoop for spam mitigation in Yahoo! Mail. A number of people followed up requesting additional details of our architecture and engineering strategy.

    In this post, I am going to try to capture our antispam engineering story: how it came to be Hadoop-centric and how well the new architecture has worked. I will also highlight the results we have been able to achieve. Finally, I will provide an update on when we will be releasing these updates to wide production.

    At the Hadoop User Group presentation, I delved into the details of two interesting antispam algorithms. The first was "frequent itemset mining"; the second was what we called the "connected components" algorithm. Both of these algorithms are implemented as part of our tools portfolio. They are used by engineers, product managers and operations analysts to get a compact summary of the major trends in spam. Both these tools were implemented as

    Read More »from Hadoop and the fight against shape-shifting spam
  • At Yahoo!, the ability to analyze and process enormous amounts of data is increasingly important. It’s a foundational layer for improving our consumer experiences and for sharing audience insights with advertisers.

    In the last few years, I have been a part of a project to design, build, and run a low-latency, large-scale, distributed event data collection system at Yahoo!. When we started off, the goal seemed relatively unambitious: to collect web-access event data from all of the web servers in all of the data centers and bring it to a central location for processing. This perception soon changed after we realized that this involved around 20,000 machines and over 20 data centers across the world, amounting to over 40 billion events per day that filled up over 10 TB of disk space. To add to the mix, the data had to be available within 15 minutes with an expected completeness of 99% across trans-oceanic fiber optic cable.
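
    To put those figures in perspective, the following is a rough back-of-envelope sketch of the average rates they imply. It assumes decimal terabytes and a load spread evenly over the day and over all machines; neither assumption comes from the post, the function and variable names are illustrative only, and because the quoted figures are "over" and "around" values, the results are approximate lower bounds rather than measured numbers.

```python
# Back-of-envelope averages implied by the figures quoted in the post.
# Assumptions (not from the post): decimal terabytes (1 TB = 10**12 bytes)
# and load spread evenly over the day and over all machines.

EVENTS_PER_DAY = 40_000_000_000   # "over 40 billion events per day"
BYTES_PER_DAY = 10 * 10**12       # "over 10 TB of disk space"
MACHINES = 20_000                 # "around 20,000 machines"
SECONDS_PER_DAY = 24 * 60 * 60


def daily_rates(events_per_day, bytes_per_day, machines):
    """Return average per-second and per-event figures as a dict."""
    events_per_sec = events_per_day / SECONDS_PER_DAY
    return {
        "events_per_sec": events_per_sec,
        "bytes_per_event": bytes_per_day / events_per_day,
        "events_per_sec_per_machine": events_per_sec / machines,
        "megabytes_per_sec": bytes_per_day / SECONDS_PER_DAY / 10**6,
    }


rates = daily_rates(EVENTS_PER_DAY, BYTES_PER_DAY, MACHINES)
for name, value in rates.items():
    print(f"{name}: {value:,.1f}")
```

    Under these assumptions the system averages roughly 463,000 events per second, about 250 bytes per event, about 23 events per second per machine, and about 116 MB per second in aggregate. Real traffic is bursty, so peak rates would be higher than these averages.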

    We decided to collect the data in a streaming fashion.

    Read More »from Enabling Hadoop Batch Processing Systems to Consume Streaming Data
  • Hadoop Summit 2010 – Agenda is available!

    I’m happy to share the agenda for the upcoming Hadoop Summit – June 29th, Hyatt, Santa Clara.

    We received over 70 great submissions for talks. It was a very impressive combination of development tool overviews, application case studies and innovative research.

    We had the difficult task of selecting just a handful of presentations from this overwhelming collection of great quality abstracts and speakers. The variety of topics across numerous industry verticals served as clear evidence of how far this technology has evolved over the past year. Hadoop is really going mainstream!

    Our goal was to create a diverse agenda that covers topics for experienced Hadoop users as well as people who recently began to explore this technology. We wanted to focus on the Hadoop ecosystem of tools and solutions as well as “real life” user experience.

    I want to thank all the people who submitted talks and encourage speakers who were not selected to submit their great presentations to our

    Read More »from Hadoop Summit 2010 – Agenda is available!
  • Hi Hadoopers

    Thanks to the nearly 300 developers who came to Yahoo! this week for our monthly Hadoop User Group meeting. The energy in the packed room was phenomenal and conversations continued long after the formal sessions.

    Hundreds of Hadoop Fans Flock to Yahoo! for the May Hadoop User Group

    A few lucky winners received free tickets to the upcoming Hadoop Summit 2010 (June 29th, at the Hyatt Regency, Santa Clara). Congratulations to those winners – everyone else, please register here.

    The event started with Alan Gates from Yahoo!, who described the new features and work done in Pig 0.6 and 0.7, including Hadoop’s compatibility plan, described in more detail in this post.

     

    Nathan Marz from BackType presented a cool demo of how easy it is to query existing data stores using Cascalog, a query language for Hadoop. Nathan described how queries can be written as regular Clojure code and combined with Cascading. Be sure to watch the demo as part of the video below.

    Read More »from Pig, Cascalog & HBase Among Highlights of May Hadoop Meet-Up
  • Towards Enterprise-Class Compatibility for Apache Hadoop

    At Yahoo!, the first users of Apache Hadoop were researchers developing new algorithms or manually sifting through huge data sets. These users threw away most of their code after a few weeks or months, and the little code they carried forward was not subject to rigorous quality procedures. Thus, these early users cared more about new features and scalability improvements in Hadoop than they did about backward compatibility.

    This early focus on bigger-and-better helped Hadoop become the powerful platform it is today. However, over the years, both inside and outside of Yahoo!, Hadoop has increasingly been used to run large, long-lived, enterprise-class applications. Porting these applications to incompatible upgrades of Hadoop is an arduous, expensive task that distracts teams from finding new and better ways of using Hadoop to bring value to their companies. Today, Hadoop users are demanding backward compatibility and interface stability; these features are necessary for the

    Read More »from Towards Enterprise-Class Compatibility for Apache Hadoop
  • Scalability of the Hadoop Distributed File System

    In his short story "The Library of Babel", Jorge Luis Borges describes a vast storage universe composed of all possible manuscripts uniformly formatted as 410-page books. Most are random, meaningless sequences of symbols. But the rest excitingly form a complete and indestructible knowledge system, which stores any text written in the past or to be written in the future, thus providing solutions to all problems in the world. Just find the right book.

    The same characteristic fascinates us in modern storage growth: the aggregation of information directly leads to proportional growth of new knowledge discovered out of it. A skeptic may doubt that further reward in knowledge mining will justify the effort in information aggregation. What if by building sophisticated storage systems we are chasing the 19th century’s horse manure problem, when at the dawn of the automobile era the scientific world was preoccupied with the growth of the horse population that threatened to bury the

    Read More »from Scalability of the Hadoop Distributed File System
  • Hi Hadoopers

    Thanks to the more than 250 developers who came to Yahoo! tonight for our monthly Hadoop User Group meeting. With Facebook's F8 developer conference and the downpour of April showers competing for attention, it was nice to see such a turnout.

    A few lucky winners received free tickets to the upcoming Hadoop Summit 2010 (June 29th, at the Hyatt Regency, Santa Clara). Congratulations to those winners – everyone else, please register here.

    The event started with Vishwanath Ramarao, Director of anti-spam engineering for Yahoo! Mail. Vish described the intricate cat-and-mouse games played with spammers, and how Yahoo! uses Hadoop to abstract away the complexity of large scale data analysis and provide deep insight into spammer campaigns.

     

    Next was a presentation from John Sichi, lead engineer for Facebook's data infrastructure team. John provided an overview of Facebook's recent integration between Hadoop, HBase and Hive and the motivation for it: "Data, data, and more data".

     

    We

    Read More »from Hundreds of Hadoop Fans Flock to Yahoo! for the April Hadoop User Group Meet-Up
  • At Yahoo!, grids running Hadoop have attracted a wide range of applications from a diverse set of functional groups. Not only is the workload submitted by each group distinct from the others in the cluster, but its profile also changes as users gain experience with Hadoop. Users tune their jobs to consume available resources; some may circumvent the assumptions and fairness control mechanisms of the MapReduce scheduler, at the expense of more timid workloads.

    Tracking, modeling, and mimicking this adaptive, complex workload is a prerequisite to effective performance engineering.

    A Brief History of Gridmix

    Until recently, most of our work has been based on the existing, de facto benchmark for Hadoop, Gridmix (and its enhancement, Gridmix2). Gridmix consists of the following parts:

    • A data generation script that must be run once to generate data needed for running the actual benchmark.
    • Several types of jobs in the "mix". The particular jobs varied slightly between Gridmix and
    Read More »from Gridmix3 – Emulating Production Workload for Apache Hadoop
  • Hadoop Bay Area User Group – April 21st at Yahoo!

    Hi Hadoopers,

    I'm excited to invite you to the next Hadoop Bay Area User Group, April 21st, 6PM at the Yahoo! Sunnyvale Campus.

    The Hadoop community is growing at a considerable rate. We had more than 200 attendees at the last meetup.
    It was our largest turnout ever and people stuck around until 9 pm.

    We invite you to attend whether you are an active submitter, are developing Hadoop-based applications, or are completely new to the Apache Hadoop world. In addition to interesting presentations, you will enjoy food, beer and great networking.

    We have a diverse plan for this event, consisting of three sessions:

    • Vishwanath Ramarao, Director of Engineering for Yahoo! Mail, will discuss the extensive use of Hadoop to detect and mitigate spam attacks. In this exciting session titled "Yokai and the Elephant", Vishwanath will describe the cat-and-mouse fight against insidious spammers and the considerable impact of Hadoop in areas of abuse and security.

       

    • John Sichi will discuss Facebook's recent
    Read More »from Hadoop Bay Area User Group – April 21st at Yahoo!
