Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing

At Yahoo!, Hadoop plays a central role in providing personalized experiences for our users and creating value for our advertisers. To serve Yahoo!’s emerging business needs, the Cloud Engineering Group is working on a next generation platform that enables the convergence of big-data and low-latency processing.

Figure 1. Personalization based on User Interests


Yahoo! is enhancing its web properties and mobile applications to provide users personalized experiences based on interest profiles. To compute user interests, we process billions of events from our more than 700 million users and analyze 2.2 billion content items every day. Since users' interests change over time, we need to update user profiles to reflect their current interests. Figure 1 illustrates a conceptual architecture in which low-latency processing and batch processing are combined to update users' interest profiles for personalization.
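The split in Figure 1 can be sketched as a toy two-layer update: a batch layer recomputes a base profile from full event history, while a low-latency layer folds in fresh events as they arrive. The function and field names below (`batch_profile`, `"topic"`, etc.) are hypothetical illustrations, not part of any Yahoo! system.

```python
from collections import Counter

# Batch layer: periodically recompute a base interest profile
# from the full event history (the batch-processing role in Figure 1).
def batch_profile(events):
    """Aggregate interest counts across all historical events."""
    profile = Counter()
    for event in events:
        profile[event["topic"]] += 1
    return profile

# Low-latency layer: fold fresh events into the profile as they
# arrive, so recent interests are reflected immediately.
def apply_event(profile, event):
    profile[event["topic"]] += 1
    return profile

# Serving: the current profile is the batch base plus live deltas.
history = [{"topic": "sports"}, {"topic": "sports"}, {"topic": "finance"}]
profile = batch_profile(history)
profile = apply_event(profile, {"topic": "sports"})
top_interest = profile.most_common(1)[0][0]
print(top_interest)  # "sports"
```

The design point is that neither layer alone suffices: batch recomputation keeps profiles accurate over the full history, while incremental updates keep them current between batch runs.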

Figure 2. Convergence of batch and low-latency processing


Enabling low-latency big-data processing is one of the primary design goals of Yahoo!’s next-generation big-data platform. While MapReduce is a key design pattern for batch processing, additional design patterns will be supported over time. Stream/micro-batch processing is one of the design patterns applicable to many Yahoo! use cases. In Q1 2013, we added Storm as a new service to our big-data platform. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for stream/micro-batch processing. As illustrated in Figure 2, our big-data platform enables Hadoop applications and Storm applications to share data via shared storage such as HBase.
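The general primitives Storm provides are spouts (sources that emit tuples) and bolts (operators that process them). The following is a toy sketch of that spout-to-bolt pattern with micro-batching; it deliberately does not use the real Storm API, and the names `event_spout` and `count_bolt` are hypothetical.

```python
def event_spout(events):
    """Emit events one at a time, in the spirit of a Storm spout."""
    for event in events:
        yield event

def count_bolt(stream, batch_size=3):
    """Process the stream in micro-batches, emitting a partial
    result per batch, in the spirit of a Storm bolt."""
    batch, results = [], []
    for tup in stream:
        batch.append(tup)
        if len(batch) == batch_size:
            results.append({"count": len(batch), "items": list(batch)})
            batch.clear()
    if batch:  # flush the final partial batch
        results.append({"count": len(batch), "items": list(batch)})
    return results

batches = count_bolt(event_spout(["a", "b", "c", "d"]))
print([b["count"] for b in batches])  # [3, 1]
```

In a real topology, the batch results would be written to shared storage such as HBase, where batch Hadoop jobs can read them, which is the convergence point shown in Figure 2.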

Yahoo! engineering teams are developing technologies to enable Storm applications and Hadoop applications to be hosted on a single cluster.

  • We have enhanced Storm to support Hadoop-style security mechanisms (including Kerberos authentication), and thus enable Storm applications to access Hadoop datasets stored in our secure HDFS and HBase clusters.
  • Storm is being integrated with Hadoop YARN for resource management. Storm-on-YARN enables Storm applications to utilize the computational resources across our tens of thousands of Hadoop compute nodes. YARN is used to launch the Storm application master (Nimbus) on demand, and enables Nimbus to request resources for Storm application slaves (Supervisors).
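The Storm-on-YARN launch flow described above can be simulated with a small toy model: the cluster grants containers up to its capacity, and the application master (Nimbus) requests one container per Supervisor. All class and method names here (`YarnCluster`, `request_supervisors`, etc.) are hypothetical illustrations, not the YARN or Storm APIs.

```python
class YarnCluster:
    """Toy stand-in for YARN's resource manager."""
    def __init__(self, capacity):
        self.capacity = capacity   # total containers available
        self.allocated = 0

    def allocate(self, n):
        """Grant up to n containers, bounded by remaining capacity."""
        granted = min(n, self.capacity - self.allocated)
        self.allocated += granted
        return granted

class Nimbus:
    """Toy application master: asks the cluster for Supervisor containers."""
    def __init__(self, cluster):
        self.cluster = cluster
        self.supervisors = 0

    def request_supervisors(self, n):
        self.supervisors += self.cluster.allocate(n)
        return self.supervisors

cluster = YarnCluster(capacity=4)
nimbus = Nimbus(cluster)              # YARN launches Nimbus on demand
print(nimbus.request_supervisors(3))  # 3
print(nimbus.request_supervisors(3))  # 4 (capped at cluster capacity)
```

The point of the on-demand model is elasticity: Supervisors are acquired (and can be released) as containers, rather than being statically assigned to a dedicated Storm cluster.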

Yahoo! is committed to working with the open source community on big-data processing. To enable the convergence of low-latency and big-data processing, Yahoo! is making our contributions to Storm and YARN available as open source. Additional details on these efforts will be shared at our HUG meetup in April 2013 and at Hadoop Summit North America in June 2013.

------------

Andy Feng is a Distinguished Architect at Yahoo! and a Core Contributor to the Storm project. He led the architecture design and development of Yahoo!’s next-generation big-data platform, which supports a variety of application patterns (batch, micro-batch, streaming, and query).