By Bobby Evans and Andy Feng (@afeng76)
At Yahoo! we have worked on the convergence of Storm with Hadoop, as mentioned in our earlier post. We are pleased to announce that Storm-YARN has been released as open source. Storm-YARN enables Storm applications to use the computational resources of a Hadoop cluster and to access Hadoop storage resources such as HBase and HDFS.
Collocating real-time processing with batch processing offers a number of advantages over segregated clusters.
- It provides a huge potential for elasticity. Real-time processing will rarely produce a constant and predictable load. As such, Storm needs more resources to keep up with spikes in demand. Collocating Storm with batch processing allows Storm to steal resources from batch jobs when needed and give them back when demand subsides. The Storm-YARN effort lays the groundwork to make this possible.
- Many applications use Storm for low-latency processing and Map/Reduce for batch processing while sharing data between Storm and Map/Reduce. By placing Storm physically closer to the data source and/or other components in the same pipeline we can reduce network transfers and in turn the total cost of acquiring the data.
Launch Storm Cluster
To launch a Storm cluster managed by YARN, you simply execute:
storm-yarn launch <storm-yarn.yaml>
storm-yarn.yaml is the standard Storm configuration file extended with YARN-specific parameters, including master.initial-num-supervisors (the initial number of supervisors to be launched) and master.container.size-mb (the memory size of the container to be allocated for each supervisor).
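As a rough sketch, the YARN-specific part of such a file might look like the fragment below. The two keys come from the description above; the values are illustrative assumptions only, not recommended or default settings, and a real file would also carry the usual Storm settings.

```yaml
# storm-yarn.yaml (illustrative fragment; values are assumptions, not defaults)

# Number of supervisors to launch when the cluster starts.
master.initial-num-supervisors: 2

# Memory, in MB, of the YARN container requested for each supervisor.
master.container.size-mb: 2048
```

With these example values, the Application Master would ask YARN for two containers of 2048 MB each, one per supervisor.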
Figure 1 illustrates the execution of the storm-yarn command. Storm-YARN asks YARN’s Resource Manager to launch a Storm Application Master. The Application Master then launches a Storm nimbus server and a Storm UI server locally. It also uses YARN to find resources for the supervisors and launch them.
Figure 1: Launch Storm Cluster with Hadoop YARN
Execute Storm Topologies
You can communicate with the Storm cluster in the same way as with a standalone Storm cluster, for example:
storm jar <topology_jar>
Because nimbus is running on a node picked by YARN, you may need to specify that node on the command line by setting the
As illustrated in Figure 2, each Storm supervisor launches worker processes within its container. These worker processes can access Hadoop datasets stored in HDFS, HBase, etc.
Figure 2: Submit and Execute Storm Topologies
Open Source Release
Yahoo! has decided to release Storm-YARN code under the Apache 2.0 License. The code is available at https://github.com/yahoo/storm-yarn. This alpha release enables members of the community to jointly make Storm-YARN a high-quality product. Please try it out and let us know what you think.
If you are interested in contributing, please feel free to submit proposals as issues, sign an Apache-style CLA, and contribute your code.
Additional details on Storm-YARN will be shared during our Storm-on-YARN: Convergence of Low-Latency and Big-Data talk at the 2013 Hadoop Summit North America on June 26, 2013, 11:20 am under the Future of Apache Hadoop track. We look forward to seeing you there.
Derek Dagit implemented a significant portion of the Storm-YARN release. We thank him for making this early release available as open source.
Bobby Evans is a software developer at Yahoo! and a Hadoop PMC member at the Apache Software Foundation.
Andy Feng is a Distinguished Architect at Yahoo! and a Core Contributor of the Storm project. He leads the architecture design and development of the next-generation big-data platform, which supports a variety of application patterns (Batch, Microbatch, Streaming, Query).