CaffeOnSpark Open Sourced for Distributed Deep Learning on Big Data Clusters
<p><b></b><i>By Andy Feng (@afeng76), Jun Shi and Mridul Jain (@mridul_jain), Yahoo Big ML Team</i><br/></p><hr><p><b>Introduction</b><br/></p><p>Deep learning (DL) is a critical capability required by Yahoo product teams (e.g., <a href="https://www.youtube.com/watch?v=uUvLxmsHpr4&list=PLnDbcXCpYZ8lCKExMs8k4PtIbani9ESX3&index=7" target="_blank">Flickr</a>, Image Search) to gain intelligence from massive amounts of online data. Many existing DL frameworks require a separate cluster for deep learning, and multiple programs have to be created for a typical machine learning pipeline (see Figure 1). These separate clusters require large datasets to be transferred between them, and introduce unwanted system complexity and latency for end-to-end learning.</p><figure data-orig-width="5427" data-orig-height="2124" class="tmblr-full"><img src="https://66.media.tumblr.com/b21c37c0fb9269699fbeb3c68d1b9242/tumblr_inline_o2vnzoXL1b1t17fny_540.jpg" alt="image" data-orig-width="5427" data-orig-height="2124"/></figure><p><b></b></p><p><b>Figure 1: </b>ML Pipeline with multiple programs on separate clusters<br/></p><p><br/></p><p>As discussed in our earlier <a href="http://yahoohadoop.tumblr.com/post/129872361846/large-scale-distributed-deep-learning-on-hadoop" target="_blank">Tumblr post</a>, we believe that deep learning should be conducted in the same cluster as existing data processing pipelines to support feature engineering and traditional (non-deep) machine learning. We created CaffeOnSpark to allow deep learning training and testing to be embedded into Spark applications (see Figure 2). 
</p><figure data-orig-width="3166" data-orig-height="2114" class="tmblr-full"><img src="https://66.media.tumblr.com/d4fd5b7a14bd282e09f79b1c2af436e6/tumblr_inline_o2vo44cIg51t17fny_540.jpg" alt="image" data-orig-width="3166" data-orig-height="2114"/></figure><p><b></b></p><p><b>Figure 2:</b> ML Pipeline with single program on one cluster</p><p><br/></p><p><b>CaffeOnSpark: API & Configuration and CLI</b><br/></p><p><b></b></p><p>CaffeOnSpark is designed to be a Spark deep learning package. <a href="http://spark.apache.org/docs/latest/mllib-guide.html" target="_blank">Spark MLlib</a> supports a variety of non-deep learning algorithms for classification, regression, clustering, recommendation, and so on. Deep learning is a key capability that Spark MLlib currently lacks, and CaffeOnSpark is designed to fill that gap. The <a href="http://yahoo.github.io/CaffeOnSpark" target="_blank">CaffeOnSpark API</a> supports <a href="http://spark.apache.org/docs/latest/sql-programming-guide.html" target="_blank">dataframes</a> so that you can easily interface with a training dataset that was prepared using a Spark application, and extract the predictions from the model, or features from intermediate layers, for results and data analysis using MLlib or SQL.</p><figure data-orig-width="3385" data-orig-height="1916" class="tmblr-full"><img src="https://66.media.tumblr.com/ca004ff94918dca10d533cd118c8a914/tumblr_inline_o2vo8lJDh81t17fny_540.jpg" alt="image" data-orig-width="3385" data-orig-height="1916"/></figure><p><b>Figure 3: </b>CaffeOnSpark as a Spark Deep Learning package<br/></p><p><br/></p><blockquote><p> 1: def main(args: Array[String]): Unit = {</p><p> 2: val ctx = new SparkContext(new SparkConf())</p><p> 3: val cos = new <a href="http://yahoo.github.io/CaffeOnSpark/scala_doc/#com.yahoo.ml.caffe.CaffeOnSpark" target="_blank">CaffeOnSpark</a>(ctx)</p><p> 4: val conf = new <a href="http://yahoo.github.io/CaffeOnSpark/scala_doc/#com.yahoo.ml.caffe.Config" target="_blank">Config</a>(ctx, 
args).init()</p><p> 5: val dl_train_source = <a href="http://yahoo.github.io/CaffeOnSpark/scala_doc/#com.yahoo.ml.caffe.DataSource%24" target="_blank">DataSource.getSource</a>(conf, true)</p><p> 6: cos.<a href="http://yahoo.github.io/CaffeOnSpark/scala_doc/#com.yahoo.ml.caffe.CaffeOnSpark" target="_blank">train</a>(dl_train_source)</p><p> 7: val lr_raw_source = <a href="http://yahoo.github.io/CaffeOnSpark/scala_doc/#com.yahoo.ml.caffe.DataSource%24" target="_blank">DataSource.getSource</a>(conf, false)</p><p> 8: val extracted_df = cos.<a href="http://yahoo.github.io/CaffeOnSpark/scala_doc/#com.yahoo.ml.caffe.CaffeOnSpark" target="_blank">features</a>(lr_raw_source)</p><p> 9: val lr_input_df = extracted_df.withColumn("Label", cos.floatarray2doubleUDF(extracted_df(conf.label)))</p><p>10: .withColumn("Feature", cos.floatarray2doublevectorUDF(extracted_df(conf.features(0))))</p><p>11: val lr = new LogisticRegression().setLabelCol("Label").setFeaturesCol("Feature")</p><p>12: val lr_model = lr.fit(lr_input_df)</p><p>13: lr_model.write.overwrite().save(conf.outputPath)</p><p>14: }</p></blockquote><p><b></b></p><p><b>Figure 4: </b><a href="https://github.com/yahoo/CaffeOnSpark/blob/master/caffe-grid/src/main/scala/com/yahoo/ml/caffe/examples/MyMLPipeline.scala" target="_blank">Scala application</a> using both CaffeOnSpark and MLlib<br/></p><p><br/></p><p>The Scala program in Figure 4 illustrates how CaffeOnSpark and MLlib work together:</p><ul><li>L1-L4 … You initialize a Spark context, and use it to create a CaffeOnSpark instance and a configuration object.<br/></li><li>L5-L6 … You use CaffeOnSpark to conduct DNN training with a training dataset on HDFS.<br/></li><li>L7-L8 … 
The learned DL model is applied to extract features from a feature dataset on HDFS.<br/></li><li>L9-L12 … MLlib uses the extracted features to perform non-deep learning (here, logistic regression for classification).<br/></li><li>L13 … You could save the classification model to HDFS.</li></ul><p><b></b></p><p>As illustrated in Figure 4, CaffeOnSpark enables deep learning steps to be seamlessly embedded in Spark applications. It eliminates the unwanted data movement of traditional solutions (as illustrated in Figure 1), and enables deep learning to be conducted on big-data clusters directly. Direct access to big data and massive computational power are critical for DL to find meaningful insights in a timely manner.</p><p>CaffeOnSpark uses the same configuration files for solvers and neural networks as standard Caffe. As illustrated in our <a href="https://github.com/yahoo/CaffeOnSpark/blob/master/data/lenet_memory_train_test.prototxt" target="_blank">example</a>, the neural network will have a MemoryData layer with 2 extra parameters:</p><ol><li><b>source_class</b> specifying a data source class<br/></li><li><b>source</b> specifying the dataset location.<br/></li></ol><p>The initial CaffeOnSpark release has several built-in data source classes (including <a href="http://yahoo.github.io/CaffeOnSpark/scala_doc/#com.yahoo.ml.caffe.LMDB" target="_blank">com.yahoo.ml.caffe.LMDB</a> for <a href="http://symas.com/mdb/doc/" target="_blank">LMDB</a> databases and <a href="http://yahoo.github.io/CaffeOnSpark/scala_doc/#com.yahoo.ml.caffe.SeqImageDataSource" target="_blank">com.yahoo.ml.caffe.SeqImageDataSource</a> for <a href="http://wiki.apache.org/hadoop/SequenceFile" target="_blank">Hadoop sequence files</a>). Users can easily introduce custom data source classes to interact with existing data formats. </p><p><b></b></p><p>CaffeOnSpark applications are launched by standard Spark commands, such as spark-submit. 
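To make the two extra MemoryData parameters concrete, here is a sketch of what such a layer could look like in a network prototxt. The layer name, dataset path, and dimensions below are illustrative only; see the linked lenet_memory_train_test.prototxt for an authoritative version.

```protobuf
layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  # Extra parameter 1: the data source class (built-in LMDB reader here)
  source_class: "com.yahoo.ml.caffe.LMDB"
  memory_data_param {
    # Extra parameter 2: the dataset location (illustrative HDFS path)
    source: "hdfs:///projects/mnist_train_lmdb"
    batch_size: 64
    channels: 1
    height: 28
    width: 28
  }
}
```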
Here are two examples of spark-submit commands. The first command uses CaffeOnSpark to train a DNN model and save it onto HDFS. The second command launches a custom Spark application that embeds CaffeOnSpark along with MLlib.</p><p>First command:<br/></p><blockquote><p><i>spark-submit \<br/> --files caffenet_train_solver.prototxt,caffenet_train_net.prototxt \<br/> --num-executors 2 \<br/> --class com.yahoo.ml.caffe.CaffeOnSpark \<br/> caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \<br/> -train -persistent \<br/> -conf caffenet_train_solver.prototxt \<br/> -model hdfs:///sample_images.model \<br/> -devices 2</i><br/><br/></p></blockquote><p>Second command:<br/></p><p><b></b></p><blockquote><p><i>spark-submit \<br/> --files caffenet_train_solver.prototxt,caffenet_train_net.prototxt \<br/> --num-executors 2 \<br/> --class com.yahoo.ml.caffe.examples.MyMLPipeline \ </i></p><p><i>caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \</i></p><p><i> -features fc8 \<br/> -label label \<br/> -conf caffenet_train_solver.prototxt \<br/> -model hdfs:///sample_images.model \<br/> -output hdfs:///image_classifier_model \<br/> -devices 2</i></p></blockquote><p><b><br/></b></p><p><b>System Architecture</b><br/></p><figure data-orig-width="5572" data-orig-height="4135" class="tmblr-full"><img src="https://66.media.tumblr.com/7ccf1da404389778a4014340abd7dcb3/tumblr_inline_o2voaoqngI1t17fny_540.jpg" alt="image" data-orig-width="5572" data-orig-height="4135"/></figure><p><b>Figure 5: </b>System Architecture<br/></p><p><br/></p><p>Figure 5 describes the system architecture of CaffeOnSpark. We launch Caffe engines on GPU or CPU devices within the Spark executors by invoking a JNI layer with fine-grained memory management. Unlike traditional Spark applications, CaffeOnSpark executors communicate with each other via an MPI allreduce-style interface over TCP/Ethernet or RDMA/Infiniband. 
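The allreduce idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not CaffeOnSpark's actual implementation (which runs in native code reached via JNI): each executor contributes its local gradient, the gradients are summed, and every executor receives the identical result so all model replicas stay in sync.

```python
def allreduce_sum(peer_grads):
    """Sum the gradient vectors of all peers; every peer gets the full sum.

    peer_grads: one gradient list per executor, all of equal length.
    Returns one (identical) summed gradient list per executor.
    """
    total = [0.0] * len(peer_grads[0])
    for grad in peer_grads:          # "reduce" phase: accumulate across peers
        for i, g in enumerate(grad):
            total[i] += g
    # "broadcast" phase: every peer receives its own copy of the result
    return [list(total) for _ in peer_grads]

# Two executors, each holding local gradients for the same two parameters.
local_grads = [[1.0, 2.0], [3.0, 4.0]]
synced = allreduce_sum(local_grads)
# After allreduce, both executors hold the identical summed gradient.
assert synced[0] == synced[1] == [4.0, 6.0]
```

In production, real MPI implementations use ring or tree communication patterns to avoid a central bottleneck, which is what makes this exchange efficient over RDMA/Infiniband.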
This Spark+MPI architecture enables CaffeOnSpark to achieve performance similar to that of dedicated deep learning clusters.</p><p>Many deep learning jobs are long running, and it is important to handle potential system failures. CaffeOnSpark allows the training state to be snapshotted periodically, so a job can resume from its previous state after a failure. </p><p><b><br/>Open Source<br/></b></p><p>In the last several quarters, Yahoo has applied CaffeOnSpark on several projects, and we have received much positive feedback from our internal users. Flickr teams, for example, made significant improvements in image recognition accuracy with CaffeOnSpark by training with millions of photos from the Yahoo Webscope <a href="http://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67" target="_blank">Flickr Creative Commons 100M dataset</a> on Hadoop clusters. </p><p>CaffeOnSpark is beneficial to both the deep learning community and the Spark community. In order to advance the fields of deep learning and artificial intelligence, Yahoo is happy to release CaffeOnSpark at <a href="https://github.com/yahoo/CaffeOnSpark" target="_blank">github.com/yahoo/CaffeOnSpark</a> under the Apache 2.0 license. </p><p>CaffeOnSpark can be tested on an <a href="https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_EC2" target="_blank">AWS EC2</a> cloud or on <a href="https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_local" target="_blank">your own Spark clusters</a>. Please find the detailed instructions at the Yahoo <a href="https://github.com/yahoo/CaffeOnSpark" target="_blank">GitHub</a> repository, and share your feedback at <a href="mailto:bigdata@yahoo-inc.com" target="_blank">bigdata@yahoo-inc.com</a>. Our goal is to make CaffeOnSpark widely available to deep learning scientists and researchers, and we welcome contributions from the community to make that happen.</p>