Large Scale Distributed Deep Learning on Hadoop Clusters
<p><i>By Cyprien Noel, Jun Shi and Andy Feng (@afeng76), Yahoo Big ML Team</i></p><hr><p><b>Introduction</b><br/></p><p>Over the last 10 years, Yahoo has progressively invested in building and scaling Apache Hadoop clusters, with a current footprint of more than 40,000 servers and 600 petabytes of storage spread across 19 clusters. As discussed at the <a href="http://www.slideshare.net/Hadoop_Summit/surge-rise-of-scalable-machine-learning-at-yahoo" target="_blank">2015 Hadoop Summit</a>, we have developed scalable machine learning algorithms on these clusters for classification, ranking, and word embedding, based on a home-grown parameter server. Hadoop clusters have now become the preferred platform for large-scale machine learning at Yahoo.</p><figure data-orig-width="2274" data-orig-height="1291" class="tmblr-full"><img src="https://66.media.tumblr.com/c13777e7945f798c781abb6935e75faf/tumblr_inline_nv94pv3Q3V1t17fny_540.jpg" alt="image" data-orig-width="2274" data-orig-height="1291"/></figure><p>Deep learning (DL) is a critical capability demanded by many Yahoo products. At the 2015 RE.WORK Deep Learning Summit, the Yahoo Flickr team (<a href="https://www.youtube.com/watch?v=4URpJS6Ny-o&list=PLnDbcXCpYZ8lCKExMs8k4PtIbani9ESX3&index=16" target="_blank">Simon Osindero</a> and <a href="https://www.youtube.com/watch?v=uUvLxmsHpr4&list=PLnDbcXCpYZ8lCKExMs8k4PtIbani9ESX3&index=7" target="_blank">Pierre Garrigues</a>) explained how deep learning is being applied to scene detection, object recognition, and computational aesthetics. Deep learning empowers Flickr to automatically tag all user photos, enabling Flickr end users to organize and find photos easily.</p><p>To enable more Yahoo products to benefit from the promise of deep learning, we have recently introduced this capability natively into our Hadoop clusters. Deep learning on Hadoop provides the following major benefits:<br/></p><ul><li>Deep learning can be conducted directly on the Hadoop clusters where Yahoo stores most of its data, avoiding unnecessary data movement between Hadoop clusters and separate deep learning clusters.</li><li>Deep learning can be defined as a first-class step in <a href="http://oozie.apache.org" target="_blank">Apache Oozie</a> workflows, alongside Hadoop jobs for data processing and Spark pipelines for machine learning.</li><li>YARN works well for deep learning: multiple deep learning experiments can run concurrently on a single cluster, which makes deep learning far more cost effective than conventional approaches. In the past, teams had to schedule GPU resources manually using a “notepad”, which was painful and worked only for a small number of users.<br/></li></ul><p>Deep learning on Hadoop is a novel approach to deep learning. Existing approaches in the industry require dedicated clusters, whereas deep learning on Hadoop delivers the same level of performance as dedicated clusters while providing all of the benefits listed above.</p><p><b><br/></b></p><p><b>Enhancing Hadoop Clusters</b><br/></p><p>To enable deep learning, we added GPU nodes to our Hadoop clusters (illustrated below). Each of these nodes has 4 <a href="http://www.nvidia.com/object/tesla-servers.html" target="_blank">Nvidia Tesla K80 cards</a>, each card with two GK210 GPUs. These nodes have 10x the processing power of the traditional commodity CPU nodes we generally use in our Hadoop clusters.</p><figure data-orig-width="4074" data-orig-height="2466" class="tmblr-full"><img src="https://66.media.tumblr.com/c3dd41925aa323837d2340ed952312c5/tumblr_inline_nv960xQE071t17fny_540.jpg" alt="image" data-orig-width="4074" data-orig-height="2466"/></figure><p><br/></p><p>In a Hadoop cluster, GPU nodes have two separate network interfaces, Ethernet and InfiniBand. While Ethernet acts as the primary interface for external communication, InfiniBand provides 10x faster connectivity among the GPU nodes in the cluster and supports direct access to GPU memory over RDMA.</p><p>By leveraging YARN’s recently introduced node label capabilities (<a href="https://issues.apache.org/jira/browse/YARN-796" target="_blank">YARN-796</a>), jobs can state whether their containers should be launched on CPU or GPU nodes. Containers on GPU nodes use InfiniBand to exchange data at very high speed.</p>
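<p>For illustration only, the Scala sketch below shows one way a Spark driver could request GPU-labeled YARN nodes. The label name “gpu” and the spark.yarn.executor.nodeLabelExpression property are assumptions made for this example, not necessarily the exact configuration used on our clusters.</p><pre>// Hypothetical sketch: ask YARN to place Spark executors on GPU-labeled nodes.
// The "gpu" label and the nodeLabelExpression property are illustrative
// assumptions; node labels (YARN-796) must be configured on the cluster.
import org.apache.spark.{SparkConf, SparkContext}

object GpuLabeledJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("dl-on-gpu-nodes")
      // Constrain executor containers to nodes carrying the "gpu" label.
      .set("spark.yarn.executor.nodeLabelExpression", "gpu")
    val sc = new SparkContext(conf)
    // ... launch the actual training work from here ...
    sc.stop()
  }
}</pre>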
<p><b><br/></b></p><p><b>Distributed Deep Learning: Caffe-on-Spark</b></p><p>To enable deep learning on these enhanced Hadoop clusters, we developed a comprehensive distributed solution built upon the open source libraries <a href="http://spark.apache.org/" target="_blank">Apache Spark</a> and <a href="http://caffe.berkeleyvision.org/" target="_blank">Caffe</a>. Deep learning jobs can now be submitted to a cluster of GPU nodes with a single command, as illustrated below.</p><pre>spark-submit --master yarn --deploy-mode cluster \
    --files solver.prototxt,net.prototxt \
    --num-executors <# of EXECUTORS> \
    --archives caffe_on_grid.tgz \
    --conf spark.executorEnv.LD_LIBRARY_PATH="./caffe_on_grid.tgz/lib64" \
    --class com.yahoo.ml.CaffeOnSpark caffe-on-spark-1.0-jar-with-dependencies.jar \
      -devices <# of GPUs PER EXECUTOR> \
      -conf solver.prototxt \
      -input hdfs://<TRAINING FILE> \
      -model hdfs://<MODEL FILE></pre><p>In the command above, users specify the number of Spark executor processes to launch (--num-executors), the number of GPUs to allocate to each executor (-devices), the location of the training data on HDFS, and the HDFS path where the model should be saved. Users describe their Caffe solver and deep network topology with standard Caffe configuration files (e.g., solver.prototxt and net.prototxt).</p><figure data-orig-width="4458" data-orig-height="3308" class="tmblr-full"><img src="https://66.media.tumblr.com/c454db08acfa53d5349374938ae76c1e/tumblr_inline_nv96ekzEOO1t17fny_540.jpg" alt="image" data-orig-width="4458" data-orig-height="3308"/></figure><p><br/></p><p>As illustrated above, Spark on YARN launches a number of executors. Each executor is given a partition of the HDFS-based training data and launches multiple Caffe-based training threads, each running on a particular GPU. After back-propagating a batch of training examples, these training threads exchange the gradients of the model parameters. This gradient exchange is carried out in an MPI Allreduce fashion across all GPUs on multiple servers. We have enhanced Caffe to use multiple GPUs within a server and to leverage RDMA for synchronizing DL models across servers.</p><p>Caffe-on-Spark enables us to use the best of Spark and Caffe for large-scale deep learning. DL tasks are launched as easily as any other Spark application, and multiple GPUs in a cluster of machines are used to train models from large HDFS-based datasets.</p>
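<p>For intuition, the simplified Scala sketch below mimics this per-batch gradient exchange with a toy linear model on plain Spark. It is not CaffeOnSpark code: it uses Spark’s treeAggregate in place of the RDMA-backed MPI-style allreduce described above, and a hand-written squared-error gradient in place of Caffe’s back-propagation.</p><pre>// Simplified data-parallel gradient averaging on Spark (toy linear model).
// This is NOT CaffeOnSpark: the real system runs Caffe threads on GPUs and
// exchanges gradients via an MPI-style allreduce over RDMA; here treeAggregate
// merely plays the role of that exchange to show the overall data flow.
import org.apache.spark.{SparkConf, SparkContext}

object GradientAveragingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("gradient-averaging-sketch"))

    // Toy (features, label) pairs, partitioned like the HDFS training data.
    val data = sc.parallelize(Seq(
      (Array(1.0, 0.0), 1.0), (Array(0.0, 1.0), 0.0),
      (Array(1.0, 1.0), 1.0), (Array(0.0, 0.0), 0.0)
    ), numSlices = 4).cache()

    var weights = Array(0.0, 0.0)
    val lr = 0.1

    (1 to 20).foreach { _ =>
      val w = sc.broadcast(weights)
      // Each partition accumulates its local squared-error gradient;
      // the combine step sums partitions, standing in for the allreduce.
      val (gradSum, count) = data.treeAggregate((Array(0.0, 0.0), 0L))(
        seqOp = { case ((g, n), (x, y)) =>
          val err = w.value.zip(x).map { case (wi, xi) => wi * xi }.sum - y
          (g.zip(x).map { case (gi, xi) => gi + err * xi }, n + 1L)
        },
        combOp = { case ((g1, n1), (g2, n2)) =>
          (g1.zip(g2).map { case (a, b) => a + b }, n1 + n2)
        }
      )
      // Every worker would apply the same averaged gradient to its model copy.
      weights = weights.zip(gradSum).map { case (wi, gi) => wi - lr * gi / count }
    }

    println(s"trained weights: ${weights.mkString(", ")}")
    sc.stop()
  }
}</pre>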
<p><b><br/></b></p><p><b>Benchmarks</b><br/></p><p>Caffe-on-Spark enables (a) multiple GPUs and (b) multiple machines to be used for deep learning. To understand the benefits of our approach, we ran benchmarks on the <a href="http://www.image-net.org/challenges/LSVRC/2012/" target="_blank">ImageNet 2012 dataset</a>.</p><p>First, we looked at the training progress of <a href="https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet" target="_blank">AlexNet</a> with 1, 2, 4, and 8 GPUs on a single Spark executor. As illustrated in the diagram below, training time decreases as we add GPUs. With 4 GPUs, we reached 50% accuracy in about 15/43 ≈ 35% of the time required by a single GPU. All of these runs used an identical total batch size of 256. The 8-GPU setup showed no significant improvement over 4 GPUs, because the share of the batch handled by each GPU became too small to use the hardware efficiently.</p><figure data-orig-width="6391" data-orig-height="5291" class="tmblr-full"><img src="https://66.media.tumblr.com/093852143aacab978fd89bbe47655dab/tumblr_inline_nv97bczCzO1t17fny_540.jpg" alt="image" data-orig-width="6391" data-orig-height="5291"/></figure><p>Next, we conducted a distributed benchmark with <a href="https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet" target="_blank">GoogLeNet</a>, which is much deeper and uses more convolutions than AlexNet, and thus requires more computational power. In every run, each GPU handles batches of size 32, for an effective batch size of 32n when n GPUs are used. Our distributed algorithm is designed to produce models and end-result precision equivalent to running on a single GPU. 80% top-5 accuracy (20% error) was reached in 10 hours of training with 4 servers (4x8 GPUs). Note that single-GPU training reached only 60% top-5 accuracy (40% error) after 40 hours.</p><figure data-orig-width="6816" data-orig-height="5149" class="tmblr-full"><img src="https://66.media.tumblr.com/fa859c37c0784269180c5dc9a6122924/tumblr_inline_nv97borTcJ1t17fny_540.jpg" alt="image" data-orig-width="6816" data-orig-height="5149"/></figure><p>GoogLeNet scales further with the number of GPUs. To reach 60% top-5 accuracy (40% error), 8 GPUs achieved a 680% speedup over 1 GPU. The table below also shows the speedups for 70% and 80% top-5 accuracy. The speedup could be even larger with a carefully tuned batch size (instead of a fixed total batch size of 32n).</p><figure data-orig-width="716" data-orig-height="375" class="tmblr-full"><img src="https://66.media.tumblr.com/cc7cf8535ae707c9ca8a6cc5b759ee8d/tumblr_inline_nv96ui5Tod1t17fny_540.jpg" alt="image" data-orig-width="716" data-orig-height="375"/></figure>
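<p>To spell out the batch-size arithmetic, the short Scala sketch below contrasts the two policies used in these benchmarks; the numbers are taken directly from the text above, and the GPU counts shown are only those quoted in the post.</p><pre>// Contrast of the two batch-size policies used in the benchmarks above.
object BatchSizePolicies {
  def main(args: Array[String]): Unit = {
    // AlexNet runs: fixed total batch of 256 split across the GPUs, so the
    // per-GPU share shrinks to 32 at 8 GPUs, too small to keep each GPU busy,
    // which is why 8 GPUs showed little gain over 4.
    Seq(1, 2, 4, 8).foreach { n =>
      println(s"AlexNet:   $n GPU(s) -> per-GPU batch ${256 / n}, total 256")
    }

    // GoogLeNet runs: fixed per-GPU batch of 32, so the effective batch
    // grows as 32 * n (GPU counts quoted in the post: 1, 8, and 4x8 = 32).
    Seq(1, 8, 32).foreach { n =>
      println(s"GoogLeNet: $n GPU(s) -> per-GPU batch 32, effective ${32 * n}")
    }
  }
}</pre>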
<p><b>Open Source</b></p><p>Continuing Yahoo’s commitment to open source, we have contributed some of our code to <a href="https://github.com/BVLC/caffe" target="_blank">github.com/BVLC/caffe</a>:</p><ul><li><a href="https://github.com/BVLC/caffe/pull/2114" target="_blank">#2114</a> … Allow Caffe to use multiple GPUs within a computer</li><li><a href="https://github.com/BVLC/caffe/pull/1148" target="_blank">#1148</a> … Add RDMA transfers across computers</li><li><a href="https://github.com/BVLC/caffe/pull/2386" target="_blank">#2386</a> … Improve Caffe’s data pipeline and prefetching</li><li><a href="https://github.com/BVLC/caffe/pull/2395" target="_blank">#2395</a> … Add timing information</li><li><a href="https://github.com/BVLC/caffe/pull/2402" target="_blank">#2402</a> … Make Caffe’s IO dependencies optional</li><li><a href="https://github.com/BVLC/caffe/pull/2397" target="_blank">#2397</a> … Refactor Caffe’s solver code</li></ul><p>In a follow-up post in the coming weeks, we will share the detailed design and implementation of Caffe-on-Spark. If there is enough interest from the community, we may open source our implementation. Please let us know what you think at <a href="mailto:bigdata@yahoo-inc.com" target="_blank">bigdata@yahoo-inc.com</a>.</p><p><br/></p><p><b>Conclusion</b></p><p>This post describes our early steps in bringing the Apache Hadoop ecosystem and deep learning together on the same heterogeneous (GPU+CPU) cluster. We are encouraged by the early benchmark results and plan to invest further in Hadoop, Spark, and Caffe to make deep learning more effective on our clusters. We look forward to working closely with the open source communities in related areas.</p>