Presenting the Latest in Hadoop
<p><figure data-orig-width="800" data-orig-height="165" class="tmblr-full"><img src="https://66.media.tumblr.com/9fe722cb5f25dee7b1c1f6b9c066211d/tumblr_inline_ochgtiP3q81t17fny_540.jpg" alt="image" data-orig-width="800" data-orig-height="165"/></figure></p><p>If you are an avid Hadoop user – or even just getting started – there is a place in Silicon Valley you can go approximately once a quarter to learn and ask questions about the latest in the technology. That place is the Bay Area Hadoop User Group (HUG), and last week we hosted our <a href="http://www.meetup.com/hadoop/events/232789468/?rv=md1" target="_blank">53rd meetup</a>. In our get-togethers, we surface recent work in this Big Data space that benefits the entire development and user community. In case you missed this latest installment, or would like a recap, below you’ll find the three major topics we reviewed, complete with the videos and slide presentations. Feel free to keep the conversation going by sharing and/or asking us questions. We’ll get back to you!</p><b>Open Source Big Data Ingest with StreamSets Data Collector</b><br/><br/><figure class="tmblr-full tmblr-embed" data-provider="youtube" data-orig-width="560" data-orig-height="315" data-url="https://www.youtube.com/embed/suEYDxn6b2Q"><iframe width="560" height="315" src="https://www.youtube.com/embed/suEYDxn6b2Q" frameborder="0" allowfullscreen=""></iframe></figure><br/>Big Data tools such as Hadoop and Spark allow you to process data at unprecedented scale, but keeping your processing engine fed can be a challenge. Upstream data sources can “drift” due to infrastructure, OS, and application changes, causing ETL tools and hand-coded solutions to fail. <a href="https://github.com/streamsets" target="_blank">StreamSets Data Collector</a> (SDC) is an open source platform for building big data ingest pipelines that allows you to design, execute, and monitor robust data flows. 
In this session, StreamSets community champion Pat Patterson looks at how SDC’s “intent-driven” approach keeps the data flowing, whether you’re processing data “off-cluster,” in Spark, or in MapReduce.<br/><p></p><center><figure class="tmblr-full tmblr-embed" data-provider="unknown" data-orig-width="595" data-orig-height="485" data-url=""><iframe src="//www.slideshare.net/slideshow/embed_code/key/JMXxbh7GJf1OHJ" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe></figure></center><br/><b>Better Together: Fast Data with Apache Spark and Apache Ignite</b><br/><br/><figure class="tmblr-full tmblr-embed" data-provider="youtube" data-orig-width="560" data-orig-height="315" data-url="https://www.youtube.com/embed/DabCtl8dPaI"><iframe width="560" height="315" src="https://www.youtube.com/embed/DabCtl8dPaI" frameborder="0" allowfullscreen=""></iframe></figure><br/>Spark and Ignite are two of the most popular open source projects in the area of high-performance Big Data and Fast Data. But did you know that one of the best ways to boost performance for your next-generation real-time applications is to use them together? In this session, Dmitriy Setrakyan, Apache Ignite Project Management Committee Chairman and co-founder and CPO at GridGain, explains, in detail, how IgniteRDD — an implementation of native Spark RDD and DataFrame APIs — shares the state of the RDD across other Spark jobs, applications, and workers. 
Dmitriy also demonstrates how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or DataFrames.<br/><p></p><center><figure class="tmblr-full tmblr-embed" data-provider="unknown" data-orig-width="595" data-orig-height="485" data-url=""><iframe src="//www.slideshare.net/slideshow/embed_code/key/IuEo03Pn0Sk3Zt" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe></figure></center><br/><b>Recent Development in Apache Oozie</b><br/><br/><figure class="tmblr-full tmblr-embed" data-provider="youtube" data-orig-width="560" data-orig-height="315" data-url="https://www.youtube.com/embed/H-jy6AnUjNA"><iframe width="560" height="315" src="https://www.youtube.com/embed/H-jy6AnUjNA" frameborder="0" allowfullscreen=""></iframe></figure><br/>Yahoo Sr. Software Engineer Purshotam Shah gives the first part of this talk and describes the anatomy of a typical data pipeline and how Apache Oozie meets the demands of large-scale data pipelines. In particular, the talk focuses on recent advancements in Oozie for dependency management among pipeline stages, incremental and partial processing, combinatorial, conditional, and optional processing, priority processing, late processing, and BCP management. The second part of this talk, given by Yahoo Software Engineer Satish Saley, focuses on out-of-the-box support for Spark jobs.<br/><p></p><center><figure class="tmblr-full tmblr-embed" data-provider="unknown" data-orig-width="595" data-orig-height="485" data-url=""><iframe src="//www.slideshare.net/slideshow/embed_code/key/yoEzbZ24nIm9U0" width="595" height="485" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC; border-width:1px; margin-bottom:5px; max-width: 100%;" allowfullscreen=""> </iframe></figure></center>
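<p>To give a flavor of the out-of-the-box Spark support covered in the Oozie talk, here is a minimal sketch of an Oozie workflow that runs a Spark job through the built-in <code>spark</code> action. The application name, class, jar path, and tuning options below are illustrative placeholders, not values from the presentation:</p>

```xml
<!-- Minimal Oozie workflow with a single Spark action (illustrative sketch). -->
<workflow-app name="spark-example-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="spark-node"/>
    <action name="spark-node">
        <spark xmlns="uri:oozie:spark-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Run on YARN in cluster mode -->
            <master>yarn-cluster</master>
            <name>Spark Example</name>
            <!-- Hypothetical application class and jar location -->
            <class>com.example.SparkWordCount</class>
            <jar>${nameNode}/apps/spark-example.jar</jar>
            <spark-opts>--executor-memory 2G --num-executors 4</spark-opts>
            <arg>${nameNode}/data/input</arg>
            <arg>${nameNode}/data/output</arg>
        </spark>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <fail name="fail">
        <message>Spark job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </fail>
    <end name="end"/>
</workflow-app>
```

<p>Because the Spark job is just another workflow action, it can take part in all of the pipeline features discussed above, such as stage dependencies, retries, and error handling, without any custom launcher code.</p>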