
Dash Open 12: Apache Storm 2.0 - Open Source Distributed Real-time Computation System

Transcript:
Rosalie Bartlett: Hi everyone and welcome to the Dash Open Podcast. Dash Open is your source for interesting conversations about open source and other technologies from the open source program office at Verizon Media. Verizon Media is home to many leading brands like Yahoo, Huffington Post, AOL, Tumblr, TechCrunch and many more.
Rosalie Bartlett: My name is Rosalie and I'm on the open source team at Verizon Media. Today on the show, I'm chatting with Kishor Patil. Kishor is a Principal Software Systems Engineer at Verizon Media. Prior to joining Verizon Media, Kishor held engineering roles at OnDeck Capital, Merrill Lynch and Goldman Sachs. Kishor is also an Apache Storm Committer and Project Management Committee member. Welcome to the Dash Open Podcast, Kishor!
Kishor Patil: Hey, thank you, Rosalie. Nice to talk to you.
Rosalie Bartlett: Thank you so much for making the time today, so excited to chat with you. So Kishor, to start off our chat, could you maybe talk a little bit about your focus at Verizon Media?
Kishor Patil: I'm a senior software developer working on the stream platform, more specifically Apache Storm and Kafka. That's the focus. We support all the pipelines and products within Verizon Media.
Rosalie Bartlett: I know that you're working on a lot of exciting things here at Verizon Media. But even more exciting is that you're an Apache Storm committer and you're also a Project Management Committee member. And for folks who might not be familiar with Storm, could you maybe explain what it is?
Kishor Patil: Apache Storm is a streaming platform that allows streaming and stream processing and a lot of tools coming in streaming from different systems such as your websites, maybe it could be internet of things or payments systems or a banking system where a lot of transactions get generated and the events flow in terms of streams and they need to be processed and a lot of questions that could be answered, especially with the explosion of big data.
Kishor Patil: Hadoop is batch-oriented where a lot of data is captured, stored and then we do processing. But as the events are getting generated and a lot of data is getting generated on these systems, a lot of questions could be answered on the fly, such as, for example, fraud detection or other systems where immediate response to the user is useful.
Kishor Patil: Batch processing is one aspect of Hadoop, but Storm is another aspect which is stream processing and immediate response. That completes the Lambda architecture that we talk about a lot.
Rosalie Bartlett: And how many years have you been involved with the Storm community?
Kishor Patil: I have been involved with Storm for almost five and a half years now.
Rosalie Bartlett: Earlier, you mentioned some ways that folks could use Storm and I'm very curious - what are some of the ways that Verizon Media is using Storm?
Kishor Patil: Sure. Verizon Media uses Storm in many ways. One of the examples is our Ad Platforms use it a lot to capture ad events and process them. There is also monitoring that we use for application-level as well as system-level events and performance. Those events that are generated by all these systems flow into a centralized system where they can be analyzed, the data could be viewed and graphed, and all of these events are processed using Storm as a stream processing engine where the pipeline is set up, so all these events coming from different systems get collected and processed and stored in different databases.
Kishor Patil: Another example is we have Flickr and other products which are basically using events that are stored and do machine learning events on top of it such as tagging pictures with different tags once users have uploaded them. That is another use-case.
Kishor Patil: The third use-case is web crawling, which requires a lot of processing as well as caching.
So, we have a combined set up where all sorts of web crawling events are processed and stored simultaneously. And all these event-based streams are managed using Storm pipelines and the Storm platform.
Rosalie Bartlett: Wow. So, Storm is heavily used. You've been with the Storm community, using Storm, for more than five years. And the reason that's interesting is because you've probably seen a lot of change. Today, we're going to be talking about Storm 2.0. What's happening in Storm 2.0? What's new?
Kishor Patil: So, Storm 2.0 is a very interesting change. A couple of those changes are primarily performance enhancement and there are preliminary numbers.
Kishor Patil: We planned to run benchmarks, but preliminary numbers suggest 50% to 80% improvement on sample use-cases in terms of latency. And that is a very promising change that, under the hood, we have changed the queue in the infrastructure and we have added true back pressure, which enables events to flow at much faster speeds through the network and that requires minimum effort from the users, but it gives a lot of value to the end-users of the systems. So, that's one of the major changes that is coming in.
Kishor Patil: Other changes are associated with resource-aware scheduling and more specifically, I would like to talk about generic resource awareness. A lot of scheduling complexities can be addressed to give better benefit and value, as well as utilization, for the hardware that is being used.
Kishor Patil: The scheduler is aware of the environment or the cluster in which it is running and the resources that are available and the better it can understand the requirements from the people submitting the jobs. The balancing act can give you better latency and better value for the hardware that is being utilized.
Kishor Patil: One of the keys is generic resources such as network or GPUs.
If you have special use cases and some other specific hardware that we might have and we want to utilize, it can be added now and can be utilized as an extra resource apart from CPU and memory and that makes it a better aware scheduler and that can give you better performance as well as better value, especially utilization wise. It will give better value and save money on the cost of operations.
Rosalie Bartlett: And with Storm 2.0, what types of contributions did your team make to it? What was Verizon Media's involvement?
Kishor Patil: Generic resource-aware scheduler, which we've already talked about, has been one of the key contributions, as well as performance enhancements, security enhancements that we were part of and also enabling backward compatibility.
Kishor Patil: Backward compatibility was very important from a Verizon Media perspective. So, what we needed was a smooth transition and that's where backward compatibility was one of the focuses. What that enables us to do is to migrate the clusters and give individual users, pipeline owners, the ability to switch at their own pace and start benefiting from improvements and performance enhancements that 2.0 provides. This contribution back into the community is also going to help adoption and upgrades for all other teams that utilize it in the community.
Rosalie Bartlett: And what is it like to work on open source software? To you, what makes it special?
Kishor Patil: So to me, it's a natural evolution of any community, from doing things on their own versus start sharing and contributing and the community grows at a better pace.
Kishor Patil: It all started with people sharing libraries in the community through Maven Repository versus now sharing code and sharing platforms or frameworks that people are using.
So, what actually interests me is anything that we develop is shared and we, as a community, can come together, take on a specific problem, collaborate and come up with a better solution as well as improve on a given product for applying the same product for multiple use-cases such that the longevity of a product goes much higher versus a small team doing something in isolation. Working in a specific area, a product doesn't grow as generically and does not apply to many use-cases. So, the product starts to become out of fashion very soon and out of use as well because products designed for one specific use-case tend to be a lot of repetition and reinventing the wheel. So, that's what makes me happy to work in the open-source community.
Rosalie Bartlett: And for folks listening in, who are thinking, "Wow, open source sounds awesome," what advice do you have for them if they want to start getting involved in the open source community? How should they start?
Kishor Patil: I'm sure you're using more than one of the open source projects if you are a developer listening to this. And one of the best parts of getting involved in the open source community is a very low entry barrier. You like some product, you start looking into the product, you have a problem and you're using a specific product, you can start reporting it. You can start reporting bugs about the product or library that you're using. You can start looking at the code as to why it’s behaving in a certain fashion and that's an entry point. You start reporting and then soon you start fixing things because once you are using the product and you have specific ways you think the product should work, you can start contributing back and start sharing your patches into the community and the code is always available in open source on GitHub and people can come and contribute, comment on each other's suggestions and you can start commenting on those suggestions.
So that's how you can get in and start contributing to products that you like.
Rosalie Bartlett: Really great advice. And if folks want to connect with you or if they want to learn more about Storm 2.0, where should they go?
Kishor Patil: Storm.apache.org is a great place to start. My email ID is kishorvpatil@apache.org. You can always reach out to me. There is a developer.yahoo.com website where you can go and start reporting or contributing. There is a Twitter feed at @ApacheStorm. Feel free to chime in.
Rosalie Bartlett: Awesome. Well, Kishor, it has been so great to chat with you today. Storm 2.0 sounds very exciting, so I just wanted to say thank you.
Kishor Patil: Thank you, Rosalie. It was nice talking to you.
Gil Yehuda: If you enjoyed this episode and you wanted to learn more about our open source program at Verizon Media or other technologies that we have available, please visit us at developer.yahoo.com. You can also find us on Twitter at YDN.
