Dash Open 08: Bullet - Open Source Real-time Query Engine for Large Data Streams

Transcript:
Rosalie Bartlett: Hi everyone and welcome to the Dash Open Podcast. Dash Open is your resource for interesting conversations about open source and other technologies, from the Open Source Program Office at Verizon Media. Verizon Media is the parent company of Yahoo, Huffington Post, AOL, Tumblr, TechCrunch and many other beloved internet brands. My name is Rosalie and I'm on the Open Source team at Verizon Media. Today on the show, I'm thrilled to chat with Nate Speidel, who is a Software Engineer on the audience data team at Verizon Media. Nate is also one of the creators of Bullet, which is an open source real-time query engine for very large data streams.

Rosalie Bartlett: Nate has a Bachelor's Degree in Mathematics from the University of California, Santa Cruz, as well as a Master's Degree in Computer Science from the University of California, San Diego. Welcome to the podcast, Nate.

Nate Speidel: Thanks so much for having me.

Rosalie Bartlett: Before we chat about this awesome project called Bullet, could you share a little bit about your focus at Verizon Media?

Nate Speidel: Sure. I work on the audience data team. Audience data is all the user-engagement data that is generated on our websites and apps. If a user goes to a website or an app and clicks on a link or views a page, that creates a click event or a view event, and all that data is sent to our audience data pipeline. Our team is responsible for aggregating and processing all that data, and then we build tools that other teams within Verizon Media use to do things like A/B testing or machine learning, or that management uses to generate the key performance indicators behind design and product decisions.

Nate Speidel: One of the tools that we produce is a large-scale Hive database. To a user, Hive is just like a SQL database - you can go run SQL queries - but Hive is on HDFS, so it's large-scale distributed data.
We also produce some lower-latency Kafka streams that teams can plug into. On a day-to-day basis, I'm working with Hive, HDFS, and Oozie to do big-batch data processing, and also Storm, Spark, and Kafka for our low-latency streaming data processing.

Rosalie Bartlett: Can you maybe tell our listeners a little bit about Bullet? What problem does it solve and what inspired you to create it?

Nate Speidel: Originally, Bullet was designed and created to solve a pretty specific problem that we had here at Yahoo (now Verizon Media), and we quickly realized that the problem we were facing was a specific instance of a more general problem. The problem we were facing is this - imagine you're a software engineer and it's your job to write the code that runs on the front-end, on the web page or the app, and actually generates the data that's sent off to this audience data pipeline. We provide libraries to make that easy, but as a software engineer, you write this code and it's supposed to be generating these events that are sent off through our audience data pipeline.

Nate Speidel: Once you've done that, you want to be able to go query the system somewhere and find the events you're generating, in order to validate, first of all, that they arrived and, second of all, that they are correct - that they have the right fields and the right values that you engineered them to have. You want to be able to do that in an end-to-end way. It's pretty easy to test that locally, to capture events leaving your device, but you really want to be able to capture those events as they're coming out the end of the system into the Hive database and check that they're correct in an end-to-end fashion.

Nate Speidel: Before Bullet, there was really no easy way to do that.
We did have low-latency feeds, like I mentioned, the Kafka streams, but if you want to consume from Kafka in order to find these events quickly, that's quite difficult. You have to write a Kafka consumer, you have to add your custom logic to filter out the events you're interested in, and you have to run it somewhere - and that was too time-consuming.

Nate Speidel: More commonly, what teams would do is generate these events and then wait for them to surface in the Hive database. Since that's a big batch process, it would take on the order of an hour or two, and then they could go query the Hive database, find the events they were interested in seeing - the events that they actually generated - and check to make sure that it was all correct. But engineers don't want to wait two hours to find their events. And a lot of times, if they check and it's not quite right and they go back and fix it, then to iterate, they would need to wait another two hours.

Nate Speidel: There was just no easy way for them to find the actual events that they were generating and validate that they were correct. We needed a system that we could attach on top of the audience data pipeline, that would allow engineers to inject arbitrary queries and quickly find events with whatever custom filter criteria they have. And that's exactly the problem that Bullet solves. Bullet is a pluggable, streaming query system, so you can plug it into any kind of streaming data system.

Nate Speidel: It allows users - perhaps many, many different users - to quickly and easily inject queries into the system, and then Bullet filters the data as it's flowing through the system. So it solved the problem very nicely. Once we had Bullet set up, engineers could create a query, say, filtering on their specific browser ID.
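As a rough sketch of the manual, pre-Bullet approach Nate describes - a hand-written Kafka consumer with custom filter logic - the following uses the kafka-python client; the topic name, broker address, and field names ("browser_id", "event_type") are illustrative assumptions, not the real pipeline's schema.

```python
import json

# Hypothetical filter criteria: find only the events this engineer generated.
# Field names here are placeholders, not the real audience-data schema.
def matches(event, browser_id):
    """The custom filter logic each engineer would have to rewrite by hand."""
    return (
        event.get("browser_id") == browser_id
        and event.get("event_type") in ("click", "view")
    )

def consume_my_events(browser_id, topic="audience-events"):
    """Tail the whole topic and yield only matching events."""
    # kafka-python is one common client; the broker address is a placeholder.
    from kafka import KafkaConsumer  # requires: pip install kafka-python
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    for message in consumer:  # blocks, reading the entire stream
        if matches(message.value, browser_id):
            yield message.value
```

Every engineer wanting to validate their events ends up writing, deploying, and tearing down some variant of this loop - which is exactly the overhead Bullet removes.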
Once they launch the query, it's in the system, and they can go to the front-end code they've written and generate events, and Bullet will capture them instantly and give them right back, and they can check to see that they're correct. If they're not, they can fix it and iterate very quickly.

Rosalie Bartlett: Awesome. So it seems like there was a big problem and you needed a solution, and one didn't exist, so you built it.

Nate Speidel: Yes, that's exactly right, and Bullet has been quite popular internally. People use it a lot.

Rosalie Bartlett: So what are some issues, other than the one you just discussed, that could be solved using Bullet? And what exactly is required to run Bullet?

Nate Speidel: Bullet is really designed to quickly query and aggregate data. Because of the way it's designed, Bullet is always processing all the data, and it always processes all the data exactly once, regardless of the number of queries in the system. So if you have a query in the system, then subsequent queries after that are nearly free of charge. It scales really well with regard to the number of queries, which makes it perfect for multi-tenant applications. It's also great for aggregating and profiling the data as it's flowing through the streaming system, on the fly. For example, recently we set up an instance of Bullet on Reddit data. Reddit provides an API where you can stream comments in real-time.

Nate Speidel: One of the fields that the Reddit API provides is a subreddit field. So we did a top-K and were able to break down all the comments by their subreddit and see which subreddits were getting the most comments in real-time. Bullet does windowing, so we just ran the query for a long time and had it send back results every two seconds. And so every two seconds, we could see a real-time update of how many comments were in each subreddit.
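The windowed top-K Nate describes - bucketing the comment stream into two-second windows and emitting the most-commented subreddits per window - can be sketched in plain Python. The window length matches the example above, but the exact counting here is illustrative; Bullet itself computes top-K approximately via DataSketches.

```python
from collections import Counter

WINDOW_SECONDS = 2

def windowed_top_k(events, k=3):
    """Group (timestamp, subreddit) comment events into 2-second windows
    and return the k most-commented subreddits for each window."""
    results = []
    counts, window_start = Counter(), None
    for ts, subreddit in events:
        if window_start is None:
            window_start = ts
        # Close the current window when the next event falls outside it.
        while ts >= window_start + WINDOW_SECONDS:
            results.append(counts.most_common(k))
            counts, window_start = Counter(), window_start + WINDOW_SECONDS
        counts[subreddit] += 1
    if counts:
        results.append(counts.most_common(k))  # flush the last window
    return results

# Simulated stream: (arrival time in seconds, subreddit of the comment).
stream = [(0.1, "news"), (0.5, "aww"), (1.2, "news"),
          (2.3, "gaming"), (2.4, "gaming"), (3.0, "news")]
print(windowed_top_k(stream, k=2))
# → [[('news', 2), ('aww', 1)], [('gaming', 2), ('news', 1)]]
```

In a real deployment the windows would close on wall-clock time as data flows; batching by event timestamp keeps the sketch self-contained.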
As far as running Bullet, the back-end is currently written in Storm and Spark Streaming, so you'd have to have Storm or Spark Streaming set up to run the back-end of Bullet, which does all the heavy lifting.

Nate Speidel: The web service is written in Spring Boot, but that compiles down to a jar, which you can just run anywhere. And the UI is written in Ember, which is easy to run with Node. So running it is pretty straightforward once you have a Storm or Spark Streaming setup. The other thing I should mention is that the UI is one of the really great features of Bullet, I think. The UI provides a query builder, so if you're not comfortable with SQL-like queries, you can use the graphical user interface to build your queries instead, and it makes it very easy. The UI also has all kinds of great data visualization capabilities, like time-series graphing and pivot tables. So I think the UI is one of the really great features of Bullet, as well.

Rosalie Bartlett: Are there any services similar to Bullet that currently exist?

Nate Speidel: Bullet is most commonly compared to the streaming SQL systems that are out there, like Kafka SQL, Flink SQL, and Spark Streaming SQL. Those systems are good for launching arbitrary SQL-like queries, and then the system - say, for example, Kafka SQL - will take care of moving all the data, however it needs to move and filter it, in order to execute that query, and then it will return the result to you. Bullet differs from these systems in two main ways. The first is that all of these systems will effectively create and launch a whole new graph of data flow in order to move the data around and aggregate it in such a way as to actually execute the query.
Nate Speidel: So each query you launch will be moving and filtering all the data for that query. If you launch a Kafka SQL query, it does a lot of work to execute that query, and then if you immediately launch another one, it's effectively going to do all that work again. It will be moving all the data again. It will be filtering all the data again. So these systems don't scale really well with regard to the number of queries you need to launch - if you launch 10 Kafka SQL queries, it's going to do 10 times as much work as if you launched just one. Bullet, on the other hand, filters all the data just once, extracting only the data it needs to satisfy whatever queries are currently in the system. So it scales much better with regard to the number of queries you can execute.

Nate Speidel: If you're doing five Bullet queries and then you launch five more, those come almost free of charge with regard to the amount of computational power required to execute them. So it scales much better for the number of queries and the number of users. The other big difference is that Bullet uses another open source library called DataSketches, which allows it to solve some of the more computationally-difficult problems - problems like top-K, count-distinct, and computing quantiles. Bullet can solve these problems very quickly using DataSketches, and generally speaking, these other streaming query systems can't do that yet.

Rosalie Bartlett: All of that is very exciting, but even more exciting is that Bullet is open source. Why did your team decide to open source Bullet?

Nate Speidel: Well, I think a lot of the innovation that is happening at the cutting edge of technology is so complicated and technologically advanced that being able to address these challenges as a community is absolutely essential. I don't think these are problems that we'd be able to solve as a single team or as a single company.
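DataSketches provides sketches such as HLL for count-distinct, KLL for quantiles, and a frequent-items sketch for top-K. As a rough illustration of how a frequent-items sketch answers top-K in one pass with bounded memory, here is a minimal Misra-Gries summary in plain Python - a simplification of the idea behind the real library, not Bullet's actual code.

```python
def misra_gries(stream, num_counters=3):
    """Approximate heavy hitters in one pass using at most num_counters
    counters. Any item occurring more than len(stream)/(num_counters+1)
    times is guaranteed to survive in the returned summary."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            # Counters are full: decrement every counter and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# "news" dominates this stream, so it must appear in the summary even
# though we only ever keep two counters in memory.
stream = ["news"] * 6 + ["aww", "gaming", "pics", "aww", "movies"]
print(misra_gries(stream, num_counters=2))
```

The memory bound is the point: the summary size is fixed regardless of how many distinct subreddits flow past, which is what makes sketch-based top-K cheap enough to run continuously inside a streaming system.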
In that sense, I think that ultimately, the open source community is really the core of what makes all this incredible technology possible. I can really vouch for that because every day, we're using open source tools like Storm and HDFS and Kafka to do all the things we're doing. And without them, our capabilities wouldn't be nearly what they are. So in that regard, I think the open source community is super important and super exciting to be a part of.

Rosalie Bartlett: Absolutely! So Nate, for folks listening in today who are saying, "Whoa, Bullet seems really awesome," where can they learn more about it?

Nate Speidel: All of our code is available on GitHub. You can just go to GitHub and search for Yahoo Bullet and you'll be able to find it pretty quickly. We also have our main documentation on bullet-db.github.io, so you can check it out there. It has all the explanations about the high-level architecture and how it works. It also has links to videos and the code itself.

Rosalie Bartlett: If folks want to connect with you, what's the best way for them to do that?

Nate Speidel: The whole Bullet team can be reached at our Google Groups email address, which is bullet-users@googlegroups.com. Also, if people want to find me, I'm Nathan Speidel on LinkedIn. And you can also email me - my work email is nspeidel@verizonmedia.com.

Rosalie Bartlett: Well, Nate, it has been so great to chat with you today. Thank you so much for sharing about Bullet.

Nate Speidel: Thank you very much.

Rosalie Bartlett: If you enjoyed this episode of Dash Open and would like to learn more about open source projects at Verizon Media, please visit developer.yahoo.com. You can also find us on Twitter @YDN.
