Bullet Updates - Windowing, Apache Pulsar PubSub, Configuration-based Data Ingestion, and More
<p>By <a href="https://www.linkedin.com/in/akshai-sarma-9029b011/">Akshai Sarma</a>, Principal Engineer, Verizon Media &amp; <a href="https://www.linkedin.com/in/brian-xiao-77276450/">Brian Xiao</a>, Software Engineer, Verizon Media<br/></p><p>This is the first of an ongoing series of blog posts sharing releases and announcements for <a href="https://bullet-db.github.io/">Bullet</a>, an open-sourced lightweight, scalable, pluggable, multi-tenant query system.</p><p>Bullet allows you to query, through its UI or API, any data flowing through a streaming system without having to store it first. The queries are injected into the running system and have minimal overhead. Running hundreds of queries generally fits into the overhead of just reading the streaming data. Bullet requires running an instance of its backend on your data. This backend runs on common stream processing frameworks (Storm and Spark Streaming are currently supported).</p><p>The data on which Bullet sits determines what it is used for. For example, our team runs an instance of Bullet on user engagement data (~1M events/sec) so that developers can find their own events and validate the code that produces this data. We also use this instance to interactively explore data, throw up quick dashboards to monitor live releases, count unique users, debug issues, and more.</p><p>Since <a href="https://yahooeng.tumblr.com/post/161855616651/open-sourcing-bullet-yahoos-forward-looking">open sourcing Bullet in 2017</a>, we’ve been hard at work adding many new features! We’ll highlight some of these here and continue sharing update posts for future releases.</p><p><b>Windowing</b></p><p>Bullet used to operate in a request-response fashion - you would submit a query and wait for the query to meet its termination conditions (usually duration) before receiving results. For short-lived queries, say, a few seconds, this was fine. 
But as we started fielding more interactive and iterative queries, waiting even a minute for results became too cumbersome.</p><p>Enter windowing! Bullet now supports time-based and record-based windowing. With time windowing, you can break up your query into chunks of time over its duration and retrieve results for each chunk. For example, you can calculate the average of a field, and stream back results every second:</p><div style="text-align:center;"><figure class="tmblr-embed tmblr-full" data-provider="youtube" data-orig-width="540" data-orig-height="304" data-url="https%3A%2F%2Fwww.youtube.com%2Fembed%2FHKEkHnnq7Yo"><iframe width="540" height="304" id="youtube_iframe" src="https://www.youtube.com/embed/HKEkHnnq7Yo?feature=oembed&enablejsapi=1&origin=https://safe.txmblr.com&wmode=opaque" frameborder="0" allowfullscreen=""></iframe></figure></div><p>In the above example, the aggregation is operating on all the data since the beginning of the query, but you can also do aggregations on just the windows themselves. This is often called a <i>Tumbling</i> window:<br/></p><figure data-orig-width="936" data-orig-height="354" class="tmblr-full"><img src="https://66.media.tumblr.com/e20d505ff2dddf5e646f126523ea4f9a/tumblr_inline_pnyky3M0Gi1wxhpzr_540.png" alt="image" data-orig-width="936" data-orig-height="354"/></figure><p>With record windowing, you can get the intermediate aggregation for each record that matches your query (a <i>Sliding</i> window). Or you can do a <i>Tumbling</i> window on records rather than time. 
For example, you could get results back every three records:<br/></p><figure data-orig-width="1334" data-orig-height="514" class="tmblr-full"><img src="https://66.media.tumblr.com/a4c9350a85b8a29345ce92fe1498f91f/tumblr_inline_pnykzfHVar1wxhpzr_540.png" alt="image" data-orig-width="1334" data-orig-height="514"/></figure><p>Windows that overlap in other ways (Hopping windows) or that reset based on different criteria (Session windows, Cascading windows) are currently being worked on. Stay tuned! <br/></p><figure data-orig-width="946" data-orig-height="492" class="tmblr-full"><img src="https://66.media.tumblr.com/0e341961272f73dd68c0570ed3e9ac07/tumblr_inline_pnyl3nZiUB1wxhpzr_540.png" alt="image" data-orig-width="946" data-orig-height="492"/></figure><figure data-orig-width="950" data-orig-height="412" class="tmblr-full"><img src="https://66.media.tumblr.com/7c6699716b3e92703e23a9ef6cc1b3a3/tumblr_inline_pnyl47zEuK1wxhpzr_540.png" alt="image" data-orig-width="950" data-orig-height="412"/></figure><p><b>Apache Pulsar support as a native PubSub</b><br/></p><p>Bullet uses a PubSub (publish-subscribe) message queue to send queries and results between the Web Service and Backend. As with everything else in Bullet, the PubSub is pluggable. You can use your favorite PubSub by implementing a few interfaces if you don’t want to use the ones we provide. Until now, we’ve maintained and supported a REST-based PubSub and an <a href="https://kafka.apache.org/">Apache Kafka</a> PubSub. Now we are excited to announce support for <a href="http://pulsar.apache.org/">Apache Pulsar</a> as well! <a href="https://bullet-db.github.io/pubsub/pulsar/">Bullet Pulsar</a> will be useful to those users who want to use Pulsar as their underlying messaging service.<br/></p><p>If you aren’t familiar with Pulsar, setting up a local standalone instance is very simple, and by default, any Pulsar topic you write to is created automatically. 
Setting up an instance of Bullet with Pulsar instead of REST or Kafka is just as easy. You can refer to <a href="https://bullet-db.github.io/pubsub/pulsar/">our documentation</a> for more details.</p><figure class="tmblr-full"><img src="https://66.media.tumblr.com/ec94a8fdb017dda1b1a32af719684f3b/tumblr_inline_pnyr976xrW1wxhpzr_1280.png" alt="image"/></figure><p><b>Plug your data into Bullet without code</b></p><p>While Bullet worked on any data source located in any persistence layer, you still had to implement an interface to connect your data source to the Backend and convert it into a record container format that Bullet understands. For instance, your data might be located in Kafka and be in the Avro format. If you were using Bullet on Storm, you would perhaps write a Storm Spout to read from Kafka, deserialize, and convert the Avro data into the Bullet record format. This was the only interface in Bullet that required our customers to write their own code. Not anymore! Bullet DSL is a text/configuration-based format for users to plug in their data to the Bullet Backend without having to write a single line of code.</p><p><a href="https://bullet-db.github.io/backend/dsl/">Bullet DSL</a> abstracts away the two major components for plugging data into the Bullet Backend: a Connector piece to read from arbitrary data sources and a Converter piece to convert the data that was read into the Bullet record container. We currently support and maintain a few of these - Kafka and Pulsar for Connectors, and Avro, Maps, and arbitrary Java POJOs for Converters. The Converters understand typed data and can even do a bit of minor ETL (Extract, Transform and Load) if you need to change your data around before feeding it into Bullet. As always, the DSL components are pluggable and you can write your own (and contribute it back!) if you need one that we don’t support.</p><p>We appreciate your feedback and contributions! 
Explore Bullet on <a href="https://github.com/bullet-db">GitHub</a>, use and help contribute to the project, and chat with us on <a href="https://groups.google.com/forum/#!forum/bullet-users">Google Groups</a>. To get started, try our Quickstarts on <a href="https://bullet-db.github.io/quick-start/spark/">Spark</a> or <a href="https://bullet-db.github.io/quick-start/storm/">Storm</a> to set up an instance of Bullet on some fake data and play around with it.</p>
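<p>As a rough illustration of the windowing modes described in the Windowing section, here is a minimal Python sketch. It is not Bullet's implementation or API: the function names, the (timestamp, value) event shape, and the choice to skip empty windows and emit a final partial record window are all assumptions made for this example. It contrasts a time window whose average covers all data since the start of the query with a Tumbling time window, and shows a Tumbling window over every three records.</p>

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event shape for this sketch: (timestamp in seconds, field value).
Event = Tuple[float, float]


def _window_sums(events: Iterable[Event], window_secs: float) -> List[Tuple[int, float, int]]:
    """Group events into fixed time windows; return (window index, sum, count).

    Windows that received no events are simply absent (a simplification).
    """
    sums: Dict[int, Tuple[float, int]] = {}
    for ts, value in events:
        index = int(ts // window_secs)
        total, count = sums.get(index, (0.0, 0))
        sums[index] = (total + value, count + 1)
    return [(i, total, count) for i, (total, count) in sorted(sums.items())]


def tumbling_time_averages(events: Iterable[Event], window_secs: float) -> List[Tuple[int, float]]:
    """Average of the field computed over each window on its own."""
    return [(i, total / count) for i, total, count in _window_sums(events, window_secs)]


def cumulative_time_averages(events: Iterable[Event], window_secs: float) -> List[Tuple[int, float]]:
    """Result emitted per window, aggregated over all data since the query began."""
    results = []
    running_total, running_count = 0.0, 0
    for i, total, count in _window_sums(events, window_secs):
        running_total += total
        running_count += count
        results.append((i, running_total / running_count))
    return results


def tumbling_record_windows(values: Iterable[float], size: int) -> Iterator[List[float]]:
    """Yield non-overlapping chunks of `size` records (last chunk may be shorter)."""
    it = iter(values)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


events = [(0.2, 10.0), (0.7, 20.0), (1.1, 30.0), (2.5, 50.0), (2.9, 70.0)]
print(cumulative_time_averages(events, 1.0))  # [(0, 15.0), (1, 20.0), (2, 36.0)]
print(tumbling_time_averages(events, 1.0))    # [(0, 15.0), (1, 30.0), (2, 60.0)]
print(list(tumbling_record_windows([1, 2, 3, 4, 5, 6, 7], 3)))  # [[1, 2, 3], [4, 5, 6], [7]]
```

<p>The two time-based functions share the same grouping step and differ only in whether the running totals are reset at each window boundary, which is the distinction between aggregating since the start of the query and aggregating on just the windows themselves.</p>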