
Dash Open 07: Oak - Open Source Scalable Concurrent Key-Value Map for Big Data Analytics

Transcript:
Paul Donnelly: Hi everyone, and welcome to the Dash Open Podcast. Dash Open is your source for interesting conversations about open source and other technologies from the open source program office at Verizon Media. We're home to many leading brands, including Yahoo, Huffington Post, AOL, TechCrunch, and many more. My name is Paul Donnelly, and I'm a principal engineer at Verizon Media. Today on the podcast, I'm excited to chat with Eshcar, who's a Senior Research Scientist at Verizon Media, and Eddie, who is a Director of Research. Eddie and Eshcar, can you talk about your focus at Verizon Media?

Eddie Bortnikov: We are both part of the scalable systems team. We both originally come from Yahoo Research, and we're physically located in Haifa, Israel. Our team specializes in developing distributed and parallel systems. This domain requires deep knowledge of environments and platforms in which many things happen in parallel and can obstruct each other. It requires a very specific kind of algorithmic thinking. On the product side, we are trying to help as many product partners as possible. Over the years we've contributed to a bunch of open-source technologies that are widely used at Yahoo, especially within the Hadoop stack, starting from the file system, the HBase key-value store, and the upper-level technologies.

Paul Donnelly: We're here to talk about Oak, which was just recently open-sourced. Can you tell me what Oak stands for or how you came up with the name?

Eddie Bortnikov: Oak is related in its context to the larger project that this technology serves, which is the Druid project. Druid is an open-source technology for in-memory real-time analytics, which is widely used at Verizon Media by a variety of analytics projects. In this context, we thought at some point that a specific part of Druid that deals with high-speed data ingestion could be improved through a variety of algorithms. So we came up with this algorithmic idea of a concurrent data structure, and when we started looking for a name, one of the folks on our team said, "Well, Druid is an ancient religion and Oak was a sacred tree for those guys. Why don't we use the name Off-heap Allocated Keys?"

Paul Donnelly: What is Oak?

Eshcar Hillel: Oak is a scalable in-memory key-value map that can ingest a very high volume of data very efficiently and also serve queries. It can store variable-size keys and values, which can be very big. For example, if you think of Druid, the keys are very big in Druid and may have multiple dimensions, and the values themselves are also very large and include several items or aggregators. So this is what Oak is.

Paul Donnelly: Why was Oak created?

Eshcar Hillel: Well, as Eddie mentioned, the need originally came from Druid. They have an incremental index which processes all the data in memory and serves it from memory, but the current implementation is bounded by the size of the JVM heap. It can only grow to several tens of gigabytes. And they came up with a need to be able to get some of the keys and values off-heap, which allows you to utilize a much larger portion of the memory. And this is what Oak does: it allows you to allocate your keys and values off-heap, and it really frees you from the need to do GC on them. And this also makes it more efficient.

Eddie Bortnikov: At the end of the day, it's about doing more with less hardware. So imagine that we were using hundreds of machines to ingest the data; this Oak library actually allows you to do the same with dozens.

Paul Donnelly: Now that Oak is open source, who do you think would find it valuable?

Eddie Bortnikov: Key-value maps are a very basic construct in computer science, and they are used everywhere. So that was part of our motivation to release this project as a separate technology that can be used not only by Druid but by other software applications. We can't even envision every possible use. So right now the Druid community is our first customer, and we are trying to make Oak as efficient as possible to serve the Druid needs.

Paul Donnelly: Where can folks learn more about Oak or contribute to it?

Eshcar Hillel: github.com/yahoo/oak

Paul Donnelly: Thank you for talking with us, Eshcar and Eddie. If you enjoyed this episode of Dash Open and would like to learn more about open source and other technologies at Verizon Media, please visit developer.yahoo.com. You can also find us on Twitter @ydn.
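
Editor's illustration: the off-heap idea Eshcar describes, keeping the bytes of keys and values outside the JVM heap while an ordered concurrent index points at them, can be sketched with standard JDK classes only. This is a minimal sketch under stated assumptions and is not Oak's API or implementation: the class name `OffHeapMapSketch` and its methods are hypothetical, keys and values are plain strings, and the sketch does not manage or reclaim off-heap memory itself.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical illustration only: NOT Oak's API or implementation.
final class OffHeapMapSketch {
    // The skip-list index and the small ByteBuffer handles live on-heap,
    // but the key and value bytes they point at are allocated off-heap.
    private final ConcurrentSkipListMap<ByteBuffer, ByteBuffer> index =
            new ConcurrentSkipListMap<>();

    // Copy a string's UTF-8 bytes into a newly allocated direct (off-heap) buffer.
    private static ByteBuffer toOffHeap(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length);
        buf.put(bytes);
        buf.flip(); // position = 0, limit = length, ready for reading
        return buf.asReadOnlyBuffer(); // guard against accidental mutation
    }

    // Decode a buffer back into a String; duplicate() leaves the
    // stored buffer's position untouched, which keeps key ordering stable.
    private static String fromBuffer(ByteBuffer buf) {
        return StandardCharsets.UTF_8.decode(buf.duplicate()).toString();
    }

    void put(String key, String value) {
        index.put(toOffHeap(key), toOffHeap(value));
    }

    String get(String key) {
        // A short-lived heap buffer is enough for the lookup: ByteBuffer.compareTo
        // compares the remaining bytes, regardless of where they are stored.
        ByteBuffer probe = ByteBuffer.wrap(key.getBytes(StandardCharsets.UTF_8));
        ByteBuffer value = index.get(probe);
        return value == null ? null : fromBuffer(value);
    }

    // Smallest key in byte-wise order, or null if the map is empty.
    String firstKey() {
        return index.isEmpty() ? null : fromBuffer(index.firstKey());
    }

    int size() {
        return index.size();
    }

    public static void main(String[] args) {
        OffHeapMapSketch map = new OffHeapMapSketch();
        map.put("banana", "2");
        map.put("apple", "1");
        System.out.println("apple -> " + map.get("apple"));
        System.out.println("first key: " + map.firstKey());
    }
}
```

The sketch only shows where the bytes live. A production map such as the one discussed in the episode would also need its own memory management and support for concurrent in-place updates, neither of which is attempted here.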
