HBase Goes Fast and Lean with the Accordion Algorithm
<p><a href="https://yahooresearch.tumblr.com/post/161742308886/hbase-goes-fast-and-lean-with-the-accordion" class="tumblr_blog" target="_blank">yahooresearch</a>:</p><blockquote>
<p>By Edward Bortnikov, Anastasia Braginsky, and Eshcar Hillel</p>
<figure class="tmblr-full" data-orig-height="500" data-orig-width="500"><img src="https://66.media.tumblr.com/08dfd7c77412ad46d26a56346b08fb77/tumblr_inline_orfajleJh01rgj0aw_540.gif" data-orig-height="500" data-orig-width="500"/></figure><p>Modern products powered by NoSQL key-value (KV-)storage technologies exhibit ever-increasing performance expectations. Ideally, NoSQL applications would like to enjoy the speed of in-memory databases without giving up on reliable persistent storage guarantees. Our Scalable Systems research team has implemented a new algorithm named Accordion, that takes a significant step toward this goal, into the forthcoming release of Apache HBase 2.0.</p>
<p><a href="https://hbase.apache.org/" target="_blank">HBase</a>, a distributed KV-store for Hadoop, is used by many companies every day to scale products seamlessly with huge volumes of data and deliver real-time performance. At Yahoo, HBase powers a variety of products, including Yahoo Mail, Yahoo Search, Flurry Analytics, and more. Accordion is a complete re-write of core parts of the HBase server technology, named RegionServer. It improves the server scalability via a better use of RAM. Namely, it accommodates more data in memory and writes to disk less frequently. This manifests in a number of desirable phenomena. First, HBase’s disk occupancy and write amplification are reduced. Second, more reads and writes get served from RAM, and less are stalled by disk I/O. Traditionally, these different metrics were considered at odds, and tuned at each other’s expense. With Accordion, they all get improved simultaneously.</p>
<p>We stress-tested Accordion-enabled HBase under a variety of workloads. Our experiments exercised different blends of reads and writes, as well as different key distributions (heavy-tailed versus uniform). We witnessed performance improvements across the board. Namely, we saw write throughput increases of 20% to 40% (depending on the workload), tail read latency reductions of up to 10%, disk write reductions of up to 30%, and also some modest Java garbage collection overhead reduction. The figures below further zoom into Accordion’s performance gains, compared to the legacy algorithm.</p>
<p><a href="https://s.yimg.com/ge/default/691231/Accordion1.png" target="_blank"></a></p>
<a href="https://s.yimg.com/ge/default/691231/Accordion1.png" target="_blank"><figure class="tmblr-full" data-orig-height="276" data-orig-width="528" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion1.png"><img src="https://66.media.tumblr.com/e965d97feebea0c7f99e3e049e083717/tumblr_inline_p7g6lbrmdn1t17fny_540.png" alt="image" data-orig-height="276" data-orig-width="528" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion1.png"/></figure></a><center><address dir="ltr">Figure 1. Accordion’s write throughput compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf (heavy-tailed) and Uniform primary key distributions.</address></center>
<br/><p><a href="https://s.yimg.com/ge/default/691231/Accordion2.png" target="_blank"></a></p>
<a href="https://s.yimg.com/ge/default/691231/Accordion2.png" target="_blank"><figure class="tmblr-full" data-orig-height="238" data-orig-width="388" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion2.png"><img src="https://66.media.tumblr.com/334e9016e944e7a1b1e5c7722dc042e1/tumblr_inline_p7g6lbi6fK1t17fny_540.png" alt="image" data-orig-height="238" data-orig-width="388" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion2.png"/></figure></a><center><address dir="ltr">Figure 2. Accordion’s read latency quantiles compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf key distribution.</address></center>
<br/><p><a href="https://s.yimg.com/ge/default/691231/Accordion3.png" target="_blank"></a></p>
<a href="https://s.yimg.com/ge/default/691231/Accordion3.png" target="_blank"><figure class="tmblr-full" data-orig-height="217" data-orig-width="447" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion3.png"><img src="https://66.media.tumblr.com/79517f4b4f6d30493366569b87144272/tumblr_inline_p7g6lblWky1t17fny_540.png" alt="image" data-orig-height="217" data-orig-width="447" data-orig-src="https://s.yimg.com/ge/default/691231/Accordion3.png"/></figure></a><center><address dir="ltr">Figure 3. Accordion’s disk I/O compared to the legacy implementation. 100GB dataset, 100-byte values, 100% write workload. Zipf key distribution.</address></center>
<br/><p>Accordion is inspired by the <a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" target="_blank">Log-Structured-Merge (LSM) tree</a> design pattern that governs the HBase storage organization. An HBase region is stored as a sequence of searchable key-value maps. The topmost is a mutable in-memory store, called MemStore, which absorbs the recent write (put) operations. The rest are immutable HDFS files, called HFiles. Once a MemStore overflows, it is flushed to disk, creating a new HFile. HBase adopts <a href="https://blogs.apache.org/hbase/entry/apache_hbase_internals_locking_and" target="_blank">multi-versioned concurrency control</a> – that is, MemStore stores all data modifications as separate versions. Multiple versions of one key may therefore reside in the MemStore and the HFile tier. A read (get) operation, which retrieves the value by key, first scans the MemStore and then the HFile data (cached in the BlockCache), seeking the latest version. To reduce the number of disk accesses, HFiles are merged in the background. This process, called <i>compaction</i>, removes redundant cells and creates larger files.</p>
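<p>To make the layered layout concrete, here is a minimal, self-contained Java sketch of the read and flush paths described above. It is purely illustrative: the class name, constants, and data structures are invented for this post and are not HBase’s actual internals.</p>
<pre>
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentSkipListMap;

// Toy LSM store: writes land in a mutable in-memory map (the "MemStore");
// when it grows too large it is sealed into an immutable layer, standing in
// for an HFile. Reads probe the layers from newest to oldest, so the first
// hit is the latest version of the key.
public class TinyLsmStore {
    private static final int FLUSH_THRESHOLD = 4; // tiny, for demonstration

    // Mutable tier: plays the role of the MemStore.
    private ConcurrentSkipListMap&lt;String, String&gt; memStore = new ConcurrentSkipListMap&lt;&gt;();
    // Immutable tier: newest "file" first; plays the role of the HFiles.
    private final Deque&lt;ConcurrentSkipListMap&lt;String, String&gt;&gt; files = new ArrayDeque&lt;&gt;();

    public void put(String key, String value) {
        memStore.put(key, value);
        if (memStore.size() &gt;= FLUSH_THRESHOLD) {
            flush();
        }
    }

    public String get(String key) {
        String v = memStore.get(key);          // newest data first
        if (v != null) return v;
        for (ConcurrentSkipListMap&lt;String, String&gt; f : files) {
            v = f.get(key);                    // then files, newest to oldest
            if (v != null) return v;
        }
        return null;
    }

    // Flush: seal the current MemStore as an immutable layer and start fresh.
    private void flush() {
        files.addFirst(memStore);
        memStore = new ConcurrentSkipListMap&lt;&gt;();
    }
}
</pre>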
<p>LSM trees deliver superior write performance by transforming random application-level I/O into sequential disk I/O. However, their traditional design makes no attempt to compact the in-memory data. This stems from historical reasons: LSM trees were designed when RAM was in short supply, so the MemStore capacity was small. With recent changes in the hardware landscape, the overall MemStore size managed by a RegionServer can be multiple gigabytes, leaving a lot of headroom for optimization.</p>
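<p>For a sense of the headroom involved: the MemStore budget is governed by ordinary HBase configuration properties, normally set in hbase-site.xml. The snippet below sets them programmatically; the property names are standard HBase settings (shown with their usual defaults), but treat the values as examples rather than recommendations.</p>
<pre>
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MemStoreSizing {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Fraction of the RegionServer heap shared by all MemStores (default 0.4).
        conf.setFloat("hbase.regionserver.global.memstore.size", 0.4f);
        // Per-region MemStore size that triggers a flush to an HFile (default 128 MB).
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
    }
}
</pre>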
<p>Accordion reapplies the LSM principle to the MemStore in order to eliminate redundancies and other overhead while the data is still in RAM. The MemStore memory image therefore “breathes” (periodically expands and contracts), much like an accordion’s bellows. This work pattern decreases the frequency of flushes to HDFS, thereby reducing the write amplification and the overall disk footprint.</p>
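<p>A companion sketch to the toy store above shows the idea: instead of flushing the active segment straight to disk, segments are parked in an in-memory pipeline and merged there, discarding redundant versions while still in RAM, so the eventual disk flush writes only the compacted remainder. Again, all names are invented for illustration and do not mirror HBase’s classes.</p>
<pre>
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch of in-memory compaction (not HBase's actual code).
public class CompactingMemStoreSketch {
    private ConcurrentSkipListMap&lt;String, String&gt; active = new ConcurrentSkipListMap&lt;&gt;();
    // In-memory pipeline of sealed segments, oldest first.
    private final List&lt;ConcurrentSkipListMap&lt;String, String&gt;&gt; pipeline = new ArrayList&lt;&gt;();

    public void put(String key, String value) {
        active.put(key, value);
    }

    // "In-memory flush": seal the active segment into the pipeline, then compact.
    public void inMemoryFlush() {
        pipeline.add(active);
        active = new ConcurrentSkipListMap&lt;&gt;();
        compactInMemory();
    }

    // Merge all pipeline segments into one. Newer segments overwrite older
    // ones, so redundant versions of a key are dropped while still in RAM.
    private void compactInMemory() {
        ConcurrentSkipListMap&lt;String, String&gt; merged = new ConcurrentSkipListMap&lt;&gt;();
        for (ConcurrentSkipListMap&lt;String, String&gt; seg : pipeline) { // oldest first
            merged.putAll(seg); // later (newer) segments win
        }
        pipeline.clear();
        pipeline.add(merged);
    }

    // Disk flush now writes only the compacted remainder.
    public ConcurrentSkipListMap&lt;String, String&gt; snapshotForDiskFlush() {
        compactInMemory();
        ConcurrentSkipListMap&lt;String, String&gt; out = pipeline.isEmpty()
            ? new ConcurrentSkipListMap&lt;&gt;() : pipeline.remove(0);
        out.putAll(active);
        active = new ConcurrentSkipListMap&lt;&gt;();
        return out;
    }
}
</pre>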
<p>With fewer flushes, write operations stall less frequently on MemStore overflow, and as a result, write performance improves. Less data on disk also means less pressure on the block cache, higher hit rates, and eventually better read response times. Finally, fewer disk writes also mean less compaction happening in the background, i.e., fewer cycles stolen from productive (read and write) work. All in all, in-memory compaction acts as a catalyst that enables the system as a whole to move faster.</p>
<p>Accordion currently provides two levels of in-memory compaction: <b>basic</b> and <b>eager</b>. The former applies generic optimizations that help all data update patterns. The latter is most useful for applications with high data churn, such as producer-consumer queues, shopping carts, and shared counters. All these use cases feature frequent updates of the same keys, which generate many redundant versions that the algorithm exploits to provide more value. Future implementations may tune the optimal compaction policy automatically.</p>
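<p>In HBase 2.0 the compaction level can be chosen per column family. The sketch below uses the HBase 2.0 client API as we understand it (ColumnFamilyDescriptorBuilder and the MemoryCompactionPolicy enum); the table and family names are made up for illustration.</p>
<pre>
import org.apache.hadoop.hbase.MemoryCompactionPolicy;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class AccordionPolicyExample {
    public static void main(String[] args) {
        // Hypothetical family "cart": a high-churn use case, so choose EAGER;
        // BASIC is the safer choice for general update patterns.
        ColumnFamilyDescriptor cf = ColumnFamilyDescriptorBuilder
            .newBuilder(Bytes.toBytes("cart"))
            .setInMemoryCompaction(MemoryCompactionPolicy.EAGER)
            .build();
        TableDescriptor table = TableDescriptorBuilder
            .newBuilder(TableName.valueOf("shopping"))
            .setColumnFamily(cf)
            .build();
        // The cluster-wide default is set via the
        // hbase.hregion.compacting.memstore.type property in hbase-site.xml,
        // with values NONE, BASIC, or EAGER.
    }
}
</pre>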
<p>Accordion replaces the default MemStore implementation in the production HBase code. Contributing its code to production HBase could not have happened without intensive work with the open source Hadoop community, with contributors spread across companies, countries, and continents. The project took almost two years to complete, from inception to delivery.</p>
<p>Accordion will become generally available in the upcoming HBase 2.0 release. We can’t wait to see it power existing and future products at Yahoo and elsewhere.</p>
</blockquote>