High Availability in Omid
<p><i>By Edward Bortnikov, Idit Keidar, Ohad Shacham (Search Systems, Yahoo Labs), and Francisco Perez-Sorrosal (Yahoo Search)</i><br/></p><hr><p>Omid, discussed in detail in our <a href="http://yahoohadoop.tumblr.com/post/129089878751/introducing-omid-transaction-processing-for" target="_blank">previous posts</a>, offers transactional access to data persistently stored in HBase. Here, we explain how Omid is made highly available (HA). Omid’s availability is obviously critical for the smooth operation of its applications, and should thus not be inferior to the availability guarantees of the underlying HBase store. High availability is a brand new feature in Omid. </p>
<p>In very-high-end Omid-powered applications, the combination of Omid and HBase is expected to work round the clock and exhibit a mean-time-to-recover (MTTR) of just a few seconds. Moreover, any measures taken for high availability should not hamper performance in the normal, fault-free case.</p>
<p>Omid supports both HA and non-HA modes. The latter serves settings in which the system administrator prefers manual recovery and a longer MTTR can be tolerated; for example, these can be non-critical infrastructures where the additional resources for running a backup TSO cannot be spared. </p>
<p><b><br/></b></p><p><b>High Availability via Primary-Backup Architecture</b><br/></p>
<p>Omid is designed around a centralized transaction processing service (transaction status oracle, or TSO), which is responsible for serializing transaction commit points and resolving inter-transaction conflicts. This design renders the TSO critical for the entire system’s availability. Our focus is thus on the high-availability architecture behind the TSO. Like most HA solutions, it is expected to satisfy two requirements: (1) low MTTR, and (2) negligible impact on the system’s mainstream (failure-free) operation.</p>
<p>Omid’s HA solution is based on the primary-backup paradigm: the TSO is implemented as a process pair consisting of a primary process and a backup process. The former serves all client requests, whereas the latter is in hot-standby mode, ready to take over if the primary fails. The process of transferring the client-serving responsibilities from the primary to the backup is called failover. Failure detection is timeout-based – namely, if the primary TSO does not re-assert its existence within a configured period, it is deemed failed, and the backup starts acting as the new primary. </p>
<p>Note that the primary and backup run independently on different machines, and the time it takes the primary to inform the backup that it is alive can be unpredictable due to processing delays (e.g., garbage-collection stalls, long I/O operations) and unexpected network failures. On the other hand, in order to provide a low MTTR, we cannot set the timeout so conservatively as to ensure that a live primary is never detected as faulty. We therefore have to account for the case in which the backup performs a failover and takes over the service while the primary is still operational. To this end, we use a Zookeeper object to track the current primary. The primary regularly re-asserts itself, unless it sees that it has been supplanted; the backup constantly tracks this object, and if the current primary becomes stale, updates the object to reflect the fact that it is now the primary.</p>
<p>The primary TSO advertises its identity to clients, also via Zookeeper. This way, the Omid library learns about the new primary upon failover and facilitates reconnection. To avoid data corruption, client applications must also learn the outcome of any commit requests that were pending at the old primary before retrying those transactions.</p>
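<p>As a rough illustration of this coordination pattern, the sketch below shows a primary that re-asserts itself through a version-conditioned update of a shared znode and halts once it discovers it has been supplanted, and a backup that claims the primary role only if that znode has gone stale. This is a simplified sketch rather than Omid’s actual code: the znode path and the identity format are made up for the example, and Omid’s real mechanism is the lease-based scheme described later in this post.</p>
<pre><code>import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

/**
 * Illustrative sketch (not Omid's code) of primary/backup coordination over a
 * single shared znode. The path and the data format are assumptions.
 */
public class TsoFailoverSketch {

    private static final String CURRENT_TSO = "/omid/current-tso"; // hypothetical path

    private final ZooKeeper zk;
    private final String myId; // e.g., "host:port", which clients also read to connect

    public TsoFailoverSketch(ZooKeeper zk, String myId) {
        this.zk = zk;
        this.myId = myId;
    }

    /** Primary: periodically re-assert liveness; halt if supplanted. */
    public void reassertOrHalt() throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        byte[] data = zk.getData(CURRENT_TSO, false, stat);
        if (!myId.equals(new String(data, StandardCharsets.UTF_8))) {
            halt();   // another TSO took over: "commit suicide"
            return;
        }
        try {
            // Re-writing the znode bumps its version, which the backup reads as a sign
            // of life; the version condition prevents overwriting a concurrent takeover.
            zk.setData(CURRENT_TSO, data, stat.getVersion());
        } catch (KeeperException.BadVersionException e) {
            halt();   // lost the race to a new primary
        }
    }

    /** Backup: claim the primary role only if the znode is as stale as last observed. */
    public boolean tryTakeOver(Stat lastObserved) throws KeeperException, InterruptedException {
        try {
            zk.setData(CURRENT_TSO, myId.getBytes(StandardCharsets.UTF_8),
                       lastObserved.getVersion());
            return true;    // the primary did not re-assert itself in time: we are primary now
        } catch (KeeperException.BadVersionException e) {
            return false;   // the primary re-asserted itself after all
        }
    }

    private void halt() {
        Runtime.getRuntime().halt(1); // stop serving immediately
    }
}</code></pre>
<p>The version-conditioned writes are what resolve the race: if the old primary and the backup act concurrently, exactly one of the two conditional updates can succeed.</p>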
<p><b><br/></b></p><p><b>The Failover Challenge</b><br/></p>
<p>A reliable system must honor all the operations successfully completed in the past, regardless of failovers. Namely, if a transaction receives a success response to its commit request, then future transactions must observe its updates. On the other hand, if a transaction aborts for whatever reason, then no future transaction should see its updates.<br/></p>
<p>In Omid, the TSO allocates monotonically increasing commit timestamps to committing transactions. In addition, when a transaction begins, it obtains a read timestamp, which reflects the commit timestamp of the last transaction to commit before it began. The transaction then reads the latest version of each object whose timestamp does not exceed its read timestamp. As explained in our <a href="http://yahoohadoop.tumblr.com/post/129089878751/introducing-omid-transaction-processing-for" target="_blank">first post</a>, this yields the correctness semantics known as snapshot isolation. The critical part of system state that affects correctness is the persistent commit table (CT), which reliably stores the mapping from transaction ids (txid) to commit timestamps (commit ts). The state recorded in the CT captures the system’s guarantee to its clients. As described in the <a href="http://yahoohadoop.tumblr.com/post/132695603476/omid-architecture-and-protocol" target="_blank">previous post</a>, a transaction is committed if and only if a (txid, commit ts) pair for it exists in the CT. Today, we will scrutinize this approach in a failure-prone world. </p>
<p>The key challenge faced by HA systems is known in distributed systems theory as split brain: the risk of conflicting updates occurring independently in distinct places. In primary-backup systems, split brain manifests when the backup detects the primary as faulty while the latter is either still operating or its operations are still in the process of taking effect. If treated naively, such lack of synchronization may lead to race conditions that ultimately compromise the system’s correctness.</p>
<p>Let us take a closer look at this challenge. There are scenarios in which the primary TSO can be falsely detected as failed, for example, due to Java garbage-collection stalls. The system can therefore end up with two concurrent TSOs. To limit this exposure, the primary TSO actively checks whether a backup has replaced it, and if so, “commits suicide”, i.e., halts. However, there may still be a (short) window between the failover and the primary’s discovery that a new primary has emerged. </p>
<p>When a TSO fails, there may be some pending transactions that began with it (i.e., issued their begin request to this TSO) and did not commit: they might not have attempted to commit yet, or they might have attempted to commit with the old TSO, which did not complete logging their commit in the CT. Such pending transactions are deemed aborted. </p>
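<p>To make the commit table’s role concrete, here is a sketch of the client-side lookup this rule implies: a transaction is committed if and only if a (txid, commit ts) pair for it is present in the CT, and its updates belong to a reader’s snapshot only if that commit timestamp does not exceed the reader’s read timestamp. The HBase table layout and column names below are assumptions made for the illustration; Omid’s actual CT schema and client code differ in their details.</p>
<pre><code>import java.io.IOException;
import java.util.OptionalLong;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Sketch of a client-side commit table (CT) lookup. The column family and
 * qualifier names are invented for illustration; Omid's real CT schema differs.
 */
public class CommitTableLookupSketch {

    private static final byte[] CF        = Bytes.toBytes("ct");
    private static final byte[] COMMIT_TS = Bytes.toBytes("commit_ts");

    private final Table commitTable; // handle to the HBase-backed CT

    public CommitTableLookupSketch(Table commitTable) {
        this.commitTable = commitTable;
    }

    /** A transaction is committed iff a (txid, commit ts) pair for it exists in the CT. */
    public OptionalLong commitTimestamp(long txId) throws IOException {
        Result r = commitTable.get(new Get(Bytes.toBytes(txId)));
        byte[] value = r.getValue(CF, COMMIT_TS);
        return value == null ? OptionalLong.empty() : OptionalLong.of(Bytes.toLong(value));
    }

    /** Snapshot isolation: an update is visible iff it committed within the reader's snapshot. */
    public boolean visibleTo(long txId, long readTs) throws IOException {
        OptionalLong commitTs = commitTimestamp(txId);
        if (!commitTs.isPresent()) {
            return false;   // not in the CT: pending or aborted, hence not visible
        }
        return readTs >= commitTs.getAsLong();
    }
}</code></pre>
<p>A lookup miss only tells the reader that the transaction has not committed yet; how such pending transactions are handled safely across a failover is the subject of the rest of this post.</p>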
<p>To prevent new transactions from seeing partial updates of transactions handled by the old TSO, the new TSO needs to employ timestamps that exceed all those committed (or that might still be committed) by the old TSO. However, this separation is challenged by the potential concurrency of two TSOs. For example, if a TSO fails immediately after issuing a write to the CT, and that write takes nonzero time to complete, an old transaction may end up committing after the new TSO has begun handling new transactions. Unless handled carefully, this can cause a new transaction to see partial updates of an old one, as illustrated in the diagram below. To avoid this scenario, we must ensure that once a new transaction obtains a read timestamp, the commit/abort status of all transactions with smaller commit timestamps does not change. </p><figure data-orig-width="633" data-orig-height="440" class="tmblr-full"><img src="https://66.media.tumblr.com/94e140b60c0b5f921ffa2801af419453/tumblr_inline_o21if4DSNr1t17fny_540.png" data-orig-width="633" data-orig-height="440"/></figure><figure data-orig-width="468" data-orig-height="16" class="tmblr-full"><img src="https://66.media.tumblr.com/6fbbfa3e2d9fa3c613c571fc28f4b1c8/tumblr_inline_o21g7575ek1t17fny_540.png" alt="image" data-orig-width="468" data-orig-height="16"/></figure><p><br/></p>
<p>One way to address the above challenge is via mutual exclusion, that is, making sure that at most one TSO commits operations at a time. However, this solution would entail synchronization upon each commit, not only at failover times, which would adversely affect the system’s performance. We therefore forgo this option, and implement a different HA algorithm in Omid. This algorithm does not incur any penalty in failure-free scenarios.</p>
<p><b><br/></b></p><p><b>HA Algorithm</b><br/></p>
<p>The failover algorithm in Omid tolerates temporary overlap between the primary and backup TSOs’ activity periods. To ensure correctness despite such overlap, we first have to ensure that the transactions committed by the old TSO and the new TSO are safely separated in time. Namely, (1) all the timestamps assigned by the new TSO exceed all those assigned by the old one, and (2) after a transaction with read timestamp ts<sub>r</sub> begins, no transaction that will end up with a commit timestamp ts<sub>c</sub> < ts<sub>r</sub> can update any additional data items (though it may still commit after this time). Beyond that, we have to allow the new TSO to safely figure out the status of pending transactions served by the old TSO. Recall from our <a href="http://yahoohadoop.tumblr.com/post/132695603476/omid-architecture-and-protocol" target="_blank">previous post</a> that in Omid, transactions write their updates tentatively before committing, and upon commit, update the written entries with their commit timestamp. Therefore, our failover mechanism has to ensure that (3) when a transaction reads a tentative update, it can determine whether this update will be committed with a timestamp smaller than its read timestamp or not.</p>
<p>One way to meet (1) and (2) is to have the TSO publish the read timestamp it allots as part of initiating a transaction (e.g., via Zookeeper). Before committing, a TSO would check this location. If it detects a timestamp greater than the last one it committed, it would deduce that a failover has happened, abort the transaction attempting to commit, and halt. This approach is plausible but would impose synchronization overhead on every begin and commit operation. Instead, the HA algorithm implemented in Omid uses locally-checkable leases. Leases are essentially locks that live for a limited time. With them, we can both detect TSO failures and allocate timestamp ranges in big chunks, thereby eliminating the synchronization overhead most of the time.</p>
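<p>The sketch below conveys the idea of a locally-checkable lease combined with chunked timestamp allocation: the fast path that serves begin and commit requests consults only local state, and coordination happens once per chunk. For brevity, the sketch ties lease renewal to epoch allocation and uses an in-process counter as a stand-in for the shared state; in Omid these concerns are handled via two separate Zookeeper znodes, as described in the implementation details below. The constants and names are illustrative assumptions.</p>
<pre><code>import java.util.concurrent.atomic.AtomicLong;

/**
 * Sketch of a locally-checkable lease guarding a chunk ("epoch") of timestamps.
 * Sizes, lease duration, and the AtomicLong stand-in are illustrative; in Omid
 * the epoch and the lease are coordinated through shared Zookeeper znodes.
 */
public class LeaseAndEpochSketch {

    private static final long EPOCH_SIZE   = 1_000_000L; // timestamps claimed per renewal
    private static final long LEASE_MILLIS = 10_000L;    // lease lifetime

    // Stand-in for the shared epoch counter (a Zookeeper znode in the real system).
    private final AtomicLong sharedEpochCounter = new AtomicLong(1);

    private long leaseExpiry;   // local clock reading at which our lease runs out
    private long nextTimestamp; // next timestamp to hand out
    private long epochEnd;      // exclusive upper bound of the chunk we own

    public LeaseAndEpochSketch() {
        claimNextEpoch(System.currentTimeMillis()); // claim an initial chunk and lease
    }

    /** Fast path on begin/commit: a purely local check, no coordination. */
    public synchronized long nextTimestampOrHalt() {
        long now = System.currentTimeMillis();
        if (now >= leaseExpiry) {
            // The backup may already have taken over; stop serving immediately.
            throw new IllegalStateException("lease expired; halting TSO");
        }
        if (nextTimestamp >= epochEnd) {
            claimNextEpoch(now); // rare slow path
        }
        return nextTimestamp++;
    }

    /** Slow path: claim the next chunk of timestamps and renew the lease. */
    private synchronized void claimNextEpoch(long now) {
        // Claims hand out disjoint, increasing ranges. In Omid, epoch ownership is
        // claimed through a shared znode so that, upon failover, the new primary can
        // pick an epoch that ensures property (1); see the implementation details below.
        long epochStart = sharedEpochCounter.getAndAdd(EPOCH_SIZE);
        nextTimestamp = epochStart;
        epochEnd = epochStart + EPOCH_SIZE;
        leaseExpiry = now + LEASE_MILLIS;
    }
}</code></pre>
<p>The point of chunking is that the coordination step, a conditional update of shared state in the real system, is amortized over an entire epoch of timestamps, so the failure-free path remains purely local.</p>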
<p>The challenge of meeting (3) is that transactions cannot consult the old TSO process, as it might have failed. In order to prevent in-flight writes of the old TSO to the CT from “changing history” retroactively, we allow transactions served by the new TSO to proactively abort ones coming from the previous TSO. Specifically, when a read encounters a tentative update by a transaction that is not present in the CT, it forces that transaction to abort. We call this invalidation and illustrate it in the following figure. Invalidation is used judiciously, only when a failover might be taking place, as discussed in the next section of this post.</p><figure class="tmblr-full" data-orig-height="437" data-orig-width="639"><img src="https://66.media.tumblr.com/eb5bee642122ce89cb26a6a68a557824/tumblr_inline_o21ifp6uYy1t17fny_540.png" data-orig-height="437" data-orig-width="639"/></figure><figure data-orig-width="468" data-orig-height="16" class="tmblr-full"><img src="https://66.media.tumblr.com/c5dcb9da7ce5ec8c7e5e83573bce3d2a/tumblr_inline_o21g8h7ctA1t17fny_540.png" alt="image" data-orig-width="468" data-orig-height="16"/></figure><p><br/></p>
<p>Technically, the client performs the invalidation using an atomic read-modify-write (RMW) operation (of the put-if-absent flavor) on the CT, which adds an attribute to the CT record marking the incomplete transaction as having an “invalid” status. Any subsequent attempt to commit it (by adding it to the CT) will see this record, and thus fail. In addition, every read of a tentative update must check its invalid field in the CT, and ignore the update if the transaction has already been invalidated.</p>
<p><b><br/></b></p><p><b>Implementation Details</b><br/></p>
<p>Let us now dive into some implementation details and see how they guarantee the system’s safety. The two TSO processes, namely the primary and the backup, coordinate their actions via two shared Zookeeper znodes. One is used to allocate timestamp ranges called epochs. A TSO claims ownership of an epoch before allocating timestamps from this range to transactions. Upon failover, the new primary picks the next epoch in a way that ensures property (1) above.</p>
<p>The second znode implements the lease. The lease is active for