Omid Architecture and Protocol
<p><i>By Edward Bortnikov, Idit Keidar (Search Systems, Yahoo Labs), and Francisco Perez-Sorrosal (Yahoo Search)</i></p><hr><p>In our <a href="http://yahoohadoop.tumblr.com/post/129089878751/introducing-omid-transaction-processing-for" target="_blank">previous post</a>, we introduced <a href="https://github.com/yahoo/omid" target="_blank">Omid</a>, Yahoo’s new efficient and scalable transaction processing platform for Apache HBase. In this post, we first overview Omid’s architecture concepts, and then delve into the system design and protocols.<br/></p><p>Omid’s client API offers abstractions both for control and for data access. The abstractions are offered by a client library, and transaction consistency is ensured by a central entity called Transaction Status Oracle (TSO), whose operation we explain in the post. The Omid client library accesses the data directly in HBase, and interacts with the TSO only to begin, commit or rollback transactions. This separation between the control and the data planes is instrumental for system scalability. </p><p><b></b></p><p>For simplicity, we defer discussion of the TSO’s reliability and internal scalability to future posts; for now, let us assume that this component scales infinitely and never fails.</p><p><b><br/></b></p><p><b>Architecture Overview</b><br/></p><p><b></b></p><p>As we detailed in the <a href="http://yahoohadoop.tumblr.com/post/129089878751/introducing-omid-transaction-processing-for" target="_blank">previous blog post</a>, Omid provides a lock-free<a href="http://research.microsoft.com/apps/pubs/default.aspx?id=69541" target="_blank"> Snapshot Isolation</a> (SI) implementation that scales far better than traditional two-phase locking approaches. Namely, transactions can execute concurrently until commit, at which time write-write conflicts are resolved. Ties between two transactions that overlap both in time and in space (so committing both of them would violate SI) are broken by aborting one of them, usually the one that attempts to commit later.</p><p>The simplest way to break ties is via a central arbiter that serializes all commit requests and resolves conflicts based on this order. Distributed implementations, for example,<a href="https://en.wikipedia.org/wiki/Two-phase_commit_protocol" target="_blank"> two-phase commit</a>, are more expensive, complex, and error-prone. Omid therefore takes the simpler, centralized approach. It employs a centralized management service for transactions, the Transaction Status Oracle, or TSO, which coordinates the actions of clients. The main task of the TSO is to detect write-write conflicts among concurrent transactions, as needed for ensuring SI semantics. Similarly to any centralized service, the TSO is vulnerable to becoming a single-point-of-failure and a performance bottleneck. We will discuss Omid’s approach to high availability (HA) and scalability in a forthcoming post. Here, we describe only the operation of a single TSO in failure-free scenarios.<br/></p><p><b></b></p><p>Omid leverages the multi-versioning support in the underlying HBase key-value store, which allows transactions to read consistent snapshots of changing data as needed for SI. Specifically, when the item associated with an existing key is overwritten, a new version (holding the key, its new value, and a new version number) is created while the previous version persists. An old version might be required as long as there is some active transaction that had begun before the transaction that overwrote this version has committed. Though this may take a while, overwritten versions eventually become obsolete. Omid takes advantage of HBase’s <a href="https://blogs.apache.org/hbase/entry/coprocessor_introduction" target="_blank">coprocessors</a> to implement a garbage-collecting algorithm<a href="https://blogs.apache.org/hbase/entry/coprocessor_introduction" target="_blank"> </a>in order to free up the disk space taken up by such obsolete versions when doing <a href="https://hbase.apache.org/book.html#_compaction" target="_blank">compactions</a>.</p><p>In addition to storing the application data, HBase is further used for persistent storage of transaction-related metadata, which is accessed only by the transactional API and not exposed to the user, as will be described shortly.<br/></p><p>One of the functions of the TSO is generating version numbers (timestamps) for all client transactions. This is achieved via a subcomponent, the so-called Timestamp Oracle, that implements a central logical clock. In order to preserve correctness in shutdown/restart scenarios, the Timestamp Oracle maintains an upper bound (maximum timestamp) of this clock in a reliable repository, which can be either an HBase table or<a href="http://zookeeper.apache.org/" target="_blank"> a znode in Apache Zookeeper</a>.</p><p>The following diagram summarizes Omid’s system components and the interactions among them. Note that the TSO is only involved in the control path (for transaction begin/commit/rollback), whereas the Omid clients interact with HBase directly in the data path. This separation is paramount for scalability.</p><figure data-orig-width="975" data-orig-height="586" class="tmblr-full"><img src="https://66.media.tumblr.com/6035e913dbc52cf97720db9e94227d1a/tumblr_inline_nxf311eCNN1t17fny_540.png" alt="image" data-orig-width="975" data-orig-height="586"/></figure><p><b></b></p><blockquote><p><i><b>Fig 1:</b> Omid components. Omid clients use the TSO to create transactional contexts. Clients also allow to access data that resides in data tables in HBase transactionally. The TSO is in the control path for conflict detection when transactions are completed. Data is multi-versioned and a garbage-collecting coprocessor cleans up obsolete versions. The TSO and the Timestamp Oracle maintain some persistent and transient metadata in HBase (although it can be also stored in other storage systems).</i></p></blockquote><p><i><br/></i></p><p><b>Data and Metadata</b><br/></p><p><b></b></p><p>As noted above, user data resides in HBase and is multi-versioned. An item’s version number is the transaction identifier, txid, of the transaction that wrote it. The txid is returned by the TSO in response to a begin call.</p><p>Omid exploits HBase also for storing persistent metadata, which comes in two flavors. First, it augments each data item with a shadow cell, which indicates the commit status of the transaction that wrote it. Initially, when an item is written during a transaction, its shadow cell is set to tentative, i.e., potentially uncommitted. At commit time, the client obtains from the TSO the transaction’s commit timestamp (commit ts) and writes this timestamp to the shadow cells of its writeset, which contains all the items written by the transaction. In addition, Omid manages a commit table (CT) tracking the commit timestamps of transactions. The data maintained in the CT is transient, being removed by the client when the transaction completes.</p><p>The diagram below summarizes Omid’s data model and flow.<br/></p><figure data-orig-width="975" data-orig-height="571" class="tmblr-full"><img src="https://66.media.tumblr.com/653dd4a7bfce7c03a63ad0189d0af5a5/tumblr_inline_nxf404JoUi1t17fny_540.png" alt="image" data-orig-width="975" data-orig-height="571"/></figure><p><b></b></p><blockquote><p><i><b>Fig 2:</b> Omid data model. Clients use the TSO to obtain a transaction identifier (txid) when they begin a transaction, and a commit timestamp (commit ts) when they commit it. Data is stored in HBase with the txid as the version number, and the commit ts in a shadow cell. Before the commit ts is set, the written data is tentative, and the client consults the Commit Table to determine its status.</i></p></blockquote><p><br/></p><p><b>Transaction Protocol Overview</b><br/></p><p><br/>The begin API produces a unique txid, which is used by all subsequent requests. In Omid, txid also serves as the read (snapshot) timestamp. The commit API produces a commit ts, which determines the transaction’s order in the sequence of committed transactions. Both timestamps are based on a logical clock maintained by the Timestamp Oracle.<br/></p><p>Recall from our <a href="http://yahoohadoop.tumblr.com/post/129089878751/introducing-omid-transaction-processing-for" target="_blank">previous post </a>that SI allows transactions to appear to execute all reads at one logical point and all writes at another (later) point. In Omid, the txid is the time of the logical clock when the transaction begins, and it determines which versions the transaction will read; the commit ts, on the other hand, is the logical time during commit, and all the transaction’s writes are associated with this time (via the shadow cells).</p><p>Since transaction commit needs to be an atomic step, an application triggering a transaction: first tentatively writes all its information to data items without a commit timestamp in the corresponding shadow cells through the Omid client API (e.g. using transactional put or delete operations); then atomically commits it (in case there are no conflicts) via the TSO; and finally, once the client has received the commit ack from the TSO, it updates the shadow cells of these data items to include its commit ts. A transaction is considered complete once it finishes updating the shadow cells of all items in its writeset. Only at that point, the control is returned to the application.</p><p>The post-commit completion approach creates a window when the transaction is already committed but its writes are still tentative. However, a client may be delayed or even fail after committing a transaction and before completing it. Consider an incomplete committed transaction T. Transactions that begin during T’s completion phase obtain a txid that is larger than T’s commit ts, and yet during their operation, they may encounter in HBase data tables some items with tentative writes by T and others with complete writes by T. In order to ensure that such transactions see T’s updates consistently, Omid tracks the list of incomplete committed transactions in a persistent Commit Table (CT), which is also stored in HBase. Each entry in the CT maps a committed transaction’s id to its commit timestamp. The act of writing the (txid, commit ts) pair to the CT makes the transaction durable, regardless of subsequent client failures, and is considered the commit point of the transaction. </p><p>Transactions that encounter tentative writes during their reads refer to the CT in order to find out whether the value has been committed or not. In case it has, they help complete the write. This process is called healing, and is an optimization that might reduce accesses to the commit table by other transactions. </p><figure data-orig-width="950" data-orig-height="736" class="tmblr-full"><img src="https://66.media.tumblr.com/3be4620a079c9733bba39d5d23774398/tumblr_inline_nxf4c9gjly1t17fny_540.png" alt="image" data-orig-width="950" data-orig-height="736"/></figure><blockquote><p><i>Fig 3: Omid Transaction Flow (TX = Transaction; TS=Timestamp)</i></p></blockquote><p><b><br/></b></p><p><b>Client-Side Operation</b><br/></p><p><b></b></p><p>The Omid client depicted in the previous figure executes the following actions in the name of a client application using transactions:</p><p><b></b></p><p><i>Begin:</i> The client obtains from the TSO a start timestamp that exceeds all the write timestamps of committed transactions. This timestamp becomes the transaction identifier (<i>txid</i>). It is also used to read the appropriate versions when reading data. </p><p><b></b></p><p><i>Get(txid, key):</i> The get operation performs the following actions (in pseudo-code):</p><blockquote><p>scan versions of key that are lower than <i>txid</i>, highest to lowest</p><p> if version is not tentative</p><p> if its commit ts does not exceed txid, return its value</p><p>else, lookup the version’s id in CT </p><p> if present (the writing transaction has committed), </p><p> update the <i>commit ts</i> (healing process) </p><p> if commit ts does not exceed txid return the value<br/></p><p> else, re-read the version and return the value if no longer tentative and commit ts does not exceed txid. </p><p>In case no value has been returned, continue to the next version.</p></blockquote><p><b></b></p><p><i>Put(txid, key/value):</i> Adds a new tentative version of the key/value pair and the <i>txid</i> as the version. </p><p><b></b></p><p><i>Commit(txid, writeset):</i> The client requests commit from the TSO for its <i>txid</i>, and provides in <i>writeset</i> the set of keys it wrote to. The TSO assigns it a new commit timestamp and checks for conflicts for the transaction’s <i>writeset</i>. If there are none, it commits the transaction by writing the <i>(txid, commit ts)</i> pair to the CT. Then the TSO returns the control to the client providing also this <i>(txid, commit ts) </i>pair. Finally, the client adds the <i>commit ts </i>to all data items it wrote to (so its writes are no longer tentative) and deletes the <i>txid</i> entry from the CT.</p><p><b><br/></b></p><p><b>Summary</b></p><p>At Yahoo, we have recently <a href="https://github.com/yahoo/omid" target="_blank">open-sourced the Omid project</a>, a transactional framework for HBase. In this blog post we discussed the architecture and protocols behind Omid. We’ve shown the main components involved and we’ve described their interactions, which enable our framework to provide transactions on top of HBase in an efficient and scalable manner.<br/></p><p>The system has been implemented with HBase in mind, and therefore our presentation is in HBase terms. That said, the design principles are generic and database-neutral. Omid can be adapted to work with any persistent key-value store with multi-version concurrency control. </p>