Sherpa update

In June 2009 we announced the launch of Yahoo!’s distributed Cloud storage platform, called Sherpa. Sherpa is the next-generation structured-record distributed storage service in Yahoo! that addresses scalability, availability and latency needs of Yahoo! websites.

Sherpa is horizontally scalable, and scales elastically based on capacity demands. Sherpa is globally deployed within Yahoo!, and data is replicated asynchronously (typically in under a second) across data centers. Sherpa provides different consistency models, and applications can use consistency knobs and APIs to deal with data that is replicated asynchronously.

In the last year, we have added a number of interesting (and useful) features to Sherpa that have helped us serve the needs of Yahoo! better.


DOT: Distributed Ordered Tables

Many Web applications deploy MySQL (or some other relational database) and tend to use it as a key-value store. As new NoSQL data-stores become widespread in the open-source community, developers are moving their existing or new applications to these NoSQL data-stores. At Yahoo!, as we help migrate such applications, the typical question is: Can Sherpa replace MySQL?

Yes, in certain cases. When data access is based on primary key (PK) alone, it is very easy to map SQL queries to Sherpa Web service requests. For example, a simple SQL statement for record retrieval would be “SELECT (*|col1,col2,…) FROM table WHERE pkcol = pkval;” The corresponding request in Sherpa is “$curl http://…/V1/get/table/pkval[field=col1&field=col2]”.

However, applications typically have more than single-key access patterns. An application may want to access a set of records by PK prefix, or to have the results sorted on the PK. For example, retrieve the latest 10 status updates by user “P. P. S. Narayan”. Such queries are not (easily) possible using a distributed hash table. Hence, we recently introduced distributed ordered tables (DOT) in Sherpa.

Using DOT, applications can access data clustered on the PK. For example, using the “$curl http://.../V1/ordered_scan/otable?start-key=value1&end-key=value2&order=desc” request, an application can get records in the PK range from “value1” to “value2”. In SQL this would be expressed as “SELECT * FROM otable WHERE pkcol BETWEEN 'value1' AND 'value2' ORDER BY pkcol DESC”. Similarly, “$curl http://.../V1/ordered_scan/otable?prefix=value-prefix&order=desc” can be used in place of a prefix LIKE query in SQL.
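As a rough illustration (not an official Sherpa client), the two ordered_scan requests above could be assembled like this. The host name `sherpa.example` and the helper `ordered_scan_url` are made up for the example; only the path and the `start-key`, `end-key`, `prefix` and `order` parameters come from the requests shown in the text.

```python
from urllib.parse import urlencode


def ordered_scan_url(base_url, table, *, start_key=None, end_key=None,
                     prefix=None, order="asc"):
    """Build an ordered_scan URL in the shape shown in the text.

    Hypothetical helper: the real service may accept other parameters
    or combinations that this sketch does not know about.
    """
    params = {}
    if prefix is not None:
        params["prefix"] = prefix
    if start_key is not None:
        params["start-key"] = start_key
    if end_key is not None:
        params["end-key"] = end_key
    params["order"] = order
    return f"{base_url}/V1/ordered_scan/{table}?{urlencode(params)}"


BASE = "http://sherpa.example"  # placeholder host, not a real endpoint

# Range scan: SELECT * ... WHERE pkcol BETWEEN 'value1' AND 'value2' ORDER BY pkcol DESC
print(ordered_scan_url(BASE, "otable", start_key="value1",
                       end_key="value2", order="desc"))

# Prefix scan: the counterpart of a prefix LIKE query
print(ordered_scan_url(BASE, "otable", prefix="value-prefix", order="desc"))
```

Using `urlencode` means keys containing reserved characters are escaped rather than pasted into the URL by hand.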

Under the covers, Sherpa ordered tables are sharded by PK ranges rather than by hash ranges of the PK (as in the DHT). In the case of ordered tables, the distribution of primary keys is generally not uniform; the sizes of the shards and the load on them could vary dramatically. This makes sharding and load balancing a key challenge for DOT. We decided that the best way to address this is to balance the load automatically, via YAK.
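To make the difference concrete, here is a small sketch contrasting the two placement schemes. It is not Sherpa's implementation: the shard count, the split points and the use of MD5 are all assumptions chosen for the example.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def hash_shard(key: str) -> int:
    """DHT-style placement: hash the key, then reduce modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


# Hypothetical split points for an ordered table. Shard 0 holds keys < "g",
# shard 1 holds ["g", "n"), shard 2 holds ["n", "t"), shard 3 holds keys >= "t".
SPLIT_POINTS = ["g", "n", "t"]


def range_shard(key: str) -> int:
    """Ordered placement: binary-search the split points for the owning shard."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def shards_for_scan(start_key: str, end_key: str) -> list:
    """A range scan only needs the contiguous shards covering the key range."""
    return list(range(range_shard(start_key), range_shard(end_key) + 1))


# One user's status updates share a key prefix.
keys = [f"narayan:{i:04d}" for i in range(100)]

print("range shards used:", sorted({range_shard(k) for k in keys}))
print("hash shards used: ", sorted({hash_shard(k) for k in keys}))
print("scan touches:     ", shards_for_scan("narayan:0000", "narayan:9999"))
```

With range sharding every key for this user lands on a single shard, which is what makes the ordered scan cheap and is also why one popular prefix can turn a shard into a hotspot. Hashing places keys without regard to their order, so a range scan cannot be served from a contiguous set of shards.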

YAK: Moving the load

Sherpa now has an automated load balancer, called YAK, which is responsible for detecting hotspots (highly loaded storage servers) and moving shards to lightly loaded servers. YAK does both load balancing and re-partitioning automatically, while tracking load across the hundreds of servers in multiple data centers. Each storage server keeps a history of load and capacity metrics such as latencies, number of requests, error rate, number of records, and size, broken down along dimensions such as shard and table. Based on these metrics, YAK calculates a “heat” metric for each server as well as for each shard on the server.

YAK has a rules-engine that uses the metrics to make load balancing decisions. The rules-engine picks servers that are hot and attempts to shift load to “cold” servers by moving shards on-the-fly. Thus, it aims to keep the heat of every server in the cluster near the average. In some cases, moving a shard from one server to another may make the destination server too hot. So, YAK may decide to re-partition a shard into smaller units of movable load.
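The move-or-split decision can be sketched in a few lines. This is only an illustration of the idea, not YAK's actual rules: the single numeric heat per shard, the 25% tolerance, the "largest shard that fits" choice and the even split are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    shards: dict = field(default_factory=dict)  # shard id -> heat (hypothetical unit)

    @property
    def heat(self):
        return sum(self.shards.values())


def rebalance_step(servers, tolerance=0.25):
    """Apply one rule evaluation and return a tuple describing the action."""
    average = sum(s.heat for s in servers) / len(servers)
    limit = average * (1 + tolerance)
    hottest = max(servers, key=lambda s: s.heat)
    coldest = min(servers, key=lambda s: s.heat)
    if hottest.heat <= limit:
        return ("none",)
    # Shards the coldest server can absorb without itself becoming too hot.
    movable = [(heat, sid) for sid, heat in hottest.shards.items()
               if coldest.heat + heat <= limit]
    if movable:
        _, sid = max(movable)
        coldest.shards[sid] = hottest.shards.pop(sid)
        return ("move", sid, hottest.name, coldest.name)
    # Nothing fits: re-partition the hottest shard into two smaller units.
    sid, heat = max(hottest.shards.items(), key=lambda item: item[1])
    del hottest.shards[sid]
    hottest.shards[sid + ".a"] = heat / 2
    hottest.shards[sid + ".b"] = heat / 2
    return ("split", sid, hottest.name)


servers = [
    Server("A", {"s1": 60, "s2": 30}),
    Server("B", {"s3": 10}),
    Server("C", {"s4": 20}),
]
actions = []
for _ in range(10):  # bounded loop so the sketch always terminates
    action = rebalance_step(servers)
    actions.append(action)
    if action == ("none",):
        break
for action in actions:
    print(action)
print({s.name: s.heat for s in servers})
```

In this run, shard s2 moves from A to B, s1 is too large to move anywhere and is split, and one half then moves to C, leaving every server within the tolerance of the average.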

Both load shifting and re-partitioning are done with minimal impact on ongoing application traffic. Our shard move protocol ensures that the new server has the latest copy (both from local updates and those via replication) before live traffic is routed from the old server to the new one.

[Figure: YAK in action]

Real-time Notification Streams

As seen in the preceding figure, a typical Web application architecture in a data center would store data in a key-value system (such as Sherpa). In addition, the architecture could include a caching layer (e.g., memcached), and an external index (e.g., Lucene) for full text search.

Application logic gets complex. Queries can be answered from one or more of the data stores. However, application code would need to make three update calls in the code flow to maintain three disparate data repositories — and, of course, deal with failure conditions of one or more of the stores. This problem gets more complicated in a globally replicated deployment. Applications need to maintain consistency across data centers for the cache and text index.

Sherpa makes life easy for application developers with Real-time Notification Streams. Applications can listen to the notification events on a table, and use them to maintain caches and external indexes. The Notification Stream is reliable, but events are delivered asynchronously from the associated update. Asynchronous delivery allows low-latency commits of updates, at the cost of a small gap (typically under a second in practice) before the update appears in the cache or index. A key simplifying feature is local delivery: updates to a record in any data center are delivered to a local notification client in each data center. (We have implemented strict security controls to ensure only authorized clients can receive notification events of a table.)

Each notification event includes a payload indicating the values that have changed, with both the old and the new value. Another useful feature is mandatory fields: fields that are included with each notification event even if they have not changed. For example, consider the use case of maintaining a secondary index called “soccer_moms”. When fields of a table are updated, the application is notified with the old value and new value of the respective field. However, to maintain the “soccer_moms” index, the application needs to know that the “gender” field is “female” and that “num_children” is greater than zero. Hence, a notification carrying only the update to a single field is not enough for index maintenance. Applications can thus request that both “gender” and “num_children” be included on each notification event (even if only one of them changes), in order to maintain the “soccer_moms” index.
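A notification handler for the “soccer_moms” example might look roughly like the sketch below. The event layout (a `key`, a `changed` map of old and new values, and a `mandatory` map of current values) is invented for illustration, as is the in-memory set standing in for the external index; the real event format is not described here.

```python
def qualifies(fields):
    """Index predicate from the text: gender is female and num_children > 0."""
    return (fields.get("gender") == "female"
            and (fields.get("num_children") or 0) > 0)


def apply_event(index, event):
    """Update the index from one notification event (hypothetical layout)."""
    # Start from the mandatory fields, then overlay the new values of
    # whatever changed, so the predicate always sees both fields.
    current = dict(event["mandatory"])
    for name, change in event["changed"].items():
        current[name] = change["new"]
    if qualifies(current):
        index.add(event["key"])
    else:
        index.discard(event["key"])


soccer_moms = set()
events = [
    {"key": "u1", "changed": {"num_children": {"old": 0, "new": 1}},
     "mandatory": {"gender": "female", "num_children": 1}},
    {"key": "u2", "changed": {"num_children": {"old": 0, "new": 2}},
     "mandatory": {"gender": "male", "num_children": 2}},
    {"key": "u3", "changed": {"city": {"old": "Reno", "new": "Tahoe"}},
     "mandatory": {"gender": "female", "num_children": 3}},
    {"key": "u1", "changed": {"num_children": {"old": 1, "new": 0}},
     "mandatory": {"gender": "female", "num_children": 0}},
]
for event in events:
    apply_event(soccer_moms, event)
print(sorted(soccer_moms))
```

Without the mandatory fields, the second event would only say that “num_children” went from 0 to 2, and the handler would need an extra read to find out whether the record belongs in the index.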

More consistency alternatives

The first deployment of Sherpa supported the timeline-consistency model (namely, all replicas of a record apply all updates in the same order) and provided API-level features to enable applications to cope with asynchronous replication. Strict adherence to this model leads to difficult situations under network partitioning or server failures. These can be partially addressed with override procedures and local data replication, but in many circumstances, applications need a more relaxed approach.

In late 2009, we introduced support for Eventual Consistency. Under this configuration, applications can perform inserts and updates on a table in the local replica with low latency. Conflicting updates from different replicas are resolved using a “last writer wins” policy at the field level (we call it merging) to ensure that replicas eventually converge to the same state. In the future, we are planning to introduce conflict notifications, to provide applications with the power to resolve such conflicts. Also, with eventual consistency, we maintain high availability in the event of server failures and network partitions. (A nice blog by Daniel Abadi talks more about this.)
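Field-level “last writer wins” merging can be illustrated with a short sketch. The record layout below (each field carrying a timestamp and a replica name) and the tie-break on replica name are assumptions for the example, not a description of Sherpa's storage format.

```python
def merge(local, remote):
    """Merge two versions of a record, field by field.

    Each record maps field name -> (timestamp, replica_id, value).
    The version with the larger (timestamp, replica_id) pair wins, so
    every replica picks the same winner regardless of arrival order.
    """
    merged = dict(local)
    for name, version in remote.items():
        if name not in merged or version[:2] > merged[name][:2]:
            merged[name] = version
    return merged


# Two replicas accepted conflicting updates to the same record.
east = {
    "status": (10, "dc-east", "at lunch"),
    "city": (5, "dc-east", "Sunnyvale"),
}
west = {
    "status": (8, "dc-west", "in a meeting"),
    "city": (7, "dc-west", "Bangalore"),
}

print(merge(east, west))
print(merge(west, east) == merge(east, west))
```

Because the winner is chosen per field, the merged record keeps the newer “status” from one replica and the newer “city” from the other, and both replicas converge to the same state whichever update they see first.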

What's ahead

In the months ahead, we plan to introduce several key features into Sherpa. A few that come to mind are secondary indexes, record-level adaptive replication, and very low latency access. It has been an extremely exciting time for Sherpa at Yahoo!. In the last year, we have onboarded numerous Yahoo! properties — such as YQL, Mobile, Yahoo! Social Platform (YOS), Advertising, MyYahoo!, Video, Sports, Shopping — into a multi-tenant, multi-datacenter system. In addition to bringing more users on to Sherpa, in the coming months Sherpa will grow to running on thousands of servers within Yahoo!.