Update on Traffic and Footprint
- Update on Traffic and Footprint
- Selective Record Replication
- Task Manager
- Storage Engines
- Dynamic Load Balancing
- Reliable Messaging
Sherpa was designed to serve as a massively scalable and elastic storage for structured data.
As such, Sherpa's adoption has grown dramatically at Yahoo! in the last two years.
Sherpa serves hundreds of customer applications, with 500+ Sherpa tables and more than 50,000 Sherpa tablets (shards) in operation. A pipeline of about 50 applications in various pre-deployment stages indicates that Sherpa will be hosting an even larger set of applications within the next few months.
Currently, there are thousands of Sherpa servers installed in more than a dozen data centers around the world. More than half of these data centers are outside the US. It is common for regional data centers to host hundreds of Sherpa servers, though some have only a few dozen.
Load can vary drastically among hosted applications. One of Sherpa's active application tables receives more than 75,000 requests per second, while other tables have far lower load profiles. Sherpa's automated load-balancing capabilities enable it to run on server farms composed of heterogeneous servers. Sherpa also supports banks of more homogeneous servers. For example, there are Sherpa banks that use SSD storage exclusively and others that use only HDDs. Banks allow Sherpa services to be differentiated to support customers with a range of throughput and latency requirements.
Yahoo!'s monitoring systems allow Sherpa tenants to examine their latency and throughput SLA fulfillment, along with various other performance metrics such as operation-specific throughputs, average latencies, and long-tail (90th, 95th and 99th percentile) latencies for all the regions where their applications are installed. They can also view storage unit statistics and status.
Sherpa administrators interact with the system through the administrative interface exposed by Sherpa's tablet controllers. Most administrative tasks such as load-balancing and recovery have been automated. Sherpa helps administrators through dynamic and automated mechanisms for splitting and moving load, and service engineers have upgraded Sherpa multiple times while it has been in operation with high levels of service availability and data consistency.
Sherpa cross-datacenter replication meets Yahoo!'s business continuity needs and allows data to be available everywhere. Using selective replication, Sherpa meets jurisdictional requirements on data replication.
Selective Record Replication
Up to this point, replication in Sherpa has been at the granularity of tables - i.e. a table may be replicated to any subset of available data centers, and all records for that table will be present in those locations. With the latest release, that granularity has been reduced to the record level.
There are two primary motivations for this: efficiency and legality.
Efficiency: Transferring and storing data costs money, so doing so unnecessarily wastes money. If a customer knows that a record is unlikely to be used in certain locations, Sherpa can avoid these costs. User-oriented data, e.g. a user's profile, is very likely to be accessed only from the locations of that user and her friends/contacts.
Legality: Certain user data may not be allowed to reside in certain locations. Yahoo! must comply with the laws of varying countries where its data centers are located. These laws also change over time. Having a flexible system for defining record-level replication is important.
For locations that do not include a full copy of a record, Sherpa maintains a 'stub' of the record. The stub is only updated when a record is created, deleted, or when replication rules are changed. The presence of the stub helps to forward requests to the correct data centers where full copies exist. It can also help to satisfy 'negative' requests without performing a forward, where a caller only wants to determine the existence of a key, but does not need the data (think of a spam checker, trying to find out if email addresses are real or fake).
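The stub's role in request handling can be sketched as follows. This is an illustrative model only, not Sherpa's actual API: the store layout, field names, and forwarding logic are all assumptions.

```python
# Illustrative sketch (names are hypothetical, not Sherpa's real interface):
# each data center keeps either a full record or a lightweight stub that
# names the data centers holding full copies.
stores = {
    "us-west": {"user1": {"type": "full", "data": {"name": "Ann"},
                          "full_at": ["us-west", "us-east"]}},
    "eu":      {"user1": {"type": "stub",
                          "full_at": ["us-west", "us-east"]}},
}

def handle_get(dc, key, existence_only=False):
    record = stores[dc].get(key)
    if record is None:
        return None                    # no stub present -> key does not exist
    if record["type"] == "full":
        return record["data"]          # full copy: served locally
    if existence_only:
        return True                    # 'negative'/existence checks need no forward
    # stub only: forward the read to a data center that holds a full copy
    remote = record["full_at"][0]
    return stores[remote][key]["data"]
```

Note how the spam-checker case from above is the `existence_only` path: the stub alone answers "does this key exist?" without a cross-colo round trip.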
Determination of which locations contain a full copy of the record and which receive only a stub is done in a static, or declarative, fashion. When a table is created, a default rule is established for each data center. If a record is inserted without any replication policy declared, the table default rules are used. Normally records would be inserted with a declared policy, which will override the default rules for one or more locations. The record policy may also be changed at any point, and the record will be re-replicated as appropriate.
For both legal concerns and ease of maintaining enough copies (for recovery from failure) of each record, there is also the ability to override both the table default rules and the various record policies and force all records in a location to be either full or stub. These overrides are called 'required' rules.
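The precedence among table defaults, per-record policies, and 'required' rules can be summarized in a small sketch. The function and rule names below are made up for illustration; only the precedence order comes from the description above.

```python
# Hypothetical precedence sketch: 'required' rules override the per-record
# policy, which in turn overrides the table's per-data-center defaults.
def replica_type(dc, table_defaults, record_policy=None, required=None):
    """Return 'full' or 'stub' for a record in the given data center."""
    if required and dc in required:           # forced by admin/legal 'required' rules
        return required[dc]
    if record_policy and dc in record_policy: # declared when the record is written
        return record_policy[dc]
    return table_defaults[dc]                 # fall back to table-creation defaults

table_defaults = {"us": "full", "eu": "full", "apac": "stub"}
record_policy  = {"eu": "stub"}   # this record is rarely read in Europe
required       = {"us": "full"}   # e.g. US must keep full copies for recovery
```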
When Sherpa recovers data from one location to another, it relies on the remote data center to contain all of the necessary data (obviously). With selective replication in play, this becomes a problem. One cannot restore a full copy of a record from a location that contains a stub. It is possible to imagine a system where recovery becomes widely distributed, hunting among all replicas to find a full copy of each record. Rather than implement this complex logic, Sherpa instead relies on requiring two locations to have full copies of all records. Given a failure, restoration can always occur from one of these full copies.
A full-table backup facility has been added. Up to this point, Sherpa has relied on redundant live copies to restore data lost due to machine failure, and has had no facility for recovering old versions of data. The Backup feature allows for multiple old copies of the full table to be saved in off-line storage. From this off-line storage customers may retrieve old versions of individual records.
Coming soon, customers will be able to fully restore an old version of an entire table, replacing the current live data. This can help in cases of multiple machine failures across regions, when the live system might fully lose some portion of a table, but it is more directly applicable for recovery from a client application problem. Applications may accidentally write bad data and need a mass 'undo' operation.
We are also working actively to support point-in-time recovery, which is scheduled for our next release.
Task Manager
On the internal side, Sherpa has added a general workflow manager to execute long-running tasks. The Backup and Restore operations, for instance, use it, since copying a 1TB table into or out of the system can take a long time, with intervening failures and retries.
The Task Manager uses a Sherpa table for storage of task state, making it reliable and fault-tolerant. Moving forward, most administrative commands that require distributed operations will be run through the Task Manager.
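The value of persisting task state can be seen in a toy checkpoint-and-resume loop. This is a sketch under assumptions: the table is modeled as a dict, and the function names are invented, not the Task Manager's actual interface.

```python
# Sketch of a fault-tolerant long-running task: progress is checkpointed to
# a table (here, an in-memory dict standing in for a Sherpa table) after
# each step, so a restarted task resumes where it left off.
task_table = {}  # task_id -> persisted task state

def run_backup_task(task_id, tablets, copy_fn, max_retries=3):
    state = task_table.get(task_id, {"done": 0, "retries": 0})
    while state["done"] < len(tablets):
        tablet = tablets[state["done"]]
        try:
            copy_fn(tablet)
        except IOError:
            state["retries"] += 1
            if state["retries"] > max_retries:
                raise                      # give up after too many retries
            task_table[task_id] = state    # checkpoint before retrying
            continue
        state["done"] += 1
        task_table[task_id] = state        # checkpoint: a restart resumes here
    return state
```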
Storage Engines
Today, in each of its storage units, Sherpa uses MySQL/InnoDB as the "local" storage engine to manage its assigned tablets and the records therein. However, the use of MySQL/InnoDB is not essential to Sherpa's operation: the specifics of the storage engine are hidden behind a well-defined storage API. Within a single Hack Week, we were able to port a significant portion of Sherpa's functionality to other storage engines such as BDB, BDB Java Edition, and a Log-Structured Merge (LSM) tree developed by Yahoo! Labs. We have also combined each of these technologies with advances in SSD technology. Software and hardware combinations produce interesting variations in throughput/latency trade-off graphs for varying data payloads, query types, and read-write ratios. Trade-offs in durability and recovery speed provide another set of factors that can affect throughput and latency. We will give a more detailed account of our work in this area later; in the meantime, we would like to share some general aspects of the problems we are trying to solve with our storage engine work.
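A storage API of the kind described above might look like the following sketch. The method names and the toy in-memory engine are illustrative assumptions, not Sherpa's real interface; the point is only that InnoDB, BDB, or an LSM tree could each implement the same contract.

```python
# Minimal sketch of a pluggable storage-engine API (hypothetical names).
from abc import ABC, abstractmethod

class StorageEngine(ABC):
    @abstractmethod
    def put(self, key, value): ...
    @abstractmethod
    def get(self, key): ...
    @abstractmethod
    def delete(self, key): ...
    @abstractmethod
    def scan(self, start, end): ...  # ordered range scan, [start, end)

class DictEngine(StorageEngine):
    """Toy in-memory engine; InnoDB, BDB, or an LSM tree would plug in the same way."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)
    def delete(self, key):
        self._data.pop(key, None)
    def scan(self, start, end):
        return [(k, self._data[k]) for k in sorted(self._data) if start <= k < end]
```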
The graph below shows the average read-write ratio over 30-minute intervals in a sample storage unit over a span of about four days. The average read/write ratio for this storage unit was 16:1, but it also experienced ratios as low as 2:1 and as high as 34:1. It is important to our customers and to us that we meet latency SLAs under all of these conditions.
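A windowed read/write ratio like the one plotted above can be computed from operation samples as follows; the sample format and function name are assumptions for illustration.

```python
# Back-of-the-envelope sketch: read/write ratios over 30-minute windows
# from (timestamp_sec, op) samples, where op is "read" or "write".
from collections import defaultdict

def window_ratios(ops, window_sec=1800):
    counts = defaultdict(lambda: [0, 0])   # window index -> [reads, writes]
    for ts, op in ops:
        bucket = ts // window_sec
        counts[bucket][0 if op == "read" else 1] += 1
    # guard against divide-by-zero in write-free windows
    return {b: r / max(w, 1) for b, (r, w) in sorted(counts.items())}
```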
As a multi-tenant service, Sherpa needs to support a wide range of read/write ratios because it hosts a mix of applications. For example, some applications perform a high-throughput bulk load followed by a read-heavy phase of random reads. Others, like monitoring systems, can be write-heavy but read-light.
We continue to explore enhancements to our local storage engine. Given the mix of our customers and Sherpa's capabilities, we will be able to install a range of storage engine technologies, whether software or hardware, to optimally serve various customers.
Local storage engine work is critical because it ultimately helps define the economic bound on machine utilization. What we have discovered through our studies is that a storage engine perfectly suited to Sherpa does not currently exist, but there is a ranking of suitability, and we know the types of modifications and enhancements needed to improve it.
Dynamic Load Balancing
As new storage units are commissioned, decommissioned, or failed over, tablets may be copied to new or other storage units. Tablets are also split and moved dynamically according to their size or level of hotness. The following graph shows the tablet distribution over a set of storage units in one Sherpa data center.
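The split-then-move idea can be sketched in a few lines. The thresholds, data layout, and function names below are made up for illustration; real balancing decisions would also weigh hotness, not just the size stand-in used here.

```python
# Hedged sketch: oversized tablets are split in half first, then tablets are
# moved from the most loaded storage unit to the least loaded one.
def rebalance(units, split_threshold=100):
    """units: {unit_name: [(tablet_name, load), ...]}; mutated in place."""
    for unit, tablets in units.items():
        for name, load in list(tablets):
            if load > split_threshold:          # split a hot/large tablet in half
                tablets.remove((name, load))
                tablets += [(name + ".a", load // 2),
                            (name + ".b", load - load // 2)]
    def total(u):
        return sum(l for _, l in units[u])
    hot, cold = max(units, key=total), min(units, key=total)
    while total(hot) - total(cold) > split_threshold and units[hot]:
        tablet = max(units[hot], key=lambda t: t[1])  # move the heaviest tablet
        units[hot].remove(tablet)
        units[cold].append(tablet)
        hot, cold = max(units, key=total), min(units, key=total)
    return units
```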
Reliable Messaging
It is worth noting that for its in-production operations, Sherpa relies on a home-grown reliable messaging system for various guarantees, including reliable in-colo and cross-colo delivery of transactions.
This messaging system has evolved from earlier generations of Yahoo messaging systems that mediated replication updates for its massive user profile databases. Sherpa's messaging backbone helps with Sherpa's cross-colo replication as well as with ensuring lower update latencies and more robust transaction durability and recovery. It also plays a critical role in Sherpa's notification service. For an upcoming release, the messaging service has already integrated an automated failure recovery mechanism relying on Zookeeper.
We host ongoing improvements and enhancements on special branches of our codeline and promote them incrementally to our internal customers at Yahoo!. Some enhancements take several iterations in our experimental branches to become fully robust.
We graduate mature ideas into Sherpa over time. Most immediately, secondary index support is being built into Sherpa, and will be available for tenant applications to use in the next release of Sherpa.
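One common way to provide secondary indexes in a system like this is to maintain an index table alongside the base table; the sketch below is speculative (the upcoming feature's actual design is not described here), and all names are invented.

```python
# Speculative sketch: every write to the base table also updates an index
# table mapping indexed-field value -> set of base-table keys.
base, index = {}, {}

def put(key, record, indexed_field):
    old = base.get(key)
    if old is not None:
        # drop the stale index entry before overwriting the record
        index.get(old[indexed_field], set()).discard(key)
    base[key] = record
    index.setdefault(record[indexed_field], set()).add(key)

def lookup(value):
    """Return all base-table keys whose indexed field equals value."""
    return sorted(index.get(value, set()))
```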
Other upcoming features include full, cross-colo, automatic table restoration from backups; bulk upload (in particular from Yahoo!'s Hadoop data grid); and export capabilities.
We are also making our elastic load-balancing controller more robust. Sherpa can not only move tablets; it can also split them before moving the resulting portions. In this sense, Sherpa is unique for its high levels of load-balancing control and elasticity, and for its ability to both move and split (re-quantize) its elasticity quanta. In upcoming releases, we intend to make our load-balancing controllers more robust in their dynamic and automatic operational control.
Authors: David Lomax, Masood Mortazavi, Suryanarayan Perinkulam, Toby Negrin, Sambit Samal