After working for a long time on many different distributed systems, you learn to appreciate how complicated the littlest things can become.
For example, when running an application on a local machine, changing configuration of an application involves clicking on a gui or, at worst, editing a file and restarting the app.
However, distributed applications run on different machines and need to see configuration changes and react to them.
To make matters worse, machines may be temporarily down or partitioned from the network.
Not only do these outages make things hard to configure, but they also make application health no longer a choice between dead or alive; you also have mostly alive or dead and the dreaded half dead.
Robust distributed applications also have the ability to incorporate new machines or decommission machines on the fly.
Partial failures together with elastic machine ensembles mean that even the configuration of the distributed application should be dynamic.
To make matters worse theoretical results such as the FLP proof (consensus is impossible with asynchronous systems and even one failure) and the CAP theorem (strong Consistency, high Availability, and Partition-tolerance: pick two, you can't get all three) mean that some compromises must be made.
As we were working with the different systems here at Yahoo, we had some very general requirements of distributed applications in this context.
First, applications need "ground truth"; they need an oracle, or service, that they can just believe without second guessing.
Second, the service needs to be as simple as possible, both to decrease the likelihood of bugs and make it as easy to understand as possible.
Third, the service needs to have good performance so that applications can use the service extensively.
If a developer goes to the trouble of integrating the service into their application, they should be able to make full use of it.
We designed ZooKeeper to meet these requirements.
(We already had a few distributed systems projects with animal names, and the term zoo conjures up a sense of chaos that tend to prevail in large systems.)
Our background in distributed file systems motivated a hierarchal namespace and file system like model.
That same background gave us insight into some of the features of file systems that are particularly hard to implement. (rename is the worst!)
We also thought that such a model would make it familiar to new developers since the file API is one of the earliest learned.
We added a couple of things such as the ability to set watches, callbacks, that will trigger on specific changes to files and ephemeral files which disappear if the client that created them disconnects (on purpose or due to failure) from ZooKeeper.
For reliability, we needed the service to be provided by a cluster of servers.
We have lots of clients and need high performance so we allow a client to connect to any active server in the cluster.
Since our initial target applications were very read dominant, we wanted read requests to be serviced by replicas without having to coordinate with other replicas.
We also didn't want to use locks for updates, both for the detrimental impact on performance, but also for the complications locks make on implementation.
So, we order all update requests and guarantee that all replicas see the same order of updates.
In the end all update requests are totally ordered and all reads are ordered with respect to update requests.
Of course, there are plenty of other details, but these were our key choices.
Our decision to focus on ordering turned out to be key to moving from a dynamic configuration service to a full fledged coordination service.
We had shown our production partners that things like configuration, leader election, group membership, and server status could all be done easily with ZooKeeper, but after hearing about a service from Google called Chubby they became convinced that the needed locks.
We looked at adding a lock method to our service, but we could not get it to fit nicely into our design.
Our implementation was completely wait-free, and adding a blocking method would require a complete redesign.
Soon we realized that by taking advantage of ordering and watches we could implement efficient locks at the client.
This lead us to start documenting some higher level primitives that can be implemented by clients without modifying the ZooKeeper server at all.
We call them recipes.
Once we got started, we realized we could do all sorts of coordination primitives like read-write locks, preemptable locks, queues, barriers, etc.
As an interesting postscript, the group that initially requested the locks ended up not needing them after all.
ZooKeeper's adoption into production was faster than we expected or could handle.
The first group we pitched the idea to immediately adopted it for their next project.
Eventually we stopped pitching ZooKeeper, because we needed to focus on implementation and we were getting too much interest.
We had written a prototype implementation, but the code base that is now ZooKeeper was implemented from scratch in a period of three months by a single developer.
Our initial users gave a lot of feedback and help get through a lot of the early bugs (thanks Mark and Zeke!).
Of course there have been many bugs that have been fixed since then and our developer base has grown to four.
By the end of the first year of ZooKeeper's development we open sourced it using SourceForge.
A few months after that we became an Apache subproject under Hadoop.
There are still plenty of things to do: partitioned namespace, more performance enhancements, higher level client primitives, etc.
So do you have a distributed system? (Yes, an application that runs on two machines counts!)
Want to make the world a better place?
Want to save humanity from utter chaos?
Join the discussion!
Apache ZooKeeper Committer