In the last release I took on the task of setting up a true system test environment for Apache ZooKeeper. Our previous environment ran the system test in a single JVM instance, which meant that there were some test scenarios that we just couldn't reproduce. In this new environment we wanted to be able to run tests across multiple hosts and deal with different numbers of machines and cluster environments.
My first attempt used ssh and scripts to fire off servers and clients on a cluster of machines. I soon realized that there was a significant amount of hardcoded configuration and envirnment
assumptions that would make the setup inflexible and impractical. So, I started over from scratch.
For the system test I wanted to use some set of machines, M, to host some number of ZooKeeper servers, S, and some number of clients C. Ideally the system test could just discover M at runtime, startup S servers on some subset of M and then startup C clients on the rest. I realized this is a perfect application for ZooKeeper. I also realized that such a use case can apply to more than just system tests.
What I needed to do was very similar to the rendezvous problem: you have two processes that startup with no knowledge of where each other is running and they need to find each other. Common solutions to this involve service discovery by broadcasting on well known ports, but for our network topologies broadcast based solutions will not work. But, ZooKeeper makes rendezvous easy. Each process creates an ephemeral znode with contact information as a child of a previously agreed upon znode (the rendezvous point). For example, let's use "/systest/available" as the rendezvous point. If four hosts startup, /systest/available would have the following children:
I can now see what machines are available for use in our system test.
Now that I know the machines that I can use, I want to start assigning work to them. The Java process that created /systest/available/host1 also created /systest/assignments/host1 and is watching that znode for children to appear. So when I need to assign task1 to a machine and assuming host1 is the least loaded machine, the system test will create the znode /systest/assignments/host1/task1, which will contain information about what class to instantiation and start as well as configuration parameters. The creation of /systest/assignments/host1/task1 will trigger a watch on the agent process running on that machine. The agent running on that machine will read the information about the
new task and start it. I also use the task znode to stop tasks, change their configuration, and get task status.
The end result is a very clean system test environment. The test classes themselves just interact through ZooKeeper, so there aren't any hardcoded assumptions about the environment needed. This also serves as a nice illustration of how Apache ZooKeeper can be used for more than just leader election and locking. Of course the next step would be to combine this scheme with something like OSGi to provide automatic management of the class dependences of the instances.