Authors: Anupam Seth, Robert Evans, and Thomas Graves, Core-Hadoop development team, Cloud Infrastructure Group.
Hadoop is widely used at Yahoo! to do all kinds of processing. It is used for everything from counting ad clicks to optimizing what is shown on the front page for each individual user. Deploying a major release of Hadoop to all 40,000+ nodes at Yahoo! is a long and painful process that impacts all users of Hadoop. It involves a staged rollout onto clusters of increasing importance (e.g. QA, sandbox, research, production), with all teams that use Hadoop asked to verify that their applications work with the new version. This hardens the new release before it is deployed on clusters that directly impact revenue, but it comes at the expense of the users of those clusters, who share the pain of stabilizing a newer version. Further, this process can take over six months. Waiting six months to get a user-requested feature onto a production system is far too long. It stifles innovation both for Hadoop and for the code running on Hadoop. Other software systems avoid these problems by more closely following continuous integration techniques.
Groundhog is an automated testing tool that helps ensure backwards compatibility (in terms of API, functionality, and performance) between releases of Hadoop before a new release is deployed onto clusters with a high QoS. Groundhog does this by providing an automated mechanism to capture user jobs (currently limited to pig scripts) as they run on a cluster and then replay them on a different cluster with a different version of Hadoop to verify that they still produce the same results. The test cluster can take inevitable downtime and still help ensure that the latest version of Hadoop has not introduced any new regressions. It is called Groundhog because, like the movie Groundhog Day, Hadoop can relive a pig script over and over again until it gets it right. The concept is similar to traditional fork/T testing in that jobs are duplicated and run in another location. However, Hadoop fork testing differs in that the testing does not occur in real time; instead, the original job, with all needed inputs and outputs, is captured and archived. Then, at any later date, the archived job can be re-run.
The main idea is to shorten the deployment cycle of a new Hadoop release by making it easier to get user-oriented testing started sooner and at a larger scope; specifically, to get testing running quickly enough to discover regressions and backwards-incompatibility issues early. Past efforts to bring up a test cluster and have Hadoop users run their jobs on it have been less successful than desired. Fork testing is therefore a method for reducing the human effort needed to get user-oriented testing run against a Hadoop cluster. Additionally, if the level of effort to capture and run tests is reduced, then testing can be performed more often and experiments can also be run. All of this must happen while following data governance policies, however.
Thus, fork testing is a form of end-to-end testing. If there were a complete suite of end-to-end tests for Hadoop, the need for fork testing might not exist. Alas, such a suite does not exist, and building fork testing is deemed a faster path to achieving the testing goal.
Groundhog currently works only with pig jobs. Because the majority of user jobs run on Hadoop at Yahoo! are written in pig, this restriction still leaves Groundhog with a good sampling of production jobs.
This section gives an insight into the requirements that drove the overall architecture and detailed design of the solution. The primary requirement was for the grid team to be able to capture a representative set of jobs from a grid running an older, stable version of Hadoop (e.g., the 0.20.xxx line) for later playback on a test grid running the latest Hadoop software stack. The jobs must be archived and localized so they can be run later, and re-run multiple times, without influencing external sources or sinks of data (e.g., modifying data in an Oracle DB or writing into user output folders in monitored project spaces). In other words, the method for running archived jobs should have protection in place to prevent the jobs from accessing systems outside of the cluster. The jobs also need to be filtered so that they are reproducible, to a measurable degree, on another Hadoop cluster, and are small enough to be runnable in a few hours without taking up too much space on the test cluster.
Once enabled, the capture and filtering of jobs should be automated. The grid team should be able to turn capture completely off, capture a TBD subset, or capture all jobs. The tests should run automatically from the Yahoo Hadoop QA test framework and report results back to it, while engineers should still be able to run the archived tests manually for debugging purposes. Validating the output of a job should be tolerant of runtime-specific data, such as timestamps, changing between runs. This is a challenging requirement that was not implemented successfully; doing so in the future might require more rigorous statistical tooling.
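To make the timestamp-tolerance requirement concrete, one plausible approach (not the one Groundhog ships, since the text notes this was not implemented successfully) is to normalize volatile fields before comparing records. The sketch below is a hypothetical Python illustration; the regular expression and function names are assumptions:

```python
import re

# Hypothetical sketch: replace runtime-specific substrings (ISO-style
# timestamps here) with a fixed token before comparing captured vs.
# replayed output records.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")

def normalize(record: str) -> str:
    """Mask volatile timestamp substrings with a placeholder token."""
    return TIMESTAMP_RE.sub("<TS>", record)

def outputs_match(captured: list, replayed: list) -> bool:
    """Order-insensitive comparison of normalized output records."""
    return sorted(map(normalize, captured)) == sorted(map(normalize, replayed))
```

A fuller solution would need per-job knowledge of which fields are volatile, which is part of what makes the requirement hard.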
Ideally, multiple tests should be able to run simultaneously, which requires that the tests be isolated from one another. Data security and governance requirements must be honored (e.g., who can access the data and how long to keep it). Capturing the jobs must have no performance, data, or functional impact on jobs being run on the source cluster. Grid ops must have some form of monitoring and control.
3. Architectural Overview
At Yahoo!, pig is launched by a program called pig.pl. We modified pig.pl to also launch Groundhog, when it is enabled, after a pig script has finished successfully.
Groundhog parses the pig command line and the pig script to find and capture the job's inputs, including the pig script itself, user defined functions (UDFs), and cache archives. It rewrites the script so that the inputs and outputs can be replaced with different values when the test is run, providing isolation from other jobs running on the same system. It also takes a statistical snapshot of the script's outputs for later comparison. From this, Groundhog is able to detect changes in the output, such as missing or mangled records. It can also detect APIs used by the script that have changed and are no longer backwards compatible.
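The path-rewriting step can be sketched as follows. This is an illustration only, not Groundhog's actual implementation: LOAD/STORE paths in a pig script are replaced with pig parameter placeholders so that replay can bind them to test-cluster locations. The function name and the regular expressions are assumptions:

```python
import re

# Hypothetical sketch: parameterize the paths in a pig script's LOAD and
# STORE statements so replay can substitute isolated test-cluster paths.
def parameterize(script: str):
    paths = {}
    def swap(kind):
        def repl(m):
            name = "%s%d" % (kind, len(paths))
            paths[name] = m.group(2)          # remember the original path
            return "%s'$%s'" % (m.group(1), name)
        return repl
    script = re.sub(r"(LOAD\s+)'([^']+)'", swap("IN_"), script)
    script = re.sub(r"(STORE\s+\w+\s+INTO\s+)'([^']+)'", swap("OUT_"), script)
    return script, paths
```

At replay time, the recorded `paths` mapping would be re-pointed at archived copies of the inputs and at scratch output directories, giving the isolation described above.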
Once the script is captured it is migrated to what we call an aggregation cluster. This is a logical cluster and does not have to be a physically separate cluster from where the test was captured. The test is then rerun on this grid a configurable number of times. This is to filter out tests that were not captured correctly or that do not have consistent behavior between runs.
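The stability filter on the aggregation cluster amounts to replaying each captured test a configurable number of times and discarding any whose outputs disagree. A minimal hypothetical sketch (the function names are assumptions, not Groundhog's API):

```python
# Hypothetical sketch: replay a captured test n_runs times and keep it
# only if every run produces the same output signature (e.g. a digest
# of its outputs). Tests with non-deterministic results are filtered out.
def is_stable(run_test, n_runs: int = 3) -> bool:
    """run_test() re-executes the archived job and returns its signature."""
    signatures = {run_test() for _ in range(n_runs)}
    return len(signatures) == 1
```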
Once the tests have been aggregated they can now be run on a test cluster. Groundhog pulls these tests from the aggregation cluster, runs them and then verifies that the results match bit for bit.
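One straightforward way to realize the bit-for-bit check (shown as an assumed sketch, not Groundhog's recorded format) is to hash each output part file at capture time and compare digests after replay:

```python
import hashlib

# Hypothetical sketch: compare replayed output part files bit-for-bit
# against digests recorded when the test was captured.
def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify(captured_digests: dict, replayed_parts: dict) -> list:
    """Return names of part files whose replayed bytes differ from capture."""
    return [name for name, data in sorted(replayed_parts.items())
            if digest(data) != captured_digests.get(name)]
```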
In this design we refer to three different clusters: the source cluster, the aggregation cluster, and the test cluster. These are logical clusters, not necessarily physical ones; any or all of them may be combined on a single physical cluster. If two logical clusters are combined, data that would otherwise be copied between them may optionally be left in place. In order to comply with data governance policies, only scripts that process data on a manually generated white list are captured. Periodically, a cleaner script also looks at all captured tests and removes those whose data is about to expire. Because we can only capture tests on a whitelist, we are required to ask grid users to opt into testing with Groundhog, which seriously limits the number and range of scripts that we can capture.
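The cleaner's core decision can be sketched as a simple retention check. This is a hypothetical illustration; the function name and the idea of tracking each test by its most short-lived input are assumptions:

```python
import time

# Hypothetical sketch of the periodic cleaner: flag archived tests whose
# input data will expire within a safety window, per retention policy.
def expiring_tests(tests: dict, window_secs: int, now=None) -> list:
    """tests maps test name -> expiry epoch of its most short-lived input."""
    now = time.time() if now is None else now
    return [name for name, expiry in sorted(tests.items())
            if expiry - now <= window_secs]
```

Flagged tests would then be removed from the archive before their underlying data is deleted, keeping the archive consistent with governance policy.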
4. Future outlook
Although Groundhog is potentially a great performance measurement tool, time constraints have so far limited our ability to use it as such. We have active plans to integrate features that would facilitate performance comparisons between releases.
Groundhog also had several requirements that forced it to become more of a proof of concept than a full-fledged solution. The desired time frame left no way to put any changes into Hadoop itself, and as of today Hadoop does not capture enough information about each job in its logs to recreate the job. For these reasons, Groundhog is currently limited to pig jobs, but we plan to extend it to regular Map-Reduce jobs by adding hooks in Hadoop that save off sufficient information to replay the jobs.
We are also planning to open source Groundhog in the near future so that other companies and individuals can build on the effort so far and contribute to making this a useful test tool for the entire Hadoop community. This idea has already received initial support both internally and externally.