Introducing YChaos - The resilience testing framework
<p>Shashank Sharma, Software Engineer, Yahoo</p>
<p>We, the resilience team, are glad to announce the release of <a
href="https://github.com/yahoo/ychaos">YChaos</a>, an end-to-end resilience testing framework that injects real-time failures into systems and verifies their readiness to handle those failures. YChaos is an easy-to-understand, quick-to-set-up tool for performing predefined chaos on a system.
</p>
<p>YChaos started as “Gru”, a tool that used Yahoo’s internal technologies to run “Minions” on a predefined target system. The Minions created a selected chaos on the system and restored it to its normal state once testing was complete. YChaos has evolved considerably since then: it has a better architecture, keeps the essence of Gru, and caters to open source users while still supporting technologies widely used at Yahoo, such as Screwdriver CI/CD and Athenz.
</p>
<h2>Get Started</h2>
<p>The term chaos is intriguing. To learn more about YChaos, start by installing the YChaos package:</p>
<p><code>pip install ychaos[chaos]</code></p>
<p>The above installs the latest stable YChaos package (Chaos subpackage) on your machine. To install the latest beta version instead, install from the test.pypi index:</p>
<p><code>pip install -i https://test.pypi.org/simple/ ychaos[chaos]</code></p>
<p>To install the actual attack modules that cause chaos on the system, install the agents subpackage. This is not needed if you plan to create chaos on a remote target.</p>
<p><code>pip install ychaos[agents]</code></p>
<p>That’s all. You are now ready to create your first test plan and run the tool. To learn more, head over to our <a class="c6"
href="https://yahoo.github.io/ychaos/get_started/">documentation</a>.
</p>
<h2>Design and Architecture</h2>
<p>YChaos is developed with the principles of Chaos Engineering in mind. The framework provides a way to verify that a system is in a condition that supports performing chaos on it, and it provides “Agents”, the actual chaos modules that inject a predefined failure into the system. The tool can also be used to monitor the system and verify that it is back to normal once the chaos is complete.
</p>
<h3>YChaos Test Plan</h3>
<p>Most YChaos modules require a structured document that defines the chaos and verification plan the user wants to perform. This document is called the test plan. It can be written in JSON or YAML and must adhere to the schema given by the tool.
</p>
<p>The test plan exposes a number of configurable attributes, including verification plugins and agents. Once the tool is given a test plan, it uses that configuration for everything it does from then on.
</p>
<p>If you have installed YChaos, you can check the validity of a test plan you have created by running:
</p>
<p><code>ychaos testplan validate /tmp/testplan.yaml</code></p>
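<p>To give a rough idea of what checking a structured plan involves, the sketch below loads a JSON document and reports which top-level sections are missing. The section names (<code>description</code>, <code>verification</code>, <code>attack</code>) and the helper <code>find_missing_sections</code> are hypothetical and chosen only for illustration. The real schema is defined by YChaos and is enforced by the command above.</p>

```python
import json

# Hypothetical top-level sections; the real schema is defined by YChaos.
REQUIRED_SECTIONS = ("description", "verification", "attack")


def find_missing_sections(plan_text: str) -> list:
    """Return the required sections that are absent from a JSON test plan."""
    plan = json.loads(plan_text)
    if not isinstance(plan, dict):
        raise ValueError("a test plan must be a JSON object")
    return [name for name in REQUIRED_SECTIONS if name not in plan]


# An illustrative plan that leaves out the (hypothetical) attack section.
example_plan = json.dumps({
    "description": "Burn CPU on one host",
    "verification": [],
})

missing = find_missing_sections(example_plan)
print("missing sections:", missing)
```

<p>A real validator would also check types and nested fields, which is why relying on the schema shipped with the tool is preferable to a hand-written check like this one.</p>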
<h3>YChaos Verification Plugins</h3>
<p>YChaos provides various plugins within the framework to verify the system state before, during and after the chaos. They can be used to determine whether the system is in a good enough state to perform an attack, whether it behaves as expected during the attack, and whether it has returned to normal once the attack is done.
</p>
<p>YChaos currently bundles the following ready-to-use plugins:</p>
<ol>
<li>Python Module : Self configured plugin</li>
<li>Requests : Verify the latency of an API call</li>
<li>SDv4 : Remotely trigger a configured Screwdriver v4 pipeline and use its completion as a verification criterion.</li>
</ol>
<br />
<p>We are currently working on metrics-based verification, which will verify a specific metric from an OpenTSDB server and provide different criteria (Numerical and Relative) for checking that the system is in an expected state.
</p>
<p>To learn more about YChaos Verification and how to run it, visit our <a class="c6"
href="https://yahoo.github.io/ychaos/verification/">documentation</a>. The documentation shows how to configure a simple python_module plugin and run verification.
</p>
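<p>The idea behind a latency check such as the one the Requests plugin performs can be sketched in a few lines. This is a minimal illustration and not the plugin's implementation: <code>verify_latency</code>, <code>max_latency_ms</code> and <code>fake_api_call</code> are hypothetical names, and the API call is replaced by a short sleep so the example runs without network access.</p>

```python
import time
from typing import Callable


def verify_latency(call: Callable[[], object], max_latency_ms: float) -> bool:
    """Run `call` once and report whether it finished within the threshold."""
    start = time.perf_counter()
    call()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms <= max_latency_ms


def fake_api_call() -> str:
    # Stand-in for a real HTTP request; sleeps for roughly 10 ms.
    time.sleep(0.01)
    return "ok"


# The verification passes when the call completes within the threshold.
print(verify_latency(fake_api_call, max_latency_ms=500.0))
```

<p>Running the same check before, during and after an attack gives a simple picture of how the system's responsiveness changes under chaos.</p>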
<h3>YChaos Target Executor</h3>
<p>
The target executor, or simply the Executor, determines the steps needed to run the Agent Coordinator. The target defines the place where the chaos takes place. The Executor determines the right configuration to reach the actual target, thereby making the target available for the Agent Coordinator to run the Agents.
</p>
<p>Currently, YChaos supports the MachineTarget executor, which connects to a particular host over SSH and runs the Agents on it. Other targets, such as Kubernetes/Docker and Self, are under consideration.
</p>
<h3>YChaos Agent Coordinator</h3>
<p>The agent coordinator prepares the agents configured in the test plan to run on the target. It also monitors the lifecycle of each agent so that all agents run in a structured way, and it ensures the agents are torn down before execution ends.
</p>
<p>The agent coordinator acts as a single point of control for all the agents running on the target.
</p>
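<p>The guarantee that agents are torn down before execution ends can be sketched with a <code>try</code>/<code>finally</code> block. This is an illustration under stated assumptions, not the YChaos coordinator: the <code>Agent</code> protocol, <code>RecordingAgent</code> and <code>coordinate</code> are hypothetical, and running agents one after another with teardown in reverse order is a choice made for this sketch.</p>

```python
from typing import List, Protocol


class Agent(Protocol):
    """Hypothetical minimal interface a coordinator could rely on."""

    def setup(self) -> None: ...
    def run(self) -> None: ...
    def teardown(self) -> None: ...


class RecordingAgent:
    """Toy agent that records the lifecycle calls made on it."""

    def __init__(self, name: str, log: List[str], fail: bool = False) -> None:
        self.name, self.log, self.fail = name, log, fail

    def setup(self) -> None:
        self.log.append(f"{self.name}:setup")

    def run(self) -> None:
        self.log.append(f"{self.name}:run")
        if self.fail:
            raise RuntimeError(f"{self.name} failed")

    def teardown(self) -> None:
        self.log.append(f"{self.name}:teardown")


def coordinate(agents: List[Agent]) -> None:
    """Run agents in order; tear down every started agent, even on error."""
    started: List[Agent] = []
    try:
        for agent in agents:
            agent.setup()
            started.append(agent)
            agent.run()
    finally:
        # Undo in reverse order so the most recent chaos is restored first.
        for agent in reversed(started):
            agent.teardown()


log: List[str] = []
try:
    coordinate([RecordingAgent("cpu", log), RecordingAgent("disk", log, fail=True)])
except RuntimeError:
    pass  # the failure still propagates after teardown has run
print(log)
```

<p>Even though the second agent fails during its run, both agents are torn down, which is the property a single point of control is meant to provide.</p>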
<h3>YChaos Agents (Formerly Minions)</h3>
<p>The agents are the actual attack modules that can be configured to create a specific chaos on the target. For example, CPU Burn Agent is specifically designed to burn up the CPU cores for a configured amount of time.
</p>
<p>Each agent is bundled with an Agent Configuration that exposes user-configurable attributes. For example, the CPU Burn Agent configuration provides the <code>cores_pct</code> attribute, which sets the percentage of CPU cores on the target that the process runs on.
</p>
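<p>To make the percentage concrete, the sketch below turns a <code>cores_pct</code> value into a number of cores. Only the attribute name <code>cores_pct</code> comes from YChaos; the <code>CpuBurnConfig</code> class, the <code>duration_s</code> field, the validation range and the choice to round up are assumptions made for illustration.</p>

```python
import math
import os
from dataclasses import dataclass


@dataclass
class CpuBurnConfig:
    """Hypothetical stand-in for an agent configuration."""

    cores_pct: float = 100.0  # percentage of CPU cores to load
    duration_s: int = 60  # hypothetical: how long to burn, in seconds

    def __post_init__(self) -> None:
        if not 0 < self.cores_pct <= 100:
            raise ValueError("cores_pct must be in (0, 100]")

    def effective_cores(self, total_cores: int) -> int:
        """Number of cores to load, rounded up, and at least one."""
        return max(1, math.ceil(total_cores * self.cores_pct / 100))


config = CpuBurnConfig(cores_pct=50)
# os.cpu_count() may return None, so fall back to a single core.
print(config.effective_cores(os.cpu_count() or 1))
```

<p>With this rounding rule, 50 percent of an 8-core machine loads 4 cores, and any valid percentage on a small machine still loads at least one.</p>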
<p>YChaos Agents are designed so that they can run independently, without intermediaries such as a coordinator. This helps in quick development and testing of agents.
</p>
<p>Agents execute through a sequence of lifecycle methods: setup, run, teardown and monitor. Setup initializes the prerequisites for an agent to execute. Run contains the program logic that performs the chaos on the system. Once run has executed successfully, teardown can be triggered to restore the system from the chaos created by that particular agent.
</p>
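<p>The lifecycle described above can be sketched as a small class run on its own, without a coordinator. The names <code>AgentState</code> and <code>NoOpAgent</code> are hypothetical, no real chaos is injected, and what <code>monitor</code> reports here is an assumption, since this post does not describe its output.</p>

```python
from enum import Enum


class AgentState(Enum):
    INIT = "init"
    SETUP = "setup"
    RUNNING = "running"
    DONE = "done"
    TEARDOWN = "teardown"


class NoOpAgent:
    """Hypothetical agent that only tracks its lifecycle state."""

    def __init__(self) -> None:
        self.state = AgentState.INIT
        self.chaos_active = False

    def setup(self) -> None:
        # Prepare prerequisites (nothing to do for this toy agent).
        self.state = AgentState.SETUP

    def run(self) -> None:
        # A real agent would inject the failure here.
        self.state = AgentState.RUNNING
        self.chaos_active = True
        self.state = AgentState.DONE

    def teardown(self) -> None:
        # Restore the system to its pre-chaos condition.
        self.chaos_active = False
        self.state = AgentState.TEARDOWN

    def monitor(self) -> dict:
        # Report a snapshot of the agent's current status.
        return {"state": self.state.value, "chaos_active": self.chaos_active}


agent = NoOpAgent()
agent.setup()
agent.run()
during = agent.monitor()  # snapshot taken after run, before teardown
agent.teardown()
after = agent.monitor()  # snapshot taken after the system is restored
print(during, after)
```

<p>Because the agent carries its own lifecycle, it can be exercised directly like this during development and handed to a coordinator later.</p>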
<h2>Acknowledgement</h2>
<p>We would like to thank everyone who has contributed ideas, concepts or code to YChaos. We extend our gratitude to all those who have supported the project from “Gru” to “YChaos”.</p>
<h2>Summary</h2>
<p>This post introduced YChaos, a new Chaos Engineering and resilience testing tool. It covered how to get started, briefly discussed the design and architecture of the components that make up YChaos, and gave some quick examples to start your journey with it.
</p>
<h2>References and Links</h2>
<ol>
<li>YChaos Codebase : <a href="https://github.com/yahoo/ychaos">https://github.com/yahoo/ychaos</a></li>
<li>YChaos Documentation : <a href="https://yahoo.github.io/ychaos">https://yahoo.github.io/ychaos</a></li>
<li>Our presence on PyPI
<ol>
<li><a href="https://test.pypi.org/project/ychaos/">https://test.pypi.org/project/ychaos/</a></li>
<li><a href="https://pypi.org/project/ychaos/">https://pypi.org/project/ychaos/</a></li>
</ol>
</li>
</ol>