What is chaos engineering and why is it important?

Chaos engineering lets engineers inject faults and failures by intentionally breaking things in production, with the goal of building more resilient systems.

What is chaos engineering?

Chaos engineering is the intentional act of injecting failure into production codebases and system configurations to measure how resilient they are to attacks and faults. It’s an exercise whose purpose is to build confidence that the system is strong enough to handle volatile conditions that may arise during production.

Principles of Chaos describes chaos engineering as the practice of running experiments, such as fault injection, on a system to build confidence in its resilience and its ability to withstand turbulent conditions in the production environment.

This article will describe the principles of chaos engineering, starting with its history and similar concepts in the next section.

The history of chaos engineering

The origins of chaos engineering as a concept date back to a database corruption incident at Netflix in 2008 that lasted three days and had a significant impact on operations. Following this incident, Netflix saw the need to migrate from their data center to the cloud and sought ways to prevent similar incidents in the new, cloud-based architecture. In the process, the aptly named Chaos Team at Netflix created the Chaos Monkey tool, and chaos engineering was born.

Chaos Monkey is an application that goes through a list of clusters, selects a random instance from each cluster, and turns it off without warning during work hours every workday. The type of failure Netflix engineers were dealing with was called vanishing instances, and every single instance had to be targeted.

Netflix engineers are known to work autonomously, so each engineer coupled resilience testing to their build. Netflix’s engineering motto, “freedom and responsibility,” means every engineer is free to build and roll out new features, but they must also take responsibility for maintaining these features. Chaos engineering gave Netflix confidence that the loss of an AWS instance would not affect the streaming experience.

Meanwhile, another concept called fault-oriented testing was in motion at IBM. While it is slightly different from chaos engineering as practiced today, they share the same underlying principles.

In the mid-2000s, Amazon moved from a monolithic architecture to microservices to manage the site’s customer traffic better. In addition, they hired firefighter-turned-technology-entrepreneur Jesse Robbins, who took over the responsibility for website availability. He introduced a project named Game Days, fashioned to emulate parts of the training prospective firefighters have to endure.

Why cause chaos intentionally?

Let’s look at practical reasons why introducing chaos engineering practices to your software system is, counterintuitively, beneficial rather than detrimental.

Prevent outages and costs

The unavailability of software as a result of faults is estimated to cost companies millions of dollars. In 2017, 98% of companies surveyed by Rand Group stated that just one hour of downtime could cost their organization over $100,000.

A report by Gartner estimated losses between $140,000 and $540,000 per hour, resulting from a resource, state, network, or application unexpectedly crashing. The resulting operational costs may include:

Loss in revenue
Loss of productivity and efforts
Customer dissatisfaction
Damage to brand image
Low morale and loss of employee talent

Build confidence in managing complex production systems

The process of taking software applications to production carries several unknowns, given the complexity of the system and the deployment environment. When software reaches production, it faces real hostile stress that no amount of preproduction planning could prevent. Causing chaos in production is like learning to drive on a busy highway instead of a quiet country road.

The main reason behind chaos engineering is that it helps build confidence in these production systems. It’s not just about causing chaos. It’s also about finding the chaos already present in a system in time.

Beyond software testing

Chaos engineering is essential because traditional testing methods are often not enough in enterprise systems. Traditional software testing seeks to test known conditions just before production. The tests are binary: pass or fail. And when software doesn’t pass a test, the code is refactored until it passes.

Chaos engineering, however, goes beyond just testing software to see if it compiles or runs as expected—it tests unpredictable conditions in production.

Eliminating faults in systems

Another reason why chaos engineering is important is the elimination of dark debt. Dark debt refers to latent faults in your production system. It is found in complex systems and gives rise to complex system failures.

Chaos engineering weeds out these faults, which are not discoverable at build time and therefore cannot be caught by traditional testing alone.

In the next section, we will learn about the “gold standards,” or advanced principles of chaos engineering.

Principles of chaos engineering

Casey Rosenthal, an engineering manager on the Chaos Team at Netflix, surveyed Netflix engineers in 2015 to better understand their views on chaos engineering and whether it added value to adopters.

The survey resulted in a proper definition of chaos engineering and produced the gold standards for practicing the concept. This set of principles provides the basis for engineers to determine whether to practice chaos engineering and shows them how to do it properly.

The advanced principles of chaos engineering include:

Building a hypothesis on steady-state behavior
Varying events in the real world
Running experiments in production
Automating experiments for continuous running
Minimizing the blast radius

Hypothesis on steady-state behavior

Chaos engineering is said to have borrowed concepts from science and academia. For instance, to practice chaos engineering in a given system, you need to know how that system behaves in its normal, steady state in production. In this light, chaos engineering is not unlike a scientific experiment.

A general form of the hypothesis for chaos engineering reads as follows: When x fails, the software or system will still be available for customers. This principle implores chaos engineers to focus on how the system is expected to behave and make comparisons accordingly.
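To make the general form of the hypothesis concrete, here is a minimal Python sketch of an experiment built around a steady-state check. The names SteadyStateHypothesis and run_experiment, the toy replica dictionary, and the 99% threshold are illustrative assumptions, not part of any chaos engineering tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SteadyStateHypothesis:
    """A measurable statement about normal behavior (hypothetical helper)."""
    description: str
    probe: Callable[[], float]  # returns the current value of the metric
    minimum: float              # lowest value still considered "steady"

    def holds(self) -> bool:
        return self.probe() >= self.minimum


def run_experiment(hypothesis, inject_fault, rollback):
    """Verify steady state, inject a fault, verify again, then roll back."""
    if not hypothesis.holds():
        return "aborted: system was not in steady state"
    inject_fault()
    try:
        survived = hypothesis.holds()
    finally:
        rollback()  # always restore the system, even if the probe raises
    return "hypothesis held" if survived else "weakness found"


if __name__ == "__main__":
    # Toy system: the service is available while at least one replica is up.
    replicas = {"a": True, "b": True, "c": True}
    hypothesis = SteadyStateHypothesis(
        description="When replica 'a' fails, the service stays available",
        probe=lambda: 1.0 if any(replicas.values()) else 0.0,
        minimum=0.99,
    )
    print(run_experiment(
        hypothesis,
        inject_fault=lambda: replicas.update(a=False),
        rollback=lambda: replicas.update(a=True),
    ))
```

The sketch checks the hypothesis both before and after the fault, so a system that is already unhealthy is never mistaken for one that failed the experiment.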

Varying events in the real world

This principle encourages chaos engineers to make the variables they introduce resemble real-world events as closely as possible, such as a server crash or a spike in traffic. As a result, engineers may create variables more tethered to the user experience than the system’s engineering experience.

Running experiments in production

Experiments should be conducted to build confidence and resilience in the production environment. As scary as it sounds, chaos engineering targets the production environment, not the staging area. Most failure injections should occur in production, regardless of it being a critical environment that directly affects users.

Automating experiments for continuous running

This principle strives to ease the headaches of navigating complex systems and running chaos experiments over large sets of instances. In line with the second principle, the probable faults in a large and complex system might be too numerous to anticipate, making the need for automation critical.

Minimizing the blast radius

Causing too much damage in production may adversely affect customer traffic and create the exact scenario that chaos engineering seeks to prevent. Hence, it is imperative to define a boundary that contains the experiment’s impact and keeps the consequences minimal.
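One way to picture a bounded blast radius is a cap on how many instances an experiment may touch, combined with an abort condition. The following Python sketch assumes a hypothetical 5% cap and a 2% error-rate threshold. Both numbers and both function names are illustrative only.

```python
import random


def pick_targets(instances, max_percent=5, seed=None):
    """Pick a random subset of at most max_percent of the fleet.

    At least one instance is chosen when the fleet is not empty, so small
    fleets can still be tested.
    """
    if not instances:
        return []
    limit = max(1, len(instances) * max_percent // 100)
    rng = random.Random(seed)
    return rng.sample(instances, limit)


def should_abort(error_rate, abort_threshold=0.02):
    """Stop the experiment once user-facing errors exceed the threshold."""
    return error_rate > abort_threshold


if __name__ == "__main__":
    fleet = [f"instance-{n}" for n in range(100)]
    targets = pick_targets(fleet, max_percent=5)
    print(f"Injecting failure into {len(targets)} of {len(fleet)} instances")
```

Starting with a small cap and widening it only after earlier runs hold the hypothesis keeps the experiment from becoming the outage it was meant to prevent.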

Chaos engineering tools

Ever since the creation of Chaos Monkey and the establishment of a chaos engineering community, engineers have been building tools to aid and improve chaos injection in three areas of a production environment:

The application’s state
Infrastructure resource
Network

Chaos in the application’s state

To introduce chaos to the application’s state, a service the application depends on can be removed from its environment. For example, a running Docker container can be stopped or removed using a tool like Pumba while the application is in production.

Another option is to test the application’s behavior in the absence of a dependency with Docker and kube-monkey. You can also mimic service outages or latency between service calls using a service mesh such as Istio, which supports fault injection.

Infrastructure resource chaos

This category of chaos involves simulating the loss, stoppage, or failure of virtual instances (with Chaos Monkey) or availability zones or regions (with Chaos Kong). Other tools in this category, from Netflix and others, include:

Chaos Lambda (for lower-scale chaos)
The Gremlin platform (offers chaos engineering as a service)
LitmusChaos
Janitor Monkey (now Swabbie)
Conformity Monkey

Network chaos

Running network chaos attacks helps verify that applications don’t have a single point of failure. You can simulate network failures on the system or tamper with network connections to a reasonable extent.

You can experiment by degrading network connectivity to see how applications behave under low connectivity conditions—especially if it is a mobile application with offline functionality, or single-page applications that run without an internet connection.

Toxiproxy is a popular tool for running network attacks.
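The idea behind degrading connectivity can be sketched without any proxy at all. The Python example below does not use Toxiproxy. It simulates added latency in-process and checks that a caller falls back to a default value when a timeout is exceeded. The function names, the delay, and the timeout are all hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def with_latency(func, delay_seconds):
    """Wrap func so every call is delayed, imitating a slow network hop."""
    def delayed(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return delayed


def call_with_fallback(func, timeout_seconds, fallback):
    """Call func in a worker thread and return fallback if it is too slow."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(func).result(timeout=timeout_seconds)
    except FutureTimeout:
        return fallback
    finally:
        pool.shutdown(wait=False)  # do not block on the slow call


if __name__ == "__main__":
    def fetch_profile():
        return "fresh profile"

    slow_fetch = with_latency(fetch_profile, delay_seconds=0.3)
    print(call_with_fallback(fetch_profile, 0.1, "cached profile"))
    print(call_with_fallback(slow_fetch, 0.1, "cached profile"))
```

A real network experiment would inject the delay between processes, but the question it answers is the same: does the application degrade gracefully when a dependency becomes slow?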

Implementing chaos engineering: A Kubernetes application example

When planning and preparing to inject chaos into a system, there is an established procedure to ensure what you’re doing is chaos engineering, and that you’re doing it well. The procedure comprises the following steps:

Step 1

The chaos team gathers to ask and answer specific questions based on previous incidents or concerns, such as, “Do we know how the system would behave if we did this?”

Step 2

Next, the team constructs a hypothesis for the experiment.

Step 3

Chances are, the system is complex. It is therefore mapped out to show the people, practices, processes, and functionalities involved. Example sketches and source code to practice with a dummy application are available in the Chaos Toolkit Community Playground.

Mapping out the application involves showing timeouts, the persistence layers, as well as the platforms and service providers used in the development and production environments. Finally, it covers the infrastructural resources being adopted to make the software application run in a steady state. These resources may include virtual machines, actual machines, network systems, or servers.

Step 4

As the final step, the team references the sketches of the system. During team discussions, questions will invariably come up about what could fail, what has caused problems before, and what could be impacted if something fails. Building a likelihood-impact map may help identify relationships in systems.
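A likelihood-impact map can be as simple as scoring each candidate failure on two scales and sorting by the product. The Python sketch below assumes a 1 to 5 scale for both dimensions, and the failure modes listed are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    likelihood: int  # 1 (rare) to 5 (frequent), an assumed scale
    impact: int      # 1 (minor) to 5 (severe), an assumed scale

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def prioritize(failure_modes):
    """Order failure modes so the riskiest are explored first."""
    return sorted(failure_modes, key=lambda mode: mode.score, reverse=True)


if __name__ == "__main__":
    # Hypothetical entries a team might list during the discussion.
    candidates = [
        FailureMode("database primary unavailable", likelihood=2, impact=5),
        FailureMode("cache node restarts", likelihood=4, impact=2),
        FailureMode("payment provider times out", likelihood=3, impact=4),
    ]
    for mode in prioritize(candidates):
        print(f"{mode.score:>2}  {mode.name}")
```

The ranking is only a starting point for the discussion. The team still decides which of the top entries is safe enough to turn into an experiment.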

Tooling up to automate chaos engineering with Chaos Toolkit

Most modern systems now have a complex architecture as more features are rolled out to suit the needs of a growing population of users. These systems also have to remain available to the users, even when a fault arises. Therefore, chaos engineering experiments have to be automated to take place at any time of the day, whether the engineers are at work or not.

The Chaos Toolkit is a popular ecosystem of extensions for carrying out chaos experiments. It’s open source and offers a variety of tools.

Chaos Toolkit uses an experiment format written in YAML or JSON. To access the toolkit, you must install its command-line interface (CLI), chaos. This CLI provides control over experiments by running them locally on your machine. The only dependency is the Python interpreter (as the CLI is written in Python), so Python must be pre-installed. The installation process is available from the Chaos Toolkit documentation page.

After completion, the CLI makes the chaos command callable on the system. This allows the team to:

Identify different types of chaos with the chaos discover command and record their information
Initialize new chaos experiments with the chaos init command
Execute automated chaos experiments written in JSON or YAML files using the chaos run command
Generate a human-readable report of the results with the chaos report command

Once Python is installed, create a Python virtual environment and install the Chaos Toolkit with the command:

pip install chaostoolkit

To see the subcommands and flags available for the chaos command, run the help option:

chaos --help

The documentation page provides sample experiments. There is further information available on using Chaos Toolkit on Kubernetes as well.
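As a rough illustration of what an experiment file contains, the Python sketch below builds an experiment as a dictionary and serializes it to JSON. The field names are intended to follow the Chaos Toolkit experiment format, but treat them as an assumption and verify them against the documentation. The health URL, the container name, and the use of docker stop and docker start as the fault and rollback are hypothetical.

```python
import json


def build_experiment(service_url, container_name):
    """Return an experiment: probe a URL, stop a container, restart it."""
    return {
        "version": "1.0.0",
        "title": "Service stays available when a dependency stops",
        "description": "Stop one container and check the service still answers.",
        "steady-state-hypothesis": {
            "title": "Service responds with HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "service-must-respond",
                    "tolerance": 200,  # expected HTTP status code
                    "provider": {"type": "http", "url": service_url, "timeout": 3},
                }
            ],
        },
        "method": [
            {
                "type": "action",
                "name": "stop-dependency",
                "provider": {
                    "type": "process",
                    "path": "docker",
                    "arguments": f"stop {container_name}",
                },
            }
        ],
        "rollbacks": [
            {
                "type": "action",
                "name": "restart-dependency",
                "provider": {
                    "type": "process",
                    "path": "docker",
                    "arguments": f"start {container_name}",
                },
            }
        ],
    }


def to_json(experiment):
    """Serialize the experiment so it can be saved as a .json file."""
    return json.dumps(experiment, indent=2)


if __name__ == "__main__":
    # Hypothetical URL and container name.
    print(to_json(build_experiment("http://localhost:8080/health", "cache")))
```

Saving the output to a file such as experiment.json would let you pass it to the chaos run command described above.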

Creating chaos in a Kubernetes application with kube-monkey

Kubernetes is a popular open-source tool software companies use to manage distributed systems. Given its popularity and wide adoption for production-grade software, we will use Kubernetes to provide an example of chaos engineering.

Kube-monkey and its operating method

Kube-monkey is the Kubernetes version of Chaos Monkey. It deletes Kubernetes pods in a cluster. By default, kube-monkey runs at 8 a.m. on weekdays to build a schedule of pod terminations, then kills the scheduled pods at random times between 10 a.m. and 4 p.m. the same day.

Kube-monkey uses an opt-in model, so only Kubernetes apps that have explicitly opted in can have their pods terminated. Here is an example of an opt-in deployment for killing one pod:


apiVersion: apps/v1 
kind: Deployment
metadata:
  name: terminapod
  namespace: app-namespace
spec:
  template:
    metadata:
      labels:
        kube-monkey/enabled: enabled
        kube-monkey/identifier: terminapod
        kube-monkey/mtbf: '2'
        kube-monkey/kill-mode: "fixed"
        kube-monkey/kill-value: '1'
[... omitted ...]

During scheduling time, kube-monkey will:

Generate a list of eligible Kubernetes applications that have opted-in
Determine which pod should be killed
Determine when to kill a pod

During termination time, kube-monkey will:

Check if the Kubernetes applications are still opted-in and eligible for chaos experiments
Check if the application has updated its kill mode and kill value
Terminate the selected pods of opted-in applications
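The kill mode and kill value labels together decide how many pods are terminated. The Python sketch below is a simplified model of that decision, not kube-monkey’s actual implementation. The mode names are taken from the kube-monkey labels, while the function name, the integer rounding, and the clamping to the number of running pods are assumptions.

```python
import random


def pods_to_kill(kill_mode, kill_value, running_pods, rng=None):
    """Return how many pods to terminate for a given kill mode (simplified)."""
    rng = rng or random.Random()
    if kill_mode == "kill-all":
        return running_pods
    if kill_mode == "fixed":
        count = kill_value                        # exact number of pods
    elif kill_mode == "fixed-percent":
        count = running_pods * kill_value // 100  # rounding down is assumed
    elif kill_mode == "random-max-percent":
        percent = rng.randint(0, kill_value)      # random share up to the max
        count = running_pods * percent // 100
    else:
        raise ValueError(f"unknown kill mode: {kill_mode}")
    # Never report more pods than are actually running.
    return max(0, min(count, running_pods))


if __name__ == "__main__":
    # Matches the deployment above: kill-mode "fixed", kill-value '1'.
    print(pods_to_kill("fixed", 1, running_pods=5))
```

Because kube-monkey re-checks the labels at termination time, a change to the kill mode or kill value made after scheduling still takes effect.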

You can pull a kube-monkey image from Docker Hub, or build the container locally like this:

go get github.com/asobti/kube-monkey
cd $GOPATH/src/github.com/asobti/kube-monkey
make build
make container

You can configure Kube-monkey with environment variables or a TOML file at /etc/kube-monkey/config.toml with the relevant configuration keys. An example config.toml file is seen below:

[kubemonkey]
dry_run = true                           # Terminations are only logged
run_hour = 8                             # Run scheduling at 8am on weekdays
start_hour = 10                          # Don't schedule any pod deaths before 10am
end_hour = 16                            # Don't schedule any pod deaths after 4pm
blacklisted_namespaces = ["kube-system"] # Critical apps live here
time_zone = "Europe/London"              # Set tzdata timezone example. Note the field is time_zone not timezone

The equivalent in environment variables is:

KUBEMONKEY_DRY_RUN=true
KUBEMONKEY_RUN_HOUR=8
KUBEMONKEY_START_HOUR=10
KUBEMONKEY_END_HOUR=16
KUBEMONKEY_BLACKLISTED_NAMESPACES=kube-system
KUBEMONKEY_TIME_ZONE=Europe/London

You can use a config to test Kube-monkey in debug mode:

[debug]
enabled= true
schedule_immediate_kill= true

With these settings, kube-monkey ignores the start and end hours and schedules pod kills within the next 60 seconds.

You can deploy with the conventional method:

kubectl apply -f km-config.yaml

Summary

With more and more organizations recognizing the importance of microservices and large-scale distributed systems, their systems are getting more complex and harder to understand. The complexity of a system can also result in dark debt, essentially nullifying the potency of traditional testing. Chaos engineering provides a potent solution to dark debt by identifying the hidden threats inherent in a system.

There may not be job openings for a “chaos engineer,” just like there are no roles for people who write code tests; it’s usually every team member’s job. However, large organizations like Netflix may employ people to fulfill chaos engineering roles. Either way, a good working knowledge of chaos experimental procedures and tools can come in handy.

This article described chaos engineering in depth, including the concept’s history, the tools for implementing it, and a use case. Feel free to experiment with the tools provided here to determine which one is best suited to inject chaos into your organization’s system.
