A company provides its software to meet the demands of its users. Typically, a company’s goal is to never let its software crash; the software needs to be available whenever a user wants it. Software requires resources to run, and chaos engineering stress-tests those resources under extreme user behavior and uncertain resource availability. Companies often attach a cost to every hour their services are not running.
Netflix is a leader in chaos engineering. Their service meets several criteria that demand they push the envelope of what is possible in internet services:
- Their services operate in all time zones.
- They have many users.
- Their users consume large amounts of data.
Like many companies, Netflix moved from a physical infrastructure to a cloud infrastructure, a multi-year migration that began in 2008. A cloud-based system runs into a whole new set of fires, and chaos engineering lets the Netflix team make a practice of fighting fires before they happen.
How we got to chaos engineering
Chaos engineering can be practiced when engineering any system, from modelling weather systems to supplying steady amounts of energy on the power grid, and even to making sure the necessary resources can be provided during a natural disaster. Bill Gates’ now prophetic warning was based on his team’s use of chaos engineering.
The chaos engineering practice grew in the 2010s because more and more companies found themselves maintaining large chaotic systems. First, there are more services being offered in the world. Second, the types of services offered are more complex. Mobile devices, mobile internet, and app stores have created a new around-the-clock, around-the-globe user base that demands a more complex service.
Technologies like cloud services have developed to accommodate this new user. Finally, cloud services create a chaotic environment, one very different from installing a piece of software on a local computer from a CD, and chaos engineering is a good practice for any tech company that wants to ensure its service stays available.
Benefits of performing chaos engineering
A company’s reputation and bottom line suffer when its services go down. This cost can be calculated as a dollar-per-hour metric and has become common in many companies’ KPIs. A sophisticated team uses chaos engineering to reduce downtime costs.
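As a rough illustration of the dollar-per-hour metric, the arithmetic might look like the sketch below. The function names and the dollar figures are assumptions made up for this example, not numbers from any real company.

```python
def downtime_cost(cost_per_hour: float, outage_minutes: float) -> float:
    """Estimate the cost of one outage from a dollar-per-hour downtime rate."""
    if cost_per_hour < 0 or outage_minutes < 0:
        raise ValueError("cost_per_hour and outage_minutes must be non-negative")
    return cost_per_hour * (outage_minutes / 60)


def yearly_downtime_hours(availability: float, hours_per_year: float = 365 * 24) -> float:
    """Convert an availability fraction (for example 0.999) into hours down per year."""
    if not 0 <= availability <= 1:
        raise ValueError("availability must be between 0 and 1")
    return hours_per_year * (1 - availability)


# Hypothetical figures, chosen only to show the arithmetic.
assumed_rate = 12_000.0  # dollars lost per hour of downtime (assumption)

print("Cost of a 45-minute outage:", downtime_cost(assumed_rate, outage_minutes=45))
print("Hours down per year at 99.9% availability:",
      round(yearly_downtime_hours(0.999), 2))
```

Multiplying the yearly downtime hours by the hourly rate gives a team a simple number to weigh against the cost of running chaos experiments.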
When a user is streaming on Netflix and the service fails, they may switch to a YouTube video, and Netflix loses money because it was unable to retain that user’s attention. Facebook loses ad revenue when its ads stop working, and Blizzard Entertainment loses video game players if it is known to have regular server outages.
A team uses chaos engineering to figure out when and why its system might fail, and then to work out how to design the system so that it does not.
How chaos engineering works
The engineering community developed the Principles of Chaos Engineering, whose primary objective is to increase the resiliency of a system. By testing a system with random failures, DevOps teams come to understand their system’s weaknesses. This lets them make informed decisions about which upgrades to prioritize.
Testing is conducted through chaos experiments. Each experiment starts with a hypothesis about the expected outcome, then observes the actual outcome and compares the two. The team’s next moves can be decided from there, perhaps based on previous community experience.
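The hypothesis, observe, compare loop can be sketched in a few lines. This is a toy model under stated assumptions: `ToyService`, `run_experiment`, and the 99% threshold are all hypothetical names and values invented for illustration, and a real experiment would measure a live system rather than a list of booleans.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    expected_min: float  # lowest success rate the hypothesis allows
    observed: float      # success rate measured while the fault was active

    @property
    def hypothesis_held(self) -> bool:
        return self.observed >= self.expected_min


class ToyService:
    """Stand-in for a real service: a few replicas behind a simple router."""

    def __init__(self, replicas: int = 3, failover: bool = True) -> None:
        self.healthy = [True] * replicas
        self.failover = failover

    def handle(self, request_id: int) -> bool:
        """Return True if the request is served."""
        primary = request_id % len(self.healthy)
        if self.healthy[primary]:
            return True
        # Without failover, a request routed to a dead replica is lost.
        return self.failover and any(self.healthy)

    def success_rate(self, requests: int = 300) -> float:
        served = sum(self.handle(i) for i in range(requests))
        return served / requests


def run_experiment(hypothesis: str, expected_min: float,
                   inject_fault: Callable[[], None],
                   measure: Callable[[], float],
                   rollback: Callable[[], None]) -> ExperimentResult:
    """Inject a fault, measure the outcome, and always undo the fault."""
    inject_fault()
    try:
        observed = measure()
    finally:
        rollback()
    return ExperimentResult(hypothesis, expected_min, observed)


def kill_one_replica(service: ToyService) -> ExperimentResult:
    """Experiment: take replica 0 down and check the success rate."""
    def inject() -> None:
        service.healthy[0] = False

    def rollback() -> None:
        service.healthy[0] = True

    return run_experiment(
        "Losing one replica keeps the success rate at or above 99%",
        0.99, inject, service.success_rate, rollback)


for failover in (True, False):
    result = kill_one_replica(ToyService(failover=failover))
    print(f"failover={failover}: observed={result.observed:.3f}, "
          f"hypothesis held={result.hypothesis_held}")
```

When the hypothesis does not hold, as in the run without failover, the gap between expected and observed is what the team takes into its next round of prioritization.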
Questions chaos engineering answers tend to go like this: Can we provision the resources to provide our service if…
- …a time zone of servers goes out entirely?
- …a user’s network continually cuts in and out?
- …spin-up time takes too long or a server fails to start altogether?
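The first question in the list can be framed as a simple capacity check before any fault is injected. The sketch below is illustrative only: the zone names, capacities, and demand figures are assumptions, and `survives_zone_loss` is a hypothetical helper rather than part of any tool.

```python
from typing import Dict


def survives_zone_loss(capacity_by_zone: Dict[str, int],
                       peak_demand: int) -> Dict[str, bool]:
    """For each zone, report whether the remaining zones cover peak demand
    if that one zone goes out entirely."""
    total = sum(capacity_by_zone.values())
    return {
        zone: (total - capacity) >= peak_demand
        for zone, capacity in capacity_by_zone.items()
    }


# Hypothetical capacities in requests per second.
zones = {"zone-a": 500, "zone-b": 500, "zone-c": 300}

for demand in (800, 900):
    print(f"peak demand {demand}:", survives_zone_loss(zones, peak_demand=demand))
```

A check like this only states the hypothesis on paper. The chaos experiment is what confirms whether traffic really shifts to the surviving zones when one is taken away.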
A chaos tool: Chaos Monkey
At the top of its field, Netflix is pushed to innovate. For chaos engineering, it has built a tool called Chaos Monkey to help test its system against random failures. According to the project’s GitHub, “Chaos Monkey randomly terminates virtual machine instances and containers that run inside of your production environment. Exposing engineers to failures more frequently incentivizes them to build resilient services.”
Chaos Monkey purposefully shuts down parts of the company’s system to force engineers to make it more resilient. It integrates with Spinnaker, the continuous delivery platform Netflix uses. For a closer look at how to use Chaos Monkey, see this page of the documentation.
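To make the idea of random termination concrete, here is a toy simulation of the selection step. It is not Chaos Monkey’s code, API, or configuration: `pick_victims`, the group names, and the `probability` parameter are all assumptions invented for this sketch, and nothing here touches real infrastructure.

```python
import random
from typing import Dict, List


def pick_victims(groups: Dict[str, List[str]],
                 rng: random.Random,
                 probability: float = 1.0) -> Dict[str, str]:
    """For each group of instances, decide at random whether to terminate
    one instance, and if so, choose which one."""
    victims: Dict[str, str] = {}
    for name, instances in groups.items():
        # rng.random() returns a float in [0.0, 1.0), so probability=1.0
        # always selects and probability=0.0 never does.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


# Hypothetical instance groups.
fleet = {
    "api": ["api-1", "api-2", "api-3"],
    "search": ["search-1", "search-2"],
    "batch": [],  # empty groups are skipped
}

rng = random.Random(42)  # seeded so a run can be reproduced
print(pick_victims(fleet, rng, probability=0.5))
```

Passing in a seeded `random.Random` keeps the randomness reproducible, which helps when a team wants to replay the exact failure that exposed a weakness.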
Additional resources
For more on this topic, check out these BMC Blogs and Guides:
- The Basics of IT Virtualization
- Using Spinnaker with Kubernetes for Continuous Delivery
- Considerations for Monitoring Cloud-Hosted Apps and Infrastructure
- DevOps Guide
- Kubernetes Guide