Chaos engineering has definitely become more popular in the decade or so since Netflix introduced it to the world via its Chaos Monkey service, but it’s far from ubiquitous. However, that will almost certainly change over time as more organizations become familiar with its core concepts, adopt application patterns and infrastructure that can tolerate failure, and understand that an investment in reliability today could save millions of dollars tomorrow.
In this episode of our Cloud & Culture podcast, VMware’s Sean Keery (aka “the minister of chaos”) shares the lowdown on the how and why of chaos engineering. He explains how to get started and what types of technologies will help build a solid foundation for measuring reliability, and he shares some real-world examples of how chaos engineering has kept applications online and saved companies money.
Some highlights from the episode are below, but you’ll want to listen to the entire episode for more details on implementing chaos and where the field is heading, among other things.
Chaos engineering in a nutshell
“Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. It allows us to identify weaknesses before they manifest in system-wide aberrant behaviors.
. . .
“Netflix started by shutting down machine instances to see how that affected the overall performance of its content delivery network. In Google's case, they inject bad data into a web request, or they take that web request and just throw it away.
. . .
“I like to think of it as an extension of test-driven development. We start with our unit tests of the actual package we're developing, then we do an integration test with the systems we know it's interacting with. And then chaos engineering provides us with a larger-scale testing platform.”
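To make the request-level fault injection described above more concrete, here is a minimal Python sketch. The handler, payload, and failure rates are all hypothetical illustrations, not anything Netflix or Google actually runs:

```python
import random

def with_chaos(handler, corrupt_rate=0.01, drop_rate=0.01):
    """Wrap a request handler so a small fraction of requests are corrupted
    (bad data injected) or dropped entirely."""
    def chaotic_handler(request):
        roll = random.random()
        if roll < drop_rate:
            # Throw the request away, as in the dropped-request example.
            raise TimeoutError("chaos: request dropped")
        if roll < drop_rate + corrupt_rate:
            # Inject bad data into the request before it reaches the handler.
            request = {**request, "payload": b"\x00corrupted"}
        return handler(request)
    return chaotic_handler

def handle_request(request):
    # Stand-in for a real service endpoint.
    return {"status": 200, "echo": request["payload"]}

if __name__ == "__main__":
    handler = with_chaos(handle_request)
    outcomes = {"ok": 0, "failed": 0}
    for _ in range(10_000):
        try:
            handler({"payload": b"hello"})
            outcomes["ok"] += 1
        except TimeoutError:
            outcomes["failed"] += 1
    print(outcomes)  # roughly 1% of requests should fail (the dropped ones)
```

Run against a real endpoint, the interesting questions become whether downstream consumers notice the corrupted payloads and whether retries absorb the dropped requests.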
Step 1: Imagine failure scenarios and limit blast radius
“For most customers getting started, what we do is hold a ‘game day.’ . . . And we say, ‘What could you possibly do to your system that would impact its availability to the customer or the security of the customer data?’ . . . We have people sit down in a room and start brainstorming scenarios that could have an impact.
“And then the second half of the day would be picking the scenarios that we think would limit the blast radius. We don't want to just shut down our Cisco routers, because that's going to take down our entire system. So we talk about blast radius and say, ‘How could we do a small experiment and measure that impact?’
“Once we do that, then we identify what we need to be measuring. When you talk about tools, really the most important tool to get started with is some sort of observability tool—ideally one that can look at your applications, your network, and your compute resources to try to identify whether your experiment has actually caused reliability to decrease across the system. That would be a really great way to start.
“Obviously there's a lot of other tools we can talk about. The Chaos Monkey was the first tool that Netflix put out there, and it was designed to allow developers to shut down their own systems—so it's self-service chaos. This is a little more mature chaos.”
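As a rough sketch of what a small, self-service, blast-radius-limited experiment like that can look like, the following assumes an AWS environment reached through boto3, a hypothetical "checkout" application tag, and a placeholder current_error_rate() probe that you would wire up to your observability tool:

```python
import random
import time

import boto3

APP_FILTER = {"Name": "tag:app", "Values": ["checkout"]}  # hypothetical target app
ERROR_RATE_SLO = 0.01  # steady-state hypothesis: under 1% of requests fail

def current_error_rate():
    # Placeholder: in practice, query your observability tool here.
    return 0.002

def terminate_one_instance():
    """Keep the blast radius small: terminate one instance, never the fleet."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(Filters=[APP_FILTER])["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        raise RuntimeError("no instances matched the filter; nothing to terminate")
    victim = random.choice(instances)
    ec2.terminate_instances(InstanceIds=[victim])
    return victim

def run_experiment():
    # 1. Confirm the steady state before injecting any failure.
    assert current_error_rate() < ERROR_RATE_SLO, "system not healthy; aborting"
    # 2. Inject one small, contained failure.
    victim = terminate_one_instance()
    print(f"terminated {victim}")
    # 3. Let the system react, then re-check the steady-state hypothesis.
    time.sleep(300)
    if current_error_rate() < ERROR_RATE_SLO:
        print("hypothesis held: the system absorbed the failure")
    else:
        print("weakness found: reliability dropped below the objective")

if __name__ == "__main__":
    run_experiment()
```

The point of the structure is the ordering: verify steady state, inject one contained failure, then measure whether reliability actually decreased, which is exactly where the observability tooling comes in.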
Outages, breaches, and attrition cost more than planning
“My background is in financial services, and what I found was that a great way to encourage people to understand where [chaos engineering is] going to add value quickly is by measuring the effect of downtime on your most critical system. When we looked at ours, it was about $50,000 a second, so about $3 million a minute. So it made sense to invest. If it's one engineer at, say, a hundred thousand dollars a year, and their time spent stopping this system from going down prevents even one minute of downtime, that's a 30-times return on investment.
“The second piece is security. The average cost of a system breach, according to the latest Verizon report, is $3.8 million. So, if you can potentially stop a breach from occurring by creating an experiment that would minimize the chance of a configuration error in your credentials, it's valuable.
“The last piece I would talk about is your team. [The goal of site reliability engineering] is to reduce toil for your operators or developers. It doesn't add value for them to be doing the same thing over and over again. So, for me, chaos engineering injects some fun into the job: ‘Go and try and break our systems.’ For me as an engineer, as an architect, that's something I look forward to coming to work to do.”
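Spelling out the arithmetic behind the downtime figure quoted above; these are the example numbers from the quote, not universal constants:

```python
# Back-of-the-envelope ROI using the figures from the quote above.
downtime_cost_per_second = 50_000                          # dollars
downtime_cost_per_minute = downtime_cost_per_second * 60   # 3,000,000 dollars
engineer_cost_per_year = 100_000                           # dollars

# If that engineer's reliability work prevents just one minute of downtime
# a year, the return is roughly 30x their cost.
roi = downtime_cost_per_minute / engineer_cost_per_year
print(f"${downtime_cost_per_minute:,} saved / ${engineer_cost_per_year:,} spent = {roi:.0f}x return")
```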
Before chaos, establish a foundation of reliability
“The first thing you need is a service level objective. What do I want my reliability to be? If you don't know that, you can't build anything else on top of it. For an iPhone launch, if I have 3 minutes of downtime, that's not acceptable. But for the hunting-license scenario for another customer I worked with, well, we can be down about 5 hours a week, as long as it's not during hunting season. So that service level objective is the key.
“A lot of people use the ‘nines.’ So is it 3 nines, about 43 minutes a month, that your system can be down? Great, let's start with that. Then how do we measure that? That's where those tools become important—something like Tanzu Observability or vRealize Operations Log Insight—so I can actually know that my system is available. Because I can't have reliability without availability.
“Kubernetes is a very nice-to-have when you get past one or two applications, because it allows me to consolidate that information into a shared workstream that then tells me about my portfolio of applications.”
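As a small sketch, here is one way to turn the "nines" mentioned above into a concrete downtime budget to measure against; the availability targets listed are just common examples:

```python
def downtime_budget_minutes(availability: float, period_minutes: float) -> float:
    """Minutes of allowed downtime for a given availability target over a period."""
    return (1 - availability) * period_minutes

MINUTES_PER_WEEK = 7 * 24 * 60     # 10,080
MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 (30-day month)

for label, target in [("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)]:
    per_week = downtime_budget_minutes(target, MINUTES_PER_WEEK)
    per_month = downtime_budget_minutes(target, MINUTES_PER_MONTH)
    print(f"{label}: ~{per_week:.0f} min/week, ~{per_month:.0f} min/month allowed")
```

Once the budget is explicit, the observability tooling mentioned above only has to answer one question: how much of that budget has an experiment, or an incident, actually burned?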
Learn more about chaos engineering
Should that be a Microservice? Part 5: Failure Isolation