Avoiding Chaos Monkeys — clouds that proactively avoid customer downtime

Next week at VMworld, I'm presenting a session called “Escaping the chaos monkey: enterprise vs. commodity clouds”, detailing the differences between the two types of cloud and the resulting impact on applications. The session is CIM 2865, Wednesday at 12:30 pm. The Chaos Monkey is a process developed by Netflix to simulate the unreliability of commodity clouds by randomly killing virtual machines (VMs).
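To make the idea concrete, here is a minimal sketch of what "randomly killing VMs" means for a fleet. It is not Netflix's implementation; the function name `chaos_monkey`, the `kill_probability` parameter, and the VM names are all made up for illustration, and a VM is just a string in a list.

```python
import random


def chaos_monkey(vms, kill_probability, rng):
    """Return the VMs that survive one chaos round.

    Each VM is terminated independently with probability `kill_probability`.
    This is a toy simulation: nothing is actually shut down.
    """
    survivors = []
    for vm in vms:
        if rng.random() < kill_probability:
            continue  # this VM was "killed" and drops out of the fleet
        survivors.append(vm)
    return survivors


# Hypothetical fleet of ten VMs; a seeded generator keeps the run repeatable.
rng = random.Random(42)
fleet = [f"vm-{i}" for i in range(10)]
survivors = chaos_monkey(fleet, 0.3, rng)
print(f"{len(survivors)} of {len(fleet)} VMs survived this round")
```

An application that is comfortable on a commodity cloud has to keep working no matter which subset of the fleet comes back from a round like this.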

Commodity clouds are designed to trade off VM uptime to reduce hardware costs and boost profitability. This assumes that all applications will be written as distributed systems that can take care of their own uptime, and that the uptime of any individual VM is not important. Amazon’s EC2 service, for example, offers no uptime guarantee for an individual VM.

To illustrate the contrast with an enterprise cloud approach, I’d like to share an example from VMware partner StratoGen, a hosting and cloud provider in London. Late last week, they needed to do emergency power maintenance in one of their London data centers, Telehouse West. In a commodity cloud, this kind of maintenance typically means some VMs are going to die, because there’s no infrastructure to keep VMs running while the hardware is replaced.

But StratoGen’s customers didn’t suffer an outage. In fact, they wouldn’t have known anything was happening if they hadn’t read the service bulletin. Here’s what StratoGen did: before maintenance began, they used VMware vMotion to migrate all running customer VMs from the Telehouse location to another data center, Park Royal in West London. The power maintenance was carried out as planned, and several hours later the VMs were vMotioned back to Telehouse. No VMs were harmed during this process 😉
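The sequence above (evacuate, maintain, return) can be sketched as a small simulation. This is only a model of the workflow, not StratoGen's tooling or the vMotion API: the `Site` class, both function names, and the VM names are hypothetical, and a "migration" here is just moving a name from one set to another.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    running_vms: set = field(default_factory=set)


def live_migrate_all(source, target):
    """Move every running VM from source to target; return the names moved.

    Stands in for a live migration: a VM never leaves the 'running' state,
    it only changes which site hosts it.
    """
    moved = sorted(source.running_vms)
    for vm in moved:
        source.running_vms.remove(vm)
        target.running_vms.add(vm)
    return moved


def maintain_with_evacuation(primary, standby, do_maintenance):
    """Evacuate primary, run maintenance on it, then bring the same VMs back."""
    moved = live_migrate_all(primary, standby)
    assert not primary.running_vms  # nothing is running on the site being serviced
    do_maintenance(primary)
    for vm in moved:  # return only the evacuated VMs, not the standby site's own
        standby.running_vms.remove(vm)
        primary.running_vms.add(vm)
    return moved


# Site names come from the post; the VM names are invented.
telehouse = Site("Telehouse West", {"vm-a", "vm-b"})
park_royal = Site("Park Royal", {"vm-x"})
maintenance_log = []
moved = maintain_with_evacuation(
    telehouse, park_royal, lambda site: maintenance_log.append(site.name)
)
print(f"Migrated and returned: {moved}")
```

The point of the model is the invariant in the middle: maintenance only starts once the site being serviced hosts no running VMs, so no customer workload is ever on hardware that is about to lose power.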

StratoGen’s approach is designed for the overwhelming majority of existing applications that assume a reliable server infrastructure layer. Often, these applications cannot be easily altered or re-written, or it is simply uneconomic to do so. This is the market served by enterprise public clouds, and StratoGen is a good example of the kind of service quality that helps those applications run better by avoiding individual VM downtime.

The irony is that because absolutely nothing went wrong, this is unlikely to generate headlines the way other recent cloud service interruptions have. Sometimes it's what doesn't happen that is valuable.