Return Of The Chaos Monkey: Positive Lessons From A Cloud Failure

It seemed like the Chaos Monkey had struck again when Amazon Web Services (AWS) customers suffered a major disruption yesterday, caused by a failure cascade that began with a network outage and rapidly spread to storage and compute resources. The twitterverse and blogosphere lit up as the blame game began.

Some blamed AWS, some blamed the network, some blamed AWS' customers and some blamed it on the boogie. That doesn't seem very helpful, but it does highlight the issue of the trade-off between cloud diversity and complexity.

Services that went down yesterday were accused of lacking diversity, both in their choice of cloud providers and in how they designed their applications for AWS. One way to reduce this risk is to eliminate common failure modes, such as a shared cloud provider, but that is often easier said than done, as concisely described in this blog post from @justinsantab.

The essence of the problem is that there can be a significant cost of complexity to gain diversity. Costs include writing an application in such a way that it is independent of the cloud's API, re-training operations staff and so on. Some companies affected yesterday decided that it wasn't worth hedging single-cloud risk by taking steps to diversify, and so suffered downtime. Depending on the trade-offs involved, this may have been a good decision… or not.
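
To make the idea of API independence concrete, here is a minimal Python sketch of what it can look like in application code. Everything in it is hypothetical and for illustration only: `ComputeBackend`, `FakeBackend`, `launch_with_failover` and the provider names `cloud-a` and `cloud-b` are assumptions, not any real cloud's API. The application codes against a neutral interface, and each provider sits behind its own adapter.

```python
from abc import ABC, abstractmethod
from typing import List


class ProviderUnavailable(Exception):
    """Raised by a backend when its cloud cannot serve the request."""


class ComputeBackend(ABC):
    """Hypothetical provider-neutral interface the application codes against."""

    name: str

    @abstractmethod
    def launch(self, image: str) -> str:
        """Start a workload from an image and return an instance identifier."""


class FakeBackend(ComputeBackend):
    """Stand-in for a real provider adapter; 'healthy' simulates an outage."""

    def __init__(self, name: str, healthy: bool = True) -> None:
        self.name = name
        self.healthy = healthy

    def launch(self, image: str) -> str:
        if not self.healthy:
            raise ProviderUnavailable(self.name)
        return f"{self.name}:{image}"


def launch_with_failover(backends: List[ComputeBackend], image: str) -> str:
    """Try each backend in order and return the first successful launch."""
    failed = []
    for backend in backends:
        try:
            return backend.launch(image)
        except ProviderUnavailable:
            failed.append(backend.name)
    raise ProviderUnavailable("all providers failed: " + ", ".join(failed))


primary = FakeBackend("cloud-a", healthy=False)  # simulated outage
secondary = FakeBackend("cloud-b")
result = launch_with_failover([primary, secondary], "web-frontend")
print(result)  # cloud-b:web-frontend
```

The sketch also hints at where the complexity cost comes from: in practice, each real adapter behind that interface would have to be written, tested and kept up to date against a different provider's API and behavior.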

Another cost is factoring in the cloud's behavior and feature differences. Different clouds can have fundamentally different design assumptions that have major repercussions on how an application performs in each cloud (the subject of my "Escaping the Chaos Monkey: Enterprise vs. Commodity Clouds" presentation).

One set of cloud providers has decided that making it easier for VMware customers to diversify their cloud providers is good for business: VMware vCloud Powered service providers, all of whom offer the same vCloud API and support the Open Virtualization Format (OVF). This reduces the barrier to redeploying cloud applications and data elsewhere. Being tied to an API can make it very hard to leave a cloud provider.

There are more than 4,100 service providers in the VMware Service Provider Program as of last week, and more and more of them are achieving the vCloud Powered designation (three more in the first three days of this week, for example). This dramatically reduces one complexity hurdle: API and workload format dependency. vCloud Powered providers all run VMware vSphere and vCloud Director, which eliminates another set of compatibility challenges for VMware virtualized workloads (Chris Wolfe at Gartner blogged on that challenge earlier this year).

Different cloud provider design decisions can also make these clouds behave differently, and there's a subset of vCloud Powered providers who have tackled that too: vCloud Datacenter providers. They have implemented a common service definition, audited by VMware, so that collectively they deliver a globally consistent cloud computing service. Today, there are six service providers in this program with global coverage: Bluelock, Colt, CSC, SingTel, Softbank and Verizon.

The devil is in the details when it comes to supplier diversity of any kind, and this is by no means some kind of panacea. But vCloud Powered and vCloud Datacenter do reduce the complexity and cost of cloud provider diversity, and that's a good thing.