AZ Failure Recovery Within VMware Cloud on AWS

Occasionally a customer will ask if there is any risk of losing data in the event of a transient AZ failure. The short answer is no; there is no need to worry. Temporary failures such as a power outage are not terminal.


One source of this confusion is mixed terminology. VMware and AWS personnel will sometimes refer to the hosts VMC uses (i3.metal) as ‘ephemeral.’ From the AWS perspective, i3.metal Nitro hosts are a failable asset: storing data on the local NVMe of a single host provides no long-term durability guarantee.


Enter vSAN

This changes when we add vSAN into the mix, because vSAN stores data not on a single host but across a collection of hosts, using a shared-nothing distributed object store. The loss of any single host is not a terminal event, but what about the cluster? What happens if an AZ loses power and every host is powered off at once?


The answer, of course, is an availability interruption. Single-AZ SDDCs would go offline, but the data would be secured. This is because vSAN secures every write per the storage policy before acknowledging write completion to the guest OS. The bottom line: any write that has been acknowledged to the guest is already secured on the local media of the powered-down hosts.
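The acknowledge-after-persist behavior described above can be illustrated with a toy model. This is only a sketch and not how vSAN is implemented: the `Host` and `Cluster` classes, the `copies` parameter (standing in for a storage policy that requires two replicas), and the replica placement rule are all hypothetical names and simplifications introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host whose local media keeps its contents across power loss."""
    name: str
    powered_on: bool = True
    media: dict = field(default_factory=dict)


class Cluster:
    """Toy shared-nothing store: each write is replicated to `copies` hosts."""

    def __init__(self, hosts, copies=2):
        self.hosts = hosts
        self.copies = copies  # assumed policy: replicas required before an ack

    def write(self, key, value):
        """Return True (the ack) only after every required replica is persisted."""
        # Simplification: replicas go to the first `copies` powered-on hosts.
        targets = [h for h in self.hosts if h.powered_on][: self.copies]
        if len(targets) < self.copies:
            return False  # policy cannot be met, so the write is not acknowledged
        for host in targets:
            host.media[key] = value
        return True

    def read(self, key):
        """Serve the value from any powered-on host that holds a replica."""
        for host in self.hosts:
            if host.powered_on and key in host.media:
                return host.media[key]
        return None

    def set_power(self, powered_on):
        """Model an AZ-wide power event affecting every host at once."""
        for host in self.hosts:
            host.powered_on = powered_on


cluster = Cluster([Host("esx-1"), Host("esx-2"), Host("esx-3")])
acked = cluster.write("block-1", "payload")

cluster.hosts[0].powered_on = False           # single host failure
after_host_loss = cluster.read("block-1")     # still served from a replica

cluster.set_power(False)                      # whole AZ loses power
during_outage = cluster.read("block-1")       # unavailable, but not lost

cluster.set_power(True)                       # power restored
after_recovery = cluster.read("block-1")      # acknowledged write is intact
```

In this model the outage shows up as unavailability (reads return nothing and new writes are refused), while every write acknowledged before the outage is readable again once the hosts power back on.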


SDDC Recovery

In the event of a transient failure, VMware would detect the failure and immediately contact AWS to coordinate the recovery. As soon as AWS personnel restore services within the AZ, any impacted hosts are powered back up. Once the cluster has finished powering up, VMware powers up the Management VMs, returning the SDDC to full service.

If that sounds intolerable, fear not. This risk can be mitigated by deploying a stretched cluster, which protects against even transient AZ failures. Multi-region and/or multi-cloud resilient solutions, which protect against even catastrophic failures, are available using VMware Site Recovery.


