VMware Cloud on AWS

2 Failure Toleration Requirements within VMware Cloud on AWS

Recently an account manager reached out and asked: why are customers required to use a policy that can survive 2 failures (FTT2) when a cluster has six or more hosts?

It comes down to maintaining the highest level of availability for the appropriately sized cloud SDDC. Let's look at this in more detail.

To understand this requirement, we need to take a step back and look at the solution as a whole. At a high level, Amazon runs and maintains the foundational services (EC2, Route 53, EBS, etc.). VMware consumes these services to provide VMC on AWS. Part of the service VMware provides is a continuous monitoring subsystem that tracks the health of every vSphere host within the VMC fleet. If any problems are detected, the system attempts to remediate the situation automatically, working through a series of steps to correct any recoverable failure. If the service cannot fully repair the fault, the host is replaced.

This proactive process enables customers to run clusters at scale without the maintenance or hot spare host typically found in on-premises deployments. Instead, the service leans on the elasticity of the AWS cloud to add resources on demand when they are needed.

What’s the problem?

If VMware is continuously monitoring the fleet, repairing any potential failures, and automatically replacing any troubled hosts, then why does it matter how many hosts are in the cluster?

As cluster scale increases, so does the number of VMs and vSAN components being managed. Replacing a compromised host requires relocating or rebuilding any data stored on that host, a process referred to as resynchronization. The time needed to complete this resynchronization and fully reprotect the data from failure varies; it can take several hours, or even days in some situations.

The exception is Elastic vSAN, which moves the EBS volumes as part of any host maintenance activity, dramatically decreasing the resynchronization time regardless of capacity.

SLA Impact

The SLA requirements reflect this reality while hiding the complexity of managing the process from the customer. As a result, the requirements differ between fixed-capacity hosts using local SSDs (i3.metal) and clusters using Elastic vSAN (r5.metal with EBS).

Fixed

As long as an i3.metal cluster has five hosts or fewer, the potential for failure balanced against the projected time to replace a host works out. VMware can guarantee the VMware Cloud on AWS service, confident that it will be able to reprotect any workload before encountering a second failure.

vSphere / vSAN Cluster

That math, however, meets a tipping point at six hosts. A six-host cluster has sufficient distinct components that the risk of double failure must be addressed, which is why VMware requires customers to upgrade to 2 Failure protection.
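To make the intuition behind "that math" concrete, here is a rough sketch of how the chance of a second failure grows with host count. It is not VMware's actual SLA model: it assumes independent hosts with a constant failure rate, and the rate and resynchronization window in it are hypothetical placeholders, not published figures. It also holds the window fixed, whereas in practice resynchronization takes longer as the cluster grows, which would make the curve steeper.

```python
import math


def second_failure_probability(hosts: int,
                               resync_hours: float,
                               host_failures_per_hour: float) -> float:
    """Chance that at least one surviving host fails before resync completes.

    Assumes independent hosts with a constant (exponential) failure rate.
    This is a simplification for illustration, not the service's real model.
    """
    if hosts < 1:
        raise ValueError("a cluster needs at least one host")
    surviving = hosts - 1  # one host has already failed and is being replaced
    return 1.0 - math.exp(-surviving * host_failures_per_hour * resync_hours)


if __name__ == "__main__":
    # Both values below are hypothetical, chosen only to show the trend.
    ASSUMED_FAILURES_PER_HOUR = 1e-5
    ASSUMED_RESYNC_HOURS = 24.0
    for hosts in range(3, 9):
        p = second_failure_probability(
            hosts, ASSUMED_RESYNC_HOURS, ASSUMED_FAILURES_PER_HOUR)
        print(f"{hosts} hosts: {p:.6f}")
```

The only point of the sketch is the trend: every added host raises the exposure during the reprotection window. Where the acceptable threshold sits, and why it lands at six hosts, is a decision encoded in the SLA rather than something this toy model derives.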

 

Elastic vSAN

Due to the innovative host maintenance/recovery process, any Elastic vSAN Cluster is protected with 1 failure toleration. Customers may still choose to implement 2 failure protection to mitigate the availability gap of a host replacement, but it is not required for durability purposes.

Does this change when using a Stretched Cluster?

At first, it may be tempting to use the 'extra' copy of the data in the other AZ as protection from failure, forgoing local resiliency altogether. This strategy has a couple of potential issues. Without local redundancy, any local failure requires a full resync across sites, increasing cross-AZ traffic. Perhaps more importantly, if an AZ fails, the workload is left without any local resiliency. For this reason, any stretched cluster workload requires both Dual Site Mirroring and 1 Failure toleration protection, regardless of scale or instance type.

What if I choose to use a different policy?

The VMware Cloud on AWS SLA attempts to simplify the complexity of running operations within the AWS cloud by clearly defining the baseline requirements for VMware to guarantee operations within the AWS infrastructure. Customers who have workloads that can tolerate service interruptions may choose to opt out of the SLA, either accepting the potential data loss risk or mitigating it through an alternate strategy. The service leaves room for the customer to decide what's best for their workload or business, while policy-based management allows this decision to be made on a per-VMDK basis. Together they allow any customer to realize hybrid operations safely today.

Summary

The VMware Cloud on AWS Service delivers a fully managed production-ready Software-Defined Data Center. Through thoughtful consideration of the potential risk, VMware has encapsulated all the complexity into a simple set of durability guidelines.

Standard Clusters

Cluster size | Site disaster tolerance | Failures to tolerate
3-5 | None (standard cluster) | 1 Failure
6+ | None (standard cluster) | 2 Failures

Stretched Clusters

Cluster size | Site disaster tolerance | Failures to tolerate
6-16 | Dual Site Mirroring (stretched cluster) | 1 Failure

 

At the same time, storage policy-based management empowers customers to opt into and out of those guidelines, aligning the cloud to the workload instead of the other way around.
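For readers who prefer code to tables, the guidelines can be expressed as a small lookup. This is only a sketch of the rules described in this post; the function and type names are made up for illustration and are not part of any VMware API or SDK. Rejecting sizes outside the ranges in the tables is also an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequiredPolicy:
    site_disaster_tolerance: str
    failures_to_tolerate: int


def required_policy(hosts: int,
                    stretched: bool = False,
                    elastic_vsan: bool = False) -> RequiredPolicy:
    """Return the minimum SLA-eligible policy for a cluster (illustrative only)."""
    if stretched:
        # Stretched clusters: Dual Site Mirroring plus 1 failure,
        # regardless of scale or instance type.
        if not 6 <= hosts <= 16:
            raise ValueError("stretched cluster sizes in the table are 6-16 hosts")
        return RequiredPolicy("Dual Site Mirroring", 1)
    if hosts < 3:
        raise ValueError("standard cluster sizes in the table start at 3 hosts")
    if elastic_vsan or hosts <= 5:
        # Elastic vSAN at any size, or a fixed-storage cluster of 3-5 hosts.
        return RequiredPolicy("None", 1)
    # Fixed-storage (i3.metal) cluster with six or more hosts.
    return RequiredPolicy("None", 2)


if __name__ == "__main__":
    print(required_policy(4))
    print(required_policy(6))
    print(required_policy(8, elastic_vsan=True))
    print(required_policy(6, stretched=True))
```

Since the policy is applied per VMDK, a check like this would run against each object's policy rather than once per cluster.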

Availability

To view the latest status of features for VMware Cloud on AWS, visit https://cloud.vmware.com/vmc-aws/roadmap.

Resources:

Take our vSAN 6.7 Hands-On Lab here, and our vSAN 6.7 Advanced Hands-On Lab here!