Last March we publicly announced a preview of an upcoming feature being developed for the VMware Cloud on AWS Service; Stretched Clusters, and Multi-AZ SDDC Instances. Yesterday for this exciting new feature. Today I would like to explain the underlying technology behind this new type of VMC SDDC, but first, let’s explore why it was needed. released an overview
AWS Regions and Availability Zones
Amazon’s global infrastructure is broken up into Regions. Each Region supports the services for a given geography. Within each Region, Amazon builds isolated and redundant islands of infrastructure called Availability Zones (AZ). When VMware deploys a vSphere Cluster as part of the VMware Cloud on AWS managed service; all hosts for a given cluster are placed into a single AZ. As a result, VMC on AWS SDDC Clusters are vulnerable to the incredibly rare but real threat of an AZ failure. To negate this consideration, Amazon recommends deploying a given service across multiple Availability Zones; utilizing network failover to mitigate any failures.
Enter vSAN Stretched Clusters
Soon all VMC on AWS customers will have the option to deploy a Stretched Cluster. When selected, a vSAN Stretched Cluster is created across three AZ’s, creating a vSphere Cluster that can survive the loss of entire Availability Zone. During the Create SDDC request, the cloud admin selects which VPC subnet should be linked to the tenant workload logical network. What’s not immediately apparent is that in doing so they are also designating which AZ the vSphere hosts are deployed in.
To protect against split-brain scenarios and help measure site health a managed vSAN Witness is also created in a third AZ. The third AZ is picked at random from the remaining Availability Zones. The Witness has been engineered to run on an EC2 M5.XL AMI to reduce the cost to the customer.
Finally, vSAN Fault Domains are configured to inform vSphere and vCenter which Hosts reside in which Availability Zones. Each Fault domain is named after the AZ it resides within to increase clarity.
Managing a Stretched Cluster
Upon initially logging into your newly provisioned Multi-AZ SDDC vCenter instance you will discover what appears to be a standard vSphere Cluster.
Taking a closer look at vSAN Cluster, Fault Domain configuration reveals a fuller story. Here we can see how the nodes have been evenly split between the two AZs specified in the SDDC creation wizard. Furthermore, we can see that the Cluster can survive the loss of any single Fault Domain without impacting running workload.
However, there isn’t anything to “manage” on the VMC Cluster. VMC on AWS abstracts away all the complexity traditionally associated with running a multi-site active/active vSphere Cluster. Perhaps more importantly, VMware operations are responsible for the monitoring and maintenance of the Cluster. The only thing that customers need do to make VMs capable of surviving the loss of an entire Availability Zone is to configure the VM Storage Policy.
Policy-Based Site and local data management
Those familiar with SPBM will notice the VM policy wizard has been completely refactored. The old PFTT/SFTT nerd nobs have been replaced with a straightforward vSAN policy panel. Here you can see how the Availability tab combines both local and site data management into a single view.
Diving into the Site Disaster Tolerance settings, we discover that we can control where our data is stored. There are three options for managing data placement.
Similarly, the cloud admin can specify how vSAN should store the data within each fault domain via the Failures to Tolerate setting. Failure to Tolerate is a simple concatenation of the previous Failure Toleration Methodology (FTM) and Secondary Failures to Tolerate (SFTT) settings. The new interface removes all guesswork from policy authoring.
Merely declare how you would like vSAN to protect any individual VM and or VMDK and the underlying storage system will make it so. Which brings us to the point of this offering; by assigning a Site Disaster Tolerance setting of Dual Site Mirroring in a VM Storage Policy you can make any VM resilient to an AZ failure. In the event of a failure, vSphere HA will restart the affected VM on the surviving hosts.
This capability is not free. It is, in fact, the Cloud equivalent of an On-premises vSphere Metro Storage Cluster. The enabling technology here is vSAN, and its ability to synchronously commit any writes across two fault domains (AZ’s).
However, Stretched Clusters for VMware Cloud on AWS allows customers to manage costs by enabling local and site-wide availability to be tuned to the needs of each individual item via VM Storage Policy assignment. I look forward to digging in and exploring more as we continue to develop VMware Cloud on AWS. If you have a particular question regarding Stretched Clusters for VMC on AWS, please don’t hesitate to ask!