VCF Storage (vSAN)

Improved Maintenance and Recovery Options for vSAN Stretched Clusters

vSAN stretched clusters are an extraordinarily popular topology deployed by a large percentage of our customers. Stretched clusters have proven to be an easy and reliable way to achieve site-level resilience without some of the complexity associated with metro storage clusters using storage arrays. For VCF 9.0, vSAN introduces new features that improve the flexibility, uptime, and operational tasks for stretched clusters. First is support for stretched compute clusters that mount the datastore of a stretched vSAN storage cluster, which makes stretched clustering for vSphere clusters easy. Next, the new site-wide maintenance capability takes the concept of “maintenance mode” to the site level for a stretched cluster. Third, the manual site takeover feature gives the administrator a self-service way to recover a single site in the event of a severe double-site failure of a stretched cluster.

A Better Way to Stretch vSphere Clusters across Sites

vSAN 8 U2 provided support for vSAN storage clusters in a stretched cluster topology, but it had limitations. The only type of client cluster that could mount the target datastore was a vSAN cluster that was also stretched. Any vSphere cluster that needed to mount the datastore could live in only one site or the other. This meant that for vSphere clusters, one could provide resilience of the data across two sites, but not resilience and high availability of the VM workloads across the two sites. The reason for this lack of support was that the hosts that made up the vSphere cluster had no concept of a fault domain as it is implied in a vSAN stretched cluster.

vSAN in VCF 9.0 fills this gap by allowing a vSAN storage cluster that is stretched across two data sites to be consumed by a vSphere cluster. In this topology, a vSAN storage cluster can be stretched across two sites (previously supported), and one or more vSphere clusters stretched across the same two geographic sites can mount that datastore (not previously supported). The result is effectively a much easier “metro storage cluster”: it removes the complexity of traditional metro storage clusters built on storage arrays by using the simple and robust architecture of vSAN.

Simplified Site Maintenance for vSAN Stretched Clusters 

Historically, a maintenance mode event in vSAN occurred at the host level. For customers who used vSAN in a stretched cluster environment, there was no easy or automated way to place all of the hosts that comprise a site into maintenance mode. This “site level” maintenance task has been a common request from our customers. While a customer could perform this action manually, it involved several steps, was prone to error, and, depending on conditions, might not achieve the desired consistency of data.

Site maintenance in a vSAN stretched cluster becomes much easier in VCF 9.0. The new capability reduces the actions required of the administrator to little more than a click in the UI or a call via API. It aims to ensure that all hosts in one site are safely placed into maintenance mode and that the data remains consistent. The workflow provides not only a reliable way to place an entire site into maintenance mode, but also an easy way to take the site back out of maintenance mode.

Self-service Recovery for Sustained Dual-Site Outage

A vSAN stretched cluster can tolerate the failure of any one of its three sites (two data sites and a witness site) and the data will remain available. Previously, a vSAN stretched cluster prevented data from being available in the event of a simultaneous failure of a data site and a witness site due to the quorum system. The quorum system’s ability to determine availability helps provide consistency of the data, safeguarding it from split-brain scenarios. In conditions of a sustained, simultaneous failure of a data site and a witness site, the object data in the remaining site would be unavailable. Making this data available in the one remaining site could only be achieved by contacting support services (GS) and relying on their manual efforts. This was time consuming and error prone for our customers.
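The quorum behavior described above can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which each of the three sites holds exactly one vote and an object is available only when the reachable sites hold a strict majority. The site names, the one-vote-per-site layout, and the has_quorum helper are hypothetical and chosen for illustration; actual vSAN assigns votes per object component, so this is not the product's implementation.

```python
from typing import Dict, Set

# Hypothetical, simplified vote layout: one vote per site.
# Real vSAN assigns votes per object component, not per site.
VOTES: Dict[str, int] = {"data-site-a": 1, "data-site-b": 1, "witness": 1}


def has_quorum(reachable: Set[str], votes: Dict[str, int] = VOTES) -> bool:
    """Return True when the reachable sites hold a strict majority of votes."""
    total = sum(votes.values())
    held = sum(votes[site] for site in reachable)
    # Strict majority: more than half of all votes must be reachable.
    return 2 * held > total


if __name__ == "__main__":
    scenarios = {
        "all sites up": {"data-site-a", "data-site-b", "witness"},
        "one data site down": {"data-site-a", "witness"},
        "witness down": {"data-site-a", "data-site-b"},
        "data site and witness down": {"data-site-a"},
    }
    for name, reachable in scenarios.items():
        state = "available" if has_quorum(reachable) else "unavailable"
        print(f"{name}: {state}")
```

In this model, losing any single site leaves two of three votes reachable, so data stays available. Losing a data site and the witness together leaves one vote, which is not a majority, so the surviving site cannot prove it holds the current copy of the data and the objects are treated as unavailable.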

The manual site takeover capability for vSAN in VCF 9.0 aims to provide a self-service recovery path for the scenario where one data site is in maintenance mode and the cluster then experiences a simultaneous double failure at the other two sites. In this scenario, the site in maintenance mode can be recovered to an available state, so that VM instances can be powered on and consume the storage.

This feature will initially debut in limited availability via Broadcom’s “Technical Qualification Request” (TQR), which replaces VMware’s “Request for Product Qualification” (RPQ) for features that are not generally available. To submit a TQR, please reach out to your account team to contact vSAN Product Management.

Conclusion

vSAN stretched clusters are a powerful option for environments that require the highest levels of data resilience and VM uptime. Simpler maintenance activities and improved recovery options add meaningful improvements to what is already the simplest way to build a multi-site VMware Cloud Foundation deployment.

