Leveraging Virtual SAN for Highly Available Management Clusters

A pivotal element in each Cloud Service Provider service plan is the class of service offered to tenants. The number of moving parts in a data center raises legitimate questions about the reliability of each component and its influence on the overall solution. Cloud infrastructure and services are built on the traditional three pillars of compute, networking, and storage, assisted by security and availability technologies and processes.

The Cloud Management Platform (CMP) is the management foundation for VMware vCloud® Air Network™ providers with a critical set of components that deliver a resilient environment for vCloud consumers.

This blog post highlights how a vCloud Air Network provider can leverage VMware Virtual SAN™ as a cost-effective, highly available storage solution for cloud services management environments, and how the availability requirements set by the business can be achieved.

Management Cluster

A management cluster is a group of hosts joined together and reserved for powering the components that provide infrastructure management services to the environment, some of which include the following:

  • VMware vCenter Server™ and database, or VMware vCenter Server Appliance™
  • VMware vCloud Director® cells and database
  • VMware vRealize® Orchestrator™
  • VMware NSX® Manager™
  • VMware vRealize Operations Manager™
  • VMware vRealize Automation™
  • Optional infrastructure services to adapt the service provider offering (LDAP, NTP, DNS, DHCP, and so on)

To help guarantee predictable reliability, steady performance, and separation of duties as a best practice, a management cluster should be deployed over an underlying layer of dedicated compute and storage resources without having to compete with business or tenant workloads. This practice also simplifies the approach for data protection, availability, and recoverability of the service components in use on the management cluster.

Rationale for a Software-Defined Storage Solution

The use of traditional storage devices in the context of the Cloud Management Platform requires the purchase of dedicated hardware to provide the necessary workload isolation, performance, and high availability.

In the case of a Cloud Service Provider, the cost and management complexity of these assets would most likely be passed on to the consumer through the service costs, with the risk of making the solution offering less competitive. Virtual SAN can dramatically reduce cost and complexity for this dedicated management environment. Some of the key benefits include the following:

  • Reduced management complexity because of the native integration with VMware vSphere® at the hypervisor level and access to a common management interface
  • Independence from shared or external storage devices, because it abstracts the hosts' locally attached storage and presents it as a uniform datastore to the virtual machines
  • Granular virtual machine-centric policies that allow you to tune performance on a per-workload basis

Availability as a Top Requirement

Availability is defined as “The degree to which a system or component is operational and accessible when required for use” [IEEE 610]. It is commonly calculated as a percentage, and often expressed in terms of the number of 9s.

Availability = Uptime / (Uptime + Downtime)
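
As a minimal sketch of the formula above, the following Python snippet computes availability from measured uptime and downtime. The function name and the sample figures (one year of operation with 8.76 hours of downtime) are illustrative assumptions, not values from a real environment.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return availability as a fraction: uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("the observation period must be positive")
    return uptime_hours / total


# Hypothetical measurement: 8.76 hours of downtime over one year.
downtime = 8.76
ratio = availability(HOURS_PER_YEAR - downtime, downtime)
print(f"Availability: {ratio:.3%}")
```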

To calculate the overall availability of a complex system in which every component must be operational, multiply the availability percentages of the individual components together.

Overall Availability = Element#1(availability %) * Element#2(availability %) * … * Element#n(availability %)
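
The multiplication above can be sketched in a few lines of Python. The component names and availability figures below are hypothetical placeholders chosen for illustration. The point of the example is that the combined result is always lower than the weakest element in the chain.

```python
from math import prod


def overall_availability(component_availabilities):
    """Multiply the availability fractions of components that must all be up."""
    values = list(component_availabilities)
    if any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("each availability must be a fraction between 0 and 1")
    return prod(values)


# Hypothetical chain of dependent elements (not measured values).
chain = {"rack": 0.9999, "host": 0.999, "storage": 0.9995}
total = overall_availability(chain.values())
print(f"Overall availability: {total:.4%}")
```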


Number of 9s   Availability %   Downtime/year    System/component inaccessible
1              90%              36.5 days        Over 5 weeks per year
2              99%              3.65 days        Less than 4 days per year
3              99.9%            8.76 hours       About 9 hours per year
4              99.99%           52.56 minutes    About 1 hour per year
5              99.999%          5.26 minutes     About 5 minutes per year
6              99.9999%         31.5 seconds     About half a minute per year
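
The downtime figures in the table follow directly from the number of 9s. A short Python sketch, assuming a 365-day year and a hypothetical helper name, shows the arithmetic:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def downtime_seconds_for_nines(nines: int) -> float:
    """Yearly downtime allowed by an availability of `nines` 9s.

    For example, three 9s (99.9%) leaves 0.1% of the year unavailable.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailable_fraction = 10 ** -nines
    return SECONDS_PER_YEAR * unavailable_fraction


for n in range(1, 7):
    seconds = downtime_seconds_for_nines(n)
    print(f"{n} nines: {seconds / 3600:.4f} hours of downtime per year")
```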

When defining the level of service for its offering, the Cloud Service Provider will take this data into account and compute the expected availability of the systems provided. In this way, the vCloud consumer is able to correctly plan the positioning of their own workloads depending on their criticality and the business needs.

In a single-tenant or multi-tenant scenario, because the management cluster is transparent to the vCloud consumers, the class of service for this set of components is critical for delivering a resilient environment. If any Service Level Agreement (SLA) is defined between the Cloud Service Provider (CSP) and the vCloud consumers, the level of availability for the Cloud Management Platform (CMP) should match, or at least be comparable to, the highest requirement defined across the SLAs, so that the management cluster and the resource groups remain in the same availability zone.
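
The comparison described above can be expressed as a simple check. This is only a sketch: the function name and the SLA targets are hypothetical, and a real assessment would also have to decide how much of a gap still counts as "comparable".

```python
def cmp_meets_slas(cmp_availability: float, tenant_sla_targets) -> bool:
    """Return True when the CMP is at least as available as the strictest SLA."""
    targets = list(tenant_sla_targets)
    if not targets:
        return True  # no SLA defined, so there is nothing to match
    return cmp_availability >= max(targets)


# Hypothetical SLA targets agreed with three tenants.
tenant_slas = [0.999, 0.9995, 0.9999]
print(cmp_meets_slas(0.9999, tenant_slas))
print(cmp_meets_slas(0.999, tenant_slas))
```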

Virtual SAN and High Availability

To support a critical management cluster, the underlying SDS solution must fulfill strict high availability requirements. Some of the key elements of Virtual SAN include the following:

  • Distributed architecture implementing software-based data redundancy, similar to hardware-based RAID, by mirroring the data not only across storage devices but also across server hosts for increased reliability and redundancy
  • Data management based on data containers: logical objects carrying their own data and metadata
  • Intrinsic cost advantage by leveraging commodity hardware (physical servers and locally attached flash or hard disks) to deliver mission-critical availability to the overlying workloads
  • Seamless ability to scale out capacity and performance by adding more nodes to the Virtual SAN cluster, or to scale up by adding new drives to the existing hosts
  • Tiered storage functionality through the combination of storage policies, disk group configurations, and heterogeneous physical storage devices

Virtual SAN storage policies include a number of failures to tolerate (FTT) setting, which determines how many copies of the virtual machine components are stored across the cluster: with mirroring, a value of n results in n + 1 copies. Raising or lowering this value increases or decreases the level of redundancy of the objects and their degree of tolerance to the loss of one or more nodes of the cluster.
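
To illustrate how FTT drives sizing, the sketch below assumes mirroring, where FTT = n produces n + 1 copies of each object, and the commonly cited rule that 2n + 1 hosts (or fault domains) are needed so that a majority survives n failures. The function name and the 500 GB figure are hypothetical, and the raw capacity shown ignores witness components, metadata overhead, and rebuild headroom.

```python
def mirroring_requirements(ftt: int, usable_gb: float) -> dict:
    """Estimate copies, minimum hosts, and raw capacity for a mirrored object.

    Assumes FTT = n yields n + 1 full copies and requires 2n + 1 hosts.
    """
    if ftt < 0:
        raise ValueError("ftt must be zero or greater")
    copies = ftt + 1
    return {
        "copies": copies,
        "min_hosts": 2 * ftt + 1,
        "raw_capacity_gb": usable_gb * copies,
    }


# Hypothetical management workload needing 500 GB of usable space.
for ftt in (0, 1, 2):
    print(ftt, mirroring_requirements(ftt, 500))
```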

Virtual SAN also supports and integrates with VMware vSphere® availability and data protection features, including the following:

  • In case of a physical system failure, vSphere HA powers up the virtual machines on the remaining hosts
  • VMware vSphere Fault Tolerance (FT) provides continuous availability for virtual machines (applications) of up to 4 vCPUs and 64 GB of RAM
  • VMware vSphere Data Protection™ provides a combination of backup and restore features for both virtual machines and applications

Architecture Example

This example provides a conceptual system design for implementing a CMP with basic resiliency, supported by Virtual SAN, in a cloud service provider scenario. The key elements of this design include the following:

  • Management cluster located in a single site
  • Two fault domains identified by the rack placement of the servers
  • A Witness to achieve a quorum in case of a failure, deployed on a dedicated virtual appliance (a Witness Appliance is a customized nested ESXi host designed to store objects and metadata from the cluster, pre-configured and available for download from VMware)
  • Full suite of management products, including optional CSP-related services
  • Virtual SAN general rule for failures to tolerate (FTT) set to the value of 1 (two copies per object)
  • vSphere High Availability feature enabled for the relevant workloads

This example is a starting point that can provide an overall availability close to four 9s, or 99.99%. Virtual SAN can deliver higher availability when the number of copies per object (driven by FTT) and the number of fault domains are increased.
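
The effect of adding copies and fault domains can be approximated with a simplified quorum model. The sketch below is an illustration only, not the method used to derive the figure above. It assumes that an object with FTT = n is spread over 2n + 1 fault domains (n + 1 copies plus n witnesses), that the object stays accessible while a majority of them is available, and that fault domains fail independently with the same availability. The 99% per-domain figure is hypothetical.

```python
from math import comb


def object_availability(domain_availability: float, ftt: int) -> float:
    """Probability that a majority of the 2*ftt + 1 fault domains is up.

    Simplified model: independent failures, identical availability per domain.
    """
    if not 0.0 <= domain_availability <= 1.0:
        raise ValueError("domain_availability must be between 0 and 1")
    if ftt < 0:
        raise ValueError("ftt must be zero or greater")
    votes = 2 * ftt + 1
    quorum = ftt + 1
    a = domain_availability
    return sum(
        comb(votes, k) * a**k * (1 - a) ** (votes - k)
        for k in range(quorum, votes + 1)
    )


# Hypothetical fault domains that are each available 99% of the time.
for ftt in (0, 1, 2):
    print(f"FTT={ftt}: {object_availability(0.99, ftt):.6%}")
```

Under these assumptions, each increase of FTT reduces the modeled unavailability by roughly two orders of magnitude, at the cost of extra capacity and additional fault domains.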

Some of the availability metrics for computing overall availability are variable and lie outside the scope of this blog post, but they can be summarized as the following:

  • Rack (power supplies, cabling, top of rack network switches, and so on)
  • Host (physical server and hardware components)
  • Storage device MTBF (both SSD and spinning disks)
  • Storage device capacity and performance (which influence rebuild time)
  • Selection of the FTT, which influences the required capacity across the management cluster

The complete architecture example will be documented and released as part of the VMware vCloud Architecture Toolkit for Service Providers in Q1 2016.