VMware Cloud Foundation

Protecting Your VCF Management Domain

Disaster-proofing your SDDC management domain is a breeze with VMware Cloud Foundation.

Overview

VMware Cloud Foundation is an integrated software stack that bundles compute virtualization (VMware vSphere), storage virtualization (VMware vSAN), network virtualization (VMware NSX), and cloud management and monitoring (VMware vRealize Suite) into a single platform that can be deployed on premises as a private cloud or run as a service within a public cloud. Cloud Foundation helps to break down the traditional administrative silos in data centers, merging compute, storage, network provisioning, and cloud management to facilitate end-to-end support for application deployment.

It maps a converged view of physical resources (e.g., CPU, memory, storage, and network) to a logical abstraction. Cloud Foundation overlays a software suite on top of the physical hardware for operations management, event reporting, and auditing. This enables Cloud Foundation to provide consistent hardware management across switches, servers, and storage, as well as a consolidated management solution across your SDDC.

In addition to the SDDC software stack, Cloud Foundation includes SDDC Manager, an application that automates the entire SDDC lifecycle (from initial bring-up, to configuration and provisioning, to upgrades and patching), and simplifies day-to-day management and operations.

One of the fundamental aspects of Cloud Foundation is the concept of Workload Domains. A workload domain is a policy-based resource container with specific availability and performance attributes that combines compute (vSphere), storage (vSAN) and networking (NSX) into a single consumable entity. A special-purpose workload domain dedicated to infrastructure and management tasks is called a Management Domain. During bring-up, the management domain is automatically created on a vSAN cluster in a Cloud Foundation system. It contains the following management components:

  • SDDC Manager
  • vCenter Server and Platform Services Controllers
  • vRealize Log Insight
  • NSX

Management Domain Components

Why protect the management domain?

The management domain is the heart of any SDDC stack: it hosts the entire management infrastructure, such as vCenter Server and the NSX management plane components. It is brought up on a single site with no disaster protection or avoidance, so a single point of failure could disrupt application availability. Protecting the management domain is therefore important to ensure business continuity.

Solution

We can take advantage of the disaster avoidance and protection that vSAN offers in the form of a Stretched Cluster. A vSAN Stretched Cluster is a specific configuration used where business continuity is critical and downtime must be avoided. vSAN Stretched Clusters offer resilience against the loss of a site by synchronously replicating VM data between datacenters. They can also be used for planned maintenance and load balancing.

vSAN is tightly integrated with vSphere HA, which automatically restarts VMs on the secondary site if a problem occurs on the primary site. vSAN also works closely with vSphere Distributed Resource Scheduler (vSphere DRS), which helps balance the resource load in a vSphere cluster by intelligently placing workloads or offering placement recommendations.

The management domain can be stretched across two sites so that it is protected. If one site is disrupted by a disaster, it is business as usual for the organization, since the management domain and the rest of the stack remain available through the other site. The procedure to stretch the management domain is automated and simple.
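Why a third location matters can be illustrated with a deliberately simplified vote model. The sketch below assumes one data replica per availability zone plus a witness component, and treats an object as accessible while more than half of those three components, including at least one data replica, are reachable. The names are illustrative only, and real vSAN vote accounting is more detailed than this.

```python
# Simplified vote model of one object in a stretched cluster:
# a data replica in each availability zone plus a witness component.
# This is an illustration, not how vSAN is implemented internally.
DATA_SITES = ("AZ1", "AZ2")
ALL_SITES = DATA_SITES + ("witness",)


def object_accessible(failed):
    """Return True if a majority of components, including a data replica, survive."""
    surviving = [site for site in ALL_SITES if site not in failed]
    has_data_replica = any(site in DATA_SITES for site in surviving)
    return has_data_replica and len(surviving) * 2 > len(ALL_SITES)


print(object_accessible({"AZ1"}))             # True: AZ2 and the witness remain
print(object_accessible({"AZ1", "witness"}))  # False: only one of three remains
```

In this model the loss of either availability zone, or of the witness alone, leaves the object accessible, which is the behaviour the stretched management domain relies on.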

Stretching Management Domain

Let us go over the steps for stretching the management domain of an NSX-V cluster. Of the few supported topologies, we will use vSAN over an L2 network here. As an example, we will follow a use case with two availability zones, AZ1 and AZ2, in two buildings on an office campus.

L2 vSAN Stretched Cluster

A word about Availability Zones

An availability zone is a collection of infrastructure components that is fully isolated from other zones, so that a failure in one zone does not spread to the others. It runs on its own independent infrastructure with its own power, cooling, network, security and other physical aspects.

Prerequisites

  • Ensure an appropriate vSAN Enterprise license has been applied.
  • Deploy the availability zone with management, vMotion and vSAN networks.
  • These networks should have L2 connectivity between availability zones. The vSAN network should also be routable to the vSAN network of the Witness host.
  • VLAN IDs must be identical across zones.
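The VLAN requirement is easy to check before commissioning hosts. The following is a minimal sketch that compares the VLAN IDs planned for each network type in the two zones. The VLAN numbers and dictionary layout are hypothetical placeholders, not values from this environment.

```python
# Hypothetical VLAN plan per network type; replace with your own values.
az1_vlans = {"management": 1611, "vmotion": 1612, "vsan": 1613}
az2_vlans = {"management": 1611, "vmotion": 1612, "vsan": 1613}


def vlan_mismatches(zone_a, zone_b):
    """Return {network: (vlan_a, vlan_b)} for networks whose VLAN IDs differ."""
    return {
        name: (zone_a.get(name), zone_b.get(name))
        for name in sorted(set(zone_a) | set(zone_b))
        if zone_a.get(name) != zone_b.get(name)
    }


mismatches = vlan_mismatches(az1_vlans, az2_vlans)
if mismatches:
    print("VLAN mismatch:", mismatches)
else:
    print("VLAN IDs match across both availability zones")
```

A network that is missing from one zone shows up with `None` on that side, so an incomplete plan is flagged as well as a conflicting one.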

Prepare the Cluster

The management domain we want to stretch is on AZ1 and contains the default cluster “SDDC-Cluster1”. This cluster contains four ESXi hosts.

AZ1 has the default bring-up network pool “bringup-networkpool”.

Commission Hosts

  • Create a network pool “bringup1-networkpool” on AZ2. Ensure the VLAN IDs are identical.
  • Commission four hosts in AZ2 and associate them with “bringup1-networkpool”.

You can see the four newly commissioned hosts below: esxi-5.vrack.vsphere.local, esxi-6.vrack.vsphere.local, esxi-7.vrack.vsphere.local and esxi-8.vrack.vsphere.local. They are currently unassigned.

In this example, we will stretch vSAN network VLAN between the two availability zones over L2.

Witness Host

The vSAN Witness host sits on a third site outside AZ1 and AZ2 and is not managed by VCF. Prepare the host and make sure it has vSAN network connectivity to the ESXi hosts in AZ1 and AZ2.

I’ve already deployed a Witness Appliance, which we will use as the witness host for the stretched cluster. The procedure is described in the VMware Docs article “Deploying a vSAN Witness Appliance”.

The vSAN Witness we will be using is witness-vsan.vrack.vsphere.local, and its vSAN IP is 10.0.4.88.

Stretch the Cluster

Current releases of VCF have no GUI option to configure a stretched cluster, so we need to use the SoS command-line utility on SDDC Manager.

  • SSH into SDDC Manager using the “vcf” administrative account.
  • Switch to the superuser account and navigate to the /opt/vmware/sddc-support directory.
  • Enter the following command:

./sos --stretch-vsan --sc-domain <DOMAIN NAME> --sc-cluster <CLUSTER NAME> --sc-hosts <HOSTFQDN,HOSTFQDN2,…> --witness-host-fqdn <WITNESS HOST FQDN> --witness-vsan-ip <WITNESS VSAN IP> --witness-vsan-cidr <WITNESS VSAN CIDR> --esxi-license-key <LICENSE KEY>
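Since the command takes many arguments, it can help to assemble it programmatically and catch simple mistakes first. The sketch below is a hypothetical helper, not part of VCF. It uses the option names from the command above, written with the double hyphen the shell expects, and checks that the witness vSAN IP falls inside the given CIDR. It only builds and prints the command and does not run it.

```python
import ipaddress
import shlex


def build_stretch_command(domain, cluster, hosts, witness_fqdn,
                          witness_vsan_ip, witness_vsan_cidr, license_key):
    """Assemble the sos argument list after basic sanity checks."""
    if not hosts:
        raise ValueError("at least one host FQDN is required")
    network = ipaddress.ip_network(witness_vsan_cidr)  # raises on a malformed CIDR
    if ipaddress.ip_address(witness_vsan_ip) not in network:
        raise ValueError(f"{witness_vsan_ip} is not inside {witness_vsan_cidr}")
    return [
        "./sos", "--stretch-vsan",
        "--sc-domain", domain,
        "--sc-cluster", cluster,
        "--sc-hosts", ",".join(hosts),  # comma-separated, no spaces
        "--witness-host-fqdn", witness_fqdn,
        "--witness-vsan-ip", witness_vsan_ip,
        "--witness-vsan-cidr", witness_vsan_cidr,
        "--esxi-license-key", license_key,
    ]


argv = build_stretch_command(
    domain="MGMT",
    cluster="SDDC-Cluster1",
    hosts=[f"esxi-{n}.vrack.vsphere.local" for n in range(5, 9)],
    witness_fqdn="witness-vsan.vrack.vsphere.local",
    witness_vsan_ip="10.0.4.88",
    witness_vsan_cidr="10.0.4.0/24",
    license_key="AAAAA-BBBBB-CCCCC-DDDDD-EEEEE",
)
# Print a shell-safe command line for review before pasting it into the SSH session.
print(" ".join(shlex.quote(arg) for arg in argv))
```

Building the arguments as a list keeps each value separate from its option name, so a stray space in a host list or a witness IP from the wrong subnet is caught before anything reaches SDDC Manager.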

root@sddc-manager [ /opt/vmware/sddc-support ]# ./sos --stretch-vsan --sc-domain MGMT --sc-cluster SDDC-Cluster1 --sc-hosts esxi-5.vrack.vsphere.local,esxi-6.vrack.vsphere.local,esxi-7.vrack.vsphere.local,esxi-8.vrack.vsphere.local --esxi-license-key AAAAA-BBBBB-CCCCC-DDDDD-EEEEE --witness-host-fqdn witness-vsan.vrack.vsphere.local --witness-vsan-ip 10.0.4.88 --witness-vsan-cidr 10.0.4.0/24

Welcome to Supportability and Serviceability(SoS) utility!

Logs : /var/log/vmware/vcf/sddc-support/stretchCluster-2019-09-12-05-26-08-97790

Stretch Cluster operation log : /var/log/vmware/vcf/sddc-support/stretchCluster-2019-09-12-05-26-08-97790/sos.log

Starting vSAN stretched cluster operations..

vSAN stretch operation started 

Api Response:{"id":"46467e33-c2a0-4a44-ba2f-117767cfc8d4","link":null,"taskId":"46467e33-c2a0-4a44-ba2f-117767cfc8d4","resourceId":"fc4123ef-9fd0-4656-ae7a-c0b3bc155b89","resourceType":"ESXI","state":"IN_PROGRESS","description":"Stretch vSAN Cluster – SDDC-Cluster1 in VMware Cloud Foundation","errors":null,"timestamp":1568265969796}

Workflow triggered, please track the task status in SDDC Manager UI

Monitor the task in SDDC Manager UI
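The API response printed by the utility is plain JSON, so the task ID and state can be pulled out for record keeping while the task runs. This is a minimal sketch using the response from the run above. It assumes the timestamp field is epoch milliseconds, which is consistent with the time in the log directory name, and the helper name is hypothetical.

```python
import json
from datetime import datetime, timezone

# Response copied from the sos output above, with plain ASCII quotes.
api_response = (
    '{"id":"46467e33-c2a0-4a44-ba2f-117767cfc8d4","link":null,'
    '"taskId":"46467e33-c2a0-4a44-ba2f-117767cfc8d4",'
    '"resourceId":"fc4123ef-9fd0-4656-ae7a-c0b3bc155b89","resourceType":"ESXI",'
    '"state":"IN_PROGRESS",'
    '"description":"Stretch vSAN Cluster – SDDC-Cluster1 in VMware Cloud Foundation",'
    '"errors":null,"timestamp":1568265969796}'
)


def summarize_task(raw):
    """Extract the fields needed to find the task again in the SDDC Manager UI."""
    task = json.loads(raw)
    # Assumption: timestamp is milliseconds since the Unix epoch.
    started = datetime.fromtimestamp(task["timestamp"] / 1000, tz=timezone.utc)
    return {
        "task_id": task["taskId"],
        "state": task["state"],
        "description": task["description"],
        "started_utc": started.strftime("%Y-%m-%d %H:%M:%S"),
        "has_errors": task["errors"] is not None,
    }


summary = summarize_task(api_response)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The task ID is the value to look for in the SDDC Manager task list. A state of IN_PROGRESS only confirms that the workflow was accepted, not that the stretch has completed.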

Validate Stretched-cluster configuration

  • Once the configuration is complete, validate the stretched cluster fault domain configuration. Verify that the Stretched Cluster state and the Witness Host state are healthy.

Monitor vSAN Health

Verify health status of Stretched cluster and its components

  • On home page, click Hosts and Clusters. Then select SDDC-Cluster1.
  • Click Monitor -> vSAN -> Health
  • Click Retest
  • Validate overall health and Stretched cluster health in particular

We can also see that SDDC-Cluster1, initially a four-host cluster, has now expanded to an eight-host cluster. We have successfully stretched the management domain.

Day N operations for vSAN Stretched-Cluster supported in VCF

Apart from stretching a cluster, the following automated Day N operations are supported in VCF:

  • Un-stretch a Cluster – convert to a standard single-site vSAN Cluster
  • Expand a Stretched-Cluster – adding hosts
  • Replace a failed host

Conclusion

The importance of stretched clusters becomes clear when one of the sites is disrupted. In such a scenario, the organization can continue its business because the data is still accessible through the other site. Solution engineers can now accelerate deployment of workload domains and protect them from disasters by leveraging automation and orchestration within VCF.