Failover Time for vCenter Server 6.0 Protected by vSphere HA

By Jianli Shen, Kimberly Wang, and Lei Chai

There are many advantages to virtualizing vCenter Server over installing it on physical devices; these have been blogged about already. VMware best practice also recommends deploying vCenter Server as a virtual machine and using vSphere High Availability (HA) to protect it. The typical scenario is to place vCenter Server in a vSphere HA cluster, where HA protects vCenter Server just like any other virtual machine. If the ESXi host that is running vCenter Server goes down, HA kicks in and powers on the VM from shared storage on another host in the HA cluster. In this blog, we show the downtime of vCenter Server in this scenario with experiments that simulate a production deployment.

We performed our experiments with a vCenter Server VM deployed in an HA-enabled cluster, with vCenter Server itself managing an inventory of 64 hosts and 6,000 VMs. The experiments show that if the ESXi host running vCenter Server fails, vCenter Server is powered on by another host in the cluster, as expected in the HA scenario. After vCenter Server is powered on, it takes time to boot into full function; the 64 hosts/6,000 VMs are presented in the inventory once the administrator is able to log in to the vSphere Web Client again. In our experiments, the whole procedure, from the failure to the vCenter Server administrator being able to log in to the vSphere Web Client, took about 7 minutes and 40 seconds, with most of that time spent booting vCenter Server into full function. During vCenter Server downtime, customers are still able to access their VMs; only the vCenter Server administrator is impacted.

In the following sections, we describe our experiments and numbers.

Experimental Deployment Scenarios

We set up vCenter Server in an HA-enabled cluster environment and used a host simulator to simulate the 64 hosts/6,000 VMs inventory for vCenter Server. The deployment is shown in Figure 1.

Figure 1. Test-Bed Setup

In this setup:

HostA is the host where we first deploy vCenter Server 6.0. HostB is the backup host that vCenter Server fails over to in case HostA fails. HostA and HostB are managed by vCenter Server in an HA-enabled cluster, which is Cluster2 in the figure.
Both hosts are Dell PowerEdge R620 servers with Intel® Xeon® E5-2650 @ 2.00GHz processors, 128GB of memory, and 3.7TB of shared storage.
vCenter Server is a virtual appliance deployed on HostA with 16 vCPUs and 32GB of RAM.
vCenter Server also manages a large inventory of 64 hosts and 6,000 VMs, which is in Cluster1.

The figure shows what happens before and after HostA fails. During the experiment, we powered off HostA, so the vCenter Server VM went down as well. The HA agent then detected the failure of HostA and powered on, on HostB, the VMs that had been running on HostA (vCenter Server, in this case). Once vCenter Server was powered on, it recovered its inventory as it does during a normal power-on.

Performance Result

We measured the time from the point vCenter Server VM stopped responding, to the point vSphere Web Client started responding to user activity again.
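A measurement like this can be sketched as a simple polling loop. The snippet below is only an illustration and not the tooling used in these experiments: `measure_downtime` and its `is_up` probe are hypothetical names, and the probe is left abstract so that any check (for example, a request against the vSphere Web Client login page) can be plugged in.

```python
import time
from typing import Callable


def measure_downtime(
    is_up: Callable[[], bool],
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Return the seconds between the first failed probe and the next successful one.

    `is_up` is a hypothetical probe supplied by the caller; it should return
    True while the service responds and False while it does not.
    """
    # Phase 1: poll until the service stops responding.
    while is_up():
        sleep(poll_interval)
    went_down = clock()

    # Phase 2: poll until the service responds again.
    while not is_up():
        sleep(poll_interval)
    return clock() - went_down
```

The result is only as precise as `poll_interval`, so a one-second interval is a reasonable fit for downtimes measured in minutes.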

With the 64 hosts/6,000 VMs inventory, the total time is around 460 seconds (that is, 7 minutes and 40 seconds), with about 30-50 seconds for HA to get into action. The rest of the time is spent on the vCenter Server power-on process.
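As a quick check of that breakdown, the arithmetic can be written out as follows. This is a minimal sketch that uses only the approximate figures quoted above; the function name `boot_time_range` is ours.

```python
TOTAL_S = 460          # approximate failover downtime, 64 hosts/6,000 VMs
HA_RANGE_S = (30, 50)  # approximate time for vSphere HA to get into action


def boot_time_range(total_s, ha_range_s):
    """Return (min, max) seconds left for the vCenter Server power-on process."""
    ha_min, ha_max = ha_range_s
    return total_s - ha_max, total_s - ha_min


boot_min, boot_max = boot_time_range(TOTAL_S, HA_RANGE_S)
print(f"power-on: {boot_min}-{boot_max} s "
      f"({boot_min / TOTAL_S:.0%}-{boot_max / TOTAL_S:.0%} of total)")
# prints: power-on: 410-430 s (89%-93% of total)
```

In other words, roughly nine tenths of the downtime is the vCenter Server power-on process rather than HA itself.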

To see the impact of inventory size, we compared this result with similar scenarios in which vCenter Server has an inventory of 1 host/1 VM or 32 hosts/4,000 VMs. The downtime in those scenarios is around 410 seconds (6 minutes and 50 seconds) and 439 seconds (7 minutes and 19 seconds), respectively. This means the downtime does increase as the vCenter Server inventory grows. But considering the whole power-on process shown in Figure 2, the increase is small: about 50 seconds between the smallest and the largest inventory.
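To put the three measurements side by side, the short sketch below converts each downtime to minutes and seconds and reports the increase over the 1 host/1 VM baseline. It uses only the approximate numbers reported here; the dictionary and helper names are ours.

```python
# Approximate failover downtime in seconds, by inventory size.
downtime_s = {
    "1 host/1 VM": 410,
    "32 hosts/4,000 VMs": 439,
    "64 hosts/6,000 VMs": 460,
}


def fmt(seconds):
    """Format a whole number of seconds as 'M min S s'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs} s"


baseline = downtime_s["1 host/1 VM"]
for name, secs in downtime_s.items():
    extra = secs - baseline
    print(f"{name}: {fmt(secs)} (+{extra} s, {extra / baseline:.1%} over baseline)")
# prints:
# 1 host/1 VM: 6 min 50 s (+0 s, 0.0% over baseline)
# 32 hosts/4,000 VMs: 7 min 19 s (+29 s, 7.1% over baseline)
# 64 hosts/6,000 VMs: 7 min 40 s (+50 s, 12.2% over baseline)
```

Going from a nearly empty inventory to 64 hosts/6,000 VMs adds about 12% to the downtime.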

As a reference, we looked at the vCenter Server reboot time. We issued a vCenter Server reboot command, waited until it was back to full function, and measured the latency for this period. We then compared it to the total time it takes to fail over vCenter Server, which includes its reboot time. As shown in Figure 2, failing over vCenter Server from HostA to HostB takes around 30-50 seconds longer than rebooting vCenter Server on HostA. This 30-50 second period is the time needed for the HA agent to detect the failure on HostA and then issue the command to power on vCenter Server on HostB. Figure 3 gives a closer look at the time each failover activity takes.

Figure 2. Difference in seconds between normal reboot and HA failover (reboot + HA) in very small, medium, and large inventories

Figure 3. Total time of failover, including time it takes to reboot during failover and time for vSphere HA activity

Summary

In this article, we show that deploying vCenter Server as a VM makes it easy to protect with vSphere HA. The solution is feasible in practice, and the downtime is within an expected tolerance level. Administrators will be able to regain control of vCenter Server in less than 10 minutes. Meanwhile, customers will still have their services running on their VMs in Cluster1.