In a recent blog, the VMware vSphere team shared the following performance improvements in vSphere 6.7 vs. 6.5:

Moreover, with vSphere 6.7 vCSA delivers phenomenal performance improvements (all metrics compared at cluster scale limits, versus vSphere 6.5):
2X faster performance in vCenter operations per second
3X reduction in memory usage
3X faster DRS-related operations (e.g. power-on virtual machine)

As senior engineers within the VMware Performance and vSphere teams, we are writing this blog to provide more details regarding these numbers and to explain how we measured them. We also briefly explain some of the technical details behind these improvements.

Cluster Scale

Let us first explain what all metrics compared at cluster scale limits means. What is cluster scale? Here, it is an environment that includes a vCenter server that is configured for the largest vSphere cluster that VMware currently supports, namely 64 hosts and 8,000 powered-on VMs. This setup represents a high-consolidation environment with 125 VMs per host. Note that this is different from the setup used in our previous blog about vCenter 6.5 performance improvements. The setup in that blog was our datacenter scale environment, which used the largest number of supported hosts (2000) and VMs (25,000), so the numbers from that blog should not be compared to these numbers.

2x and 3x

Let us now explain some of the performance numbers quoted. We produced the numbers by measuring workload runs in our cluster scale setup.

2x vCenter Operations Per Second, vSphere 6.7 vs. 6.5, cluster scale limits. We measure operations per second using an internal benchmark called vcbench. We describe vcbench below under “Benchmark Details.” One of the outputs of this workload is management operations (for example, clone, powerOn, vMotion) performed per second.

In 6.5, vCenter was capable of performing approximately 8.3 vcbench operations per second (described below under “Benchmark Details”) in the cluster-scale testbed.
In 6.7, vCenter is now capable of performing approximately 16.7 vcbench operations per second.

3x reduction in memory usage. In addition to our vcbench workload, we also include a simplified workload that simply executes a standard workflow: create a VM, power it on, power it off, and delete it. The rapid powerOn and powerOff of VMs in this setup puts more load on the DRS subsystem than the typical vcbench test.

In 6.5, the core vCenter process (vpxd) used on average about 10 GB to complete the workflow benchmark (described below under “Benchmark Details”).
In 6.7, the core vCenter process used approximately 3 GB to complete this run, while also achieving higher churn (that is, more workflow ‘create/powerOn/powerOff/delete’ cycles completed within the same time period).

3x faster DRS-related operations. In our vcbench workload, we measure not just the overall operations per second, but also the average latencies of individual operations like powerOn (which exercises the majority of the DRS software stack). We issue many concurrent operations, and we measure latency under such load.

In 6.5, the average latency of an individual powerOn during a vcbench run was 9.5 seconds.
In 6.7, the average latency of an individual powerOn during a vcbench run was 2.8 seconds.

The latencies above reflect the fact that a cluster has 8,000 VMs and many operations in flight at once. As a result, individual operations are slower than if they were simply run on a single-host, single-VM environment.

What does this mean to you as a customer?

As a result of these improvements, customers in high-consolidation environments will see reduced resource consumption due to DRS and reduced latency to generate VMotions for load balancing. Customers will also see faster initial placement of VMs.

Brief Deep Dive into Improvements

Before we describe the improvements, let us first briefly explain how DRS works, at a very high level.

When powering on a VM, vCenter must determine where to place the VM. This is called initial placement. Many subsystems, including DRS and policy management, must be consulted to determine valid hosts on which this VM can run. This phase is called constraint check. Once DRS determines the host on which a VM should be powered on, it registers the VM onto that host and issues the powerOn. This initial placement requires a snapshot of the inventory: by snapshot, we mean that DRS records the current configuration of hosts and VMs in the cluster.

In addition to balancing during initial placement, every 5 minutes, DRS re-examines the load of the cluster and performs a series of computations to generate VMotions that help balance the load across hosts. This phase is called periodic rebalancing. Periodic rebalancing requires an examination of the historical utilization statistics for each host and VM (for example, over the previous hour) in order to determine proper placement.

Finally, as VMs get moved around, the used capacity in resource pools changes. The vCenter server periodically exchanges messages called SpecSyncs with each host to push down the most recent resource pool configuration. The SpecSync operation requires traversing a host’s resource pool structure and changing it to make sure it matches vCenter’s configuration.

With this understanding in mind, let us now give some technical details behind the improvements above. As with our previous blog about vCenter performance improvement, we describe changes in terms of rocks (that is, somewhat large changes to entire subsystems) and pebbles (smaller individual changes throughout the code base).

Rocks

The three main rocks that we address in 6.7 are simplified initial placement, faster resource divvying, and faster SpecSyncs.

Simplified initial placement. As mentioned above, initial placement relies on a snapshot of the current state of the inventory. Creating this snapshot can be a heavyweight operation, requiring a number of data copies and locking of host and cluster data structures to ensure a consistent view of the data. In 6.7, we moved to a lightweight online approach that keeps the state up-to-date in a continuous manner, avoiding snapshots altogether. With this approach, we significantly reduce locking demands and significantly reduce the number of times we need to copy data. In some highly-contended clusters, this reduced the initial placement time from seconds down to milliseconds.

Faster (and more frequent) resource divvying. Divvying is the act of determining the resource allocations for each VM. Every five minutes, the state of the cluster is examined and both divvying and then rebalancing (using VMotion) are performed. To make the divvying phase faster, we performed a number of optimizations.

We changed the approach to examining historical usage statistics. Instead of storing metrics for every VM and every host over an hour, we aggregated the data, which allowed us to store a smaller number of metrics per host. This dramatically reduced memory usage and simplified the computation of the desired load for each host.
We restructured the code to remove compatibility checks (for example, those that help to determine which VMs can run on which hosts) during this divvying phase. In 6.5 and earlier, divvying a load also involved various host/VM compatibility calculations. Now, we store the compatible matrix and update it when compatibility changes, which is typically infrequent.
We have also done significant code refactoring (described below under “Pebbles”) to this code path.

By implementing these changes and making divvying faster in 6.7, we are now able to perform divvying more frequently: once per minute instead of once every five minutes. As a result, resources flow more quickly between resource pools, and we are better able to enforce fairness guarantees in DRS clusters.

Note that periodic load balancing (through VMotion) still occurs every five minutes.

Faster SpecSyncs. To perform a SpecSync, vCenter sends a resource pool configuration to a host. The host must examine this configuration and create a list of changes required to bring that host in sync with vCenter. In 6.5 and earlier, depending on the number of VMs, creating this list of changes could result in hundreds of operations on a host, and the runtime was highly variable. In 6.7, we made this computation more deterministic, reducing the number of operations and lowering the latency appropriately.

Pebbles

In addition to the changes above, we also performed a number of optimizations throughout our code base.

Code Refactoring. In 6.5 and before, admission control decisions were made by multiple independent subsystems within vCenter (for example, DRS would be responsible for some decisions, and HA would make others). In 6.7, we simplified our code such that all admission control decisions are handled by a module within DRS. Reducing multiple copies of this code simplifies debugging and reduces resource usage.

Finer-grained locks. In 6.7, we continued to make strides in reducing the scope of our locks. We introduced finer-grained locks so that DRS would not have to lock an entire VM to examine certain pieces of state. We made similar improvements to both hosts and clusters.

Removal of unnecessary classes, maps, and sets. In refactoring our code, we were able to remove a number of classes and thereby reduce the number of copies of data in our system. The maps and sets that were needed to support these classes could also be removed.

Preferring integers over strings. In many situations, we replaced strings and string comparisons with integers and integer comparisons. This dramatically reduces memory allocation overhead and speeds up comparisons, reducing CPU.

Benchmark Details

We measure Operations Per Second (OPS) using a VMware benchmark called vcbench. (For more details about vcbench, see “Benchmarking Details” in vCenter 6.5 Performance: what does 6x mean?) Briefly, vcbench is a java-based application that takes as an input a runlist, which is a list of management operations to perform. These operations include powering on VMs, cloning VMs, reconfiguring VMs, and migrating VMs, among many other operations. We chose the operations based on an analysis of typical customer management scenarios. The vcbench application uses vSphere APIs to issue these operations to vCenter. The benchmark creates multiple threads and issues operations on those threads in parallel (up to 32). We measure the operations per second by taking the total number of operations performed in a given timeframe (say, 1 hour), and dividing it by the time interval.

Our workflow benchmark is very similar to vcbench. The main difference is that more operations are issued per host at a time. In vcbench, one operation is issued per host at a time, while in workflow, up to 8 operations are issued per host at a time.

In many cases, the size of the VM has an impact on operational latency. For example, a VM with a lot of memory (say, 32 GB) and large disks (say, 100 GB) will take longer to clone because more memory will need to be snapshotted, and more disk data will need to be copied. To minimize the impact of the disk subsystem in our measurements, we use small VMs (<4GB memory, < 8GB disk).

Because we limit ourselves to 32 threads per vCenter in this single-cluster setup, throughput numbers are smaller than for our datacenter-at-scale setups (2,000 hosts; 25,000 VMs), which use up to 256 concurrent threads per vCenter.

Summary

In this blog, we have described some of the performance improvements in vCenter from 6.5 to 6.7. A variety of improvements to DRS have led to improved throughput and reduced resource usage for our vcbench workload in a cluster scale setup of 64 hosts and 8,000 powered-on VMs. Many of these changes also apply to larger datacenter-scale setups, although the scope of improvement may not be as pronounced.

Acknowledgments

The vCenter improvements described in this blog are the results of thousands of person-hours from vCenter developers, performance engineers, and others throughout VMware. We are deeply grateful to them for making this happen.

Authors

Zhelong Pan is a senior staff engineer in the Distributed Resource Management Team at VMware. He works on cluster management, including shared resource allocation, VM placement, and load balancing. He is interested in performance optimizations, including virtualization performance and management software performance. He has been at VMware since 2006.

Ravi Soundararajan is a principal engineer in the Performance Group at VMware. He works on vCenter performance and scalability, from the UI to the server to the database to the hypervisor management agents. He has been at VMware since 2003, and he has presented on the topic of vCenter Performance at VMworld from 2013-2017. His Twitter handle is @vCenterPerfGuy.