Home > Blogs > VMware VROOM! Blog > Tag Archives: benchmarking

Tag Archives: benchmarking

New Scheduler Option for vSphere 6.7 U2

Along with the recent release of VMware vSphere 6.7 U2, we published a new whitepaper that shows the performance of a new scheduler option that was included in the 6.7 U2 update.  We referred to this new scheduler option internally as the “sibling” scheduler, but the official name is the side-channel aware scheduler version 2, or SCAv2.  The whitepaper includes full details about SCAv1 and SCAv2, the L1TF security vulnerability that made them necessary, and the performance implications with several different workload types.  This blog is a brief overview of the key points, but we recommend that you check out the full document.

In August of 2018, a security vulnerability known as L1TF, affecting systems using Intel processors, was revealed, and patches and remediations were also made available. Intel provided micro-code updates for its processors, operating system patches were made available, and VMware provided an update for vSphere. The full details of the vCenter and ESXi patches are in a VMware security advisory that links to individual KB articles.

The ESXi-provided patches included a side-channel aware scheduler (SCAv1) that mitigated the concurrent-context attack vector for L1TF. Once that mode was enabled, the scheduler would only schedule processes on one thread for each core. This mode impacted performance mostly from a capacity standpoint because the system was no longer able to use both hyper-threads on a core. A server that was already fully utilized and running at maximum capacity would see a decrease in capacity of up to approximately 30%. A server that was running at 75% of capacity would see a much smaller impact to performance, but CPU utilization would rise.

In vSphere 6.7 U2, the side-channel aware scheduler has been enhanced (SCAv2) with a new policy to allow hyper-threads to be used concurrently if both threads are running vCPU contexts from the same VM. In this way, L1TF side channels are constrained to not expose information across VM/VM or VM/hypervisor boundaries.

Performance testing with several different workloads found a range of impact in performance for both SCAv1 and SCAv2 as compared to the default scheduler as the baseline of performance. If SCAv1 or SCAv2 were able to achieve the same performance, it would be 1.0, and if it achieved 75% of the performance, it would be .75.  The graphs here show the performance impact at max server utilization and the impact at the reduced load of approximately 75% utilization.

The charts show that the SCAv2 scheduler, represented by the third bar in each group, recovers a significant percentage of performance in all cases, except for the monster VM test case.  The monster VM test case was for a single large Oracle database VM that consumed an entire 4 socket host with 192 vCPUs.  In configurations with a single large monster VM that uses all the logical threads of the host, SCAv1 had a slight performance advantage over SCAv2 in our testing.

The reduced load numbers show that at server usage levels of approximately 75%, the overall impact to performance is much lower. With SCAv2 and the overall load below 75%, tests show that the largest performance impact measured in these tests was 11%. The SCAv2 scheduler option, available in vSphere 6.7 U2, provides better performance than SCAv1 in almost all cases.

For full details about the individual benchmark tests as well as more details about L1TF and VMware’s response to it, please see the full whitepaper and VMware KB 55806.

First VMmark 3.1 Publications, Featuring New Cascade Lake Processors

VMmark is a free tool used by hardware vendors and others to measure the performance, scalability, and power consumption of virtualization platforms.  If you’re unfamiliar with VMmark 3.x, each tile is a grouping of 19 virtual machines (VMs) simultaneously running diverse workloads commonly found in today’s data centers, including a scalable Web simulation, an E-commerce simulation (with backend database VMs), and standby/idle VMs.

As Joshua mentioned in a recent blog post, we released VMmark 3.1 in February, adding support for persistent memory, improving workload scalability, and better reflecting secure customer environments by increasing side-channel vulnerability mitigation requirements.

I’m happy to announce that today we published the first VMmark 3.1 results.  These results were obtained on systems meeting our industry-leading side-channel-aware mitigation requirements, thus continuing the benchmark’s ability to provide an indication of real-world performance.

Some mitigations for recently-discovered side-channel vulnerabilities (i.e., Spectre, Meltdown, and L1TF) incur significant performance impacts, while others have little or no impact.  Today’s VMmark results demonstrate that even when additional mitigations are in place, ESXi hosts using the new 2nd-Generation Intel® Xeon® Scalable processors obtain higher VMmark scores than comparable 1st-Generation Intel Xeon Scalable processors.  This is due to processor design improvements that reduce (or even negate) the performance impact of security mitigations, by mitigating some of the security vulnerabilities in hardware rather than in software.

These results, from Fujitsu, span all three VMmark publication categories:

  1. Performance Only (9.02 @ 9 tiles)
  2. Performance with Server Power (6.3290 @ 9 tiles)
  3. Performance with Server and Storage Power (3.5013 @ 9 tiles)

So, how does this new performance result with Cascade Lake processors compare to the previous generation with Skylake processors?  Hopefully a graph is worth a thousand words 😊…

Fujitsu Skylake to Cascade Lake Graph

As you can see, Fujitsu was able to achieve a higher score, while being able to run an additional tile (19 more VMs) and still meeting strict Quality-of-Service (QoS) compliance requirements imposed by the VMmark benchmark harness.

Industry-Leading Side-Channel Mitigation Requirements
Given the numerous security vulnerabilities recently identified, we set a high bar in VMmark 3.1 that requires all applicable security mitigations in benchmarked environments to best represent secure, real-world customer environments.

These are the current security mitigation requirements for VMmark 3.1:

VMmark 3.1 Security Mitigations Table

VMmark 3.1 Security Mitigations Table

Note: If “N/A” is listed, that vulnerability does not apply to that portion of the stack.

For more information about VMmark, please visit the VMmark product page.

If you have any questions or feedback, please leave us a comment below.  Thanks!

IoT Analytics Benchmark adds neural network–based deep learning with Keras and BigDL

The IoT Analytics Benchmark released last year dealt with an important Internet of Things use case—monitoring factory sensor data for impending failure conditions. This year, we are tackling an equally important use case—image classification. Whether used in facial recognition, license plate readers, inspection systems, or autonomous vehicles, neural network–based deep learning is making image detection and classification a viable technology.

As in the classic machine learning used in the original IoT Analytics Benchmark code (which used the Spark Machine Learning Library), the new deep learning code first trains a model using pre-labeled images and then deploys that model to infer the classification of new images. For IoT this inference step is the most important. Thus, the new programs, designated as IoT Analytics Benchmark DL, use previously trained models (included in the kit) to demonstrate inferencing that can be performed at the edge (on small gateway systems) or in scaled-out Spark clusters.

The programs run Keras and Intel’s BigDL image classifiers with the CIFAR10 image set. For each type of classifier, there is a program that sends the images as a series of encoded strings and a second program that reads those strings, converts them back to images, and infers which of the 10 CIFAR10 classes that image belongs to. The Keras classifier is a Python-based single node program for running on an IoT edge gateway. The BigDL classifier is a Spark-based distributed program. The programs use Intel’s BigDL library and the CIFAR10 dataset. (Also see Learning Multiple Layers of Features from Tiny Images, by Alex Krizhevsky.)

The CIFAR10 image set consists of 50,000 pre-labeled training images and 10,000 pre-labeled test images. Each image is a 32 x 32 color image from one of ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship or truck. For example, here’s a ship, frog, and truck:

   

Here’s what the Python-based Keras program looks like running a complex ResNet model on a small, virtualized edge gateway system:

First, the inference program is started on the VM on the edge gateway using a pre-trained ResNet model included in the kit:

[root@iotdemo ~]# nc -lk 10000 | python3 infer_cifar.py --modelPath cifar10_ResNet20v1_model_91470.h5
Using TensorFlow backend.
Loaded trained model cifar10_ResNet20v1_model_91470.h5
Start send program
2019-01-31T04:09:37Z: 100 images classified
...
2019-01-31T04:11:06Z: 1000 images classified
Inferenced 1000 images in 99.3 seconds or 10.1 images/second, with 916 or 91.6% correctly classified

Then, when the inference program prints out “Start send program”, the send program is started from a driver system, in this case the author’s Mac:

[djaffe@djaffe-a01 ~/code/neuralnetworks/BigDL]$ python3 send_images_cifar.py -s -i 100 -t 1000 | \
  nc 192.168.2.3 10000
Using TensorFlow backend.
2019-01-31T04:09:12Z: Loading and normalizing the CIFAR10 data
2019-01-31T04:09:22Z: Sending 100 images per second for a total of 1000 images with pixel mean
subtracted
2019-01-31T04:09:31Z: 100 images sent
...
2019-01-31T04:11:00Z: 1000 images sent
2019-01-31T04:11:00Z: Image stream ended

We are planning to use the new workloads in several VMware projects. As always, please send us your feedback and contributions!

vSAN Performance Diagnostics Now Shows “Specific Issues and Recommendations” for HCIBench

By Amitabha Banerjee and Abhishek Srivastava

The vSAN Performance Diagnostics feature, which helps customers to optimize their benchmarks or their vSAN configurations to achieve the best possible performance, was first introduced in vSphere 6.5 U1. vSAN Performance Diagnostics is a “cloud connected” feature and requires participation in the VMware Customer Experience Improvement Program (CEIP). Performance metrics and data are collected from the vSAN cluster and are sent to the VMware Cloud. The data is analyzed and the results are sent back for display in the vCenter Client. These results are shown as performance issues, where each issue includes a problem with its description and a link to a KB article.

In this blog, we describe how vSAN Performance Diagnostics can be used with HCIBench and show the new feature in vSphere 6.7 U1 that provides HCIBench specific issues and recommendations.

What is HCIBench?

HCIBench (Hyper-converged Infrastructure Benchmark) is a standard benchmark that vSAN customers can use to evaluate the performance of their vSAN systems. HCIBench is an automation wrapper around the popular and proven VDbench open source benchmark tool that makes it easier to automate testing across an HCI cluster. HCIBench, available as a fling, simplifies and accelerates customer performance testing in a consistent and controlled way.

Example: Achieving Maximum IOPS

As an example, consider the following HCIBench workload that is run on a vSAN system:

Number of
VMs
Number of
Disks (vmdks)
to Test
Number of
Threads/Disk
Working Set
Percentage
Block Size Read/Write
Percentage
Randomness
Percentage
1 10 1 100 4 KB 0/100 100

If the goal is to achieve maximum IOs per second (IOPS), vSAN Performance Diagnostics for this workload yields the result shown in figure 1.

Figure 1

In this example, vSAN Performance Diagnostics reports, “The Outstanding IOs for the benchmark is too low to achieve the desired goal”. Here we can see that the feedback from vSAN Performance Diagnostics tells us about the problem, and a possible solution recommends that we increase the number of outstanding IOs to a value of 2 per host. The linked “Ask VMware” article explains this issue and what we must do with the benchmark in more detail.

The new HCIBench Exceptions and Recommendations feature removes the need to read through a KB article by precisely mapping recommendations to one (or more) configurable parameters of HCIBench.

Now let us check how vSAN Performance Diagnostics works with vSphere 6.7 U1, which has access to this new feature (figure 2).

Figure 2

Now we can clearly see an exact issue in our HCIBench workload configuration that provides the precise recommendation to resolve this issue with the message: “Increase number of threads per disk from 1 to 2”.

Let us monitor the current write IOPS generated by the benchmark for our reference. We go to Data Center–>Cluster–>Monitor–>vSAN–>Performance (figure 3).

Figure 3

Now, as part of the evaluation, we apply the recommendation generated by vSAN Performance Diagnostics and check its impact. We now run the new HCIBench workload configuration with 2 threads per disk and the following parameters:

Number of
VMs
Number of
Disks (vmdks)
to Test
Number of
Threads/Disk
Working Set
Percentage
Block Size Read/Write
Percentage
Randomness
Percentage
1 10 2 100 4KB 0/100 100

After the benchmark completes, we use vSAN Performance Diagnostics again to see if we can now achieve the required goal of maximum IOPS. The result from Performance Diagnostics now shows “No Issues were found”, which means that we are achieving good IOPS from the vSAN system (figure 4).

Figure 4

Now, as part of actual verification, we can see the change in IOPS after applying the recommendation. From figure 5 below (screenshot from Data Center–>Cluster–>Monitor–>vSAN–>Performance), we can clearly see that there is a 25-30% increase in IOPS after applying the recommendation, which verifies that it helped us achieve our goal.

Figure 5

We believe that this feature will be very useful for customers who want to tune their HCIBench workload for a desired goal.

Prerequisites

  • Please note that this feature is currently integrated with HCIBench 1.6.7 or later.
  • This feature is available for vSphere 6.7 U1 and newer releases. It will not be available to patch releases of vSphere 6.7.

SPBM compliance check just got faster in vSphere 6.7 U1!

vSphere 6.7 U1 includes several enhancements in Storage Policy-Based Management (SPBM) to significantly reduce CPU use and generate a much faster response time for compliance checking operations.

SPBM is a framework that allows vSphere users to translate their workload’s storage requirements into rules called storage policies. Users can apply storage policies to virtual machines (VMs) and virtual machine disks (VMDKs) using the vSphere Client or through the VMware Storage Policy API’s rich set of managed objects and methods. One such managed object is PbmComplianceManager. One of its methods, PbmCheckCompliance, helps users determine whether or not the storage policy attached to their VM is being honored.

PbmCheckCompliance is automatically invoked soon after provisioning operations such as creating, cloning, and relocating a VM. It is also automatically triggered in the background once every 8 hours to help keep the compliance records up-to-date.

In addition, users can invoke the method when checking compliance for a VM storage policy in the vSphere Client, or through the VMware Storage Policy API method PbmCheckCompliance.

We did a study in our lab to compare the performance of PbmCheckCompliance between vSphere 6.5 U2 and vSphere 6.7 U1. We present this comparison in the form of charts showing the latency (normalized on a 100-point scale) of PbmCheckCompliance for varying numbers of VMs.

The following chart compares the performance of PbmCheckCompliance on VMFS and vSAN environments.

As we see from the above chart, PbmCheckCompliance returns results much faster in vSphere 6.7 U1 compared to 6.5 U2. The improvement is seen across all inventory sizes and all datastore types and become more prominent for larger inventories and higher numbers of VMs.

The enhancements also positively impact a similar method, PbmCheckRollupCompliance. This method also returns the compliance status of VMs and adds compliance results for all disks associated with these VMs. The following chart represents the performance comparison of PbmCheckRollupCompliance on VMFS and vSAN environments.

Our experiments show that compliance check operations are significantly faster and more light-weight in vSphere 6.7 U1.

New white paper: Big Data performance on VMware Cloud on AWS: Spark machine learning and IoT analytics performance on-premises and in the cloud

By Dave Jaffe

A new white paper is available comparing Spark machine learning performance on an 8-server on-premises cluster vs. a similarly configured VMware Cloud on AWS cluster.

Here is what the VMware Cloud on AWS cluster looked like:

Screenshot of cluster configuration

VMware Cloud on AWS configuration for performance tests

Three standard analytic programs from the Spark machine learning library (MLlib), K-means clustering, Logistic Regression classification, and Random Forest decision trees, were driven using spark-perf. In addition, a new, VMware-developed benchmark, IoT Analytics Benchmark, which models real-time machine learning on Internet-of-Things data streams, was used in the comparison. The benchmark is available from GitHub.

As seen in the charts below, performance was very similar on-premises and on VMware Cloud on AWS.

Spark machine learning performance chart

Spark machine learning performance

IoT Analytics performance chart

IoT Analytics performance

 

All details are in the paper.

Persistent Memory Performance in vSphere 6.7

We published a paper that shows how VMware is helping advance PMEM technology by driving the virtualization enhancements in vSphere 6.7. The paper gives a detailed performance analysis of using PMEM technology on vSphere using various workloads and scenarios.

These are the key points that we cover in this white paper:

  • We explain how PMEM can be configured and used in a vSphere environment.
  • We show how applications with different characteristics can take advantage of PMEM in vSphere. Below are some of the use-cases:
    • How PMEM device limits can be achieved under vSphere with little to no overhead of virtualization. We show virtual-to-native ratio along with raw bandwidth and latency numbers from fio, an I/O microbenchmark.
    • How traditional relational databases like Oracle can benefit from using PMEM in vSphere.
    • How scaling-out VMs in vSphere can benefit from PMEM. We used Sysbench with MySQL to show such benefits.
    • How modifying applications (PMEM-aware) can get the best performance out of PMEM. We show performance data from such applications, e.g., an OLTP database like SQL Server and an in-memory database like Redis.
    • Using vMotion to migrate VMs with PMEM which is a host-local device just like NVMe SSDs. We also characterize in detail, vMotion performance of VMs with PMEM.
  • We outline some best practices on how to get the most out of PMEM in vSphere.

Read the full paper here.

Oracle Database Performance with VMware Cloud on AWS

You’ve probably already heard about VMware Cloud on Amazon Web Services (VMC on AWS). It’s the same vSphere platform that has been running business critical applications for years, but now it’s available on Amazon’s cloud infrastructure. Following up on the many tests that we have done with Oracle databases on vSphere, I was able to get some time on a VMC on AWS setup to see how Oracle databases perform in this new environment.

It is important to note that VMC on AWS is vSphere running on bare metal servers in Amazon’s infrastructure. The expectation is that performance will be very similar to “regular” onsite vSphere, with the added advantage that the hardware provisioning, software installation, and configuration is already done and the environment is ready to go when you login. The vCenter interface is the same, except that it references the Amazon instance type for the server.

Our VMC on AWS instance is made up of four ESXi hosts. Each host has two 18-core Intel Xeon E5-2686 v4 (aka Broadwell) processors and 512 GB of RAM. In total, the cluster has 144 cores and 2 TB of RAM, which gives us lots of physical resources to utilize in the cloud.

In our test, the database VMs were running Red Hat Enterprise Linux 7.2 with Oracle 12c. To drive a load against the database VMs, a single 18 vCPU driver VM was running Windows Server 2012 R2, and the DVD Store 3 test workload was also setup on the cluster. A 100 GB test DS3 database was created on each of the Oracle database VMs. During testing, the number of threads driving load against the databases were increased until maximum throughput was achieved, which was around 95% CPU utilization. The total throughput across all database servers for each test is shown below.

 

In this test, the DB VMs were configured with 16 vCPUs and 128 GB of RAM. In the 8 VMs test case, a total of 128 vCPUs were allocated across the 144 cores of the cluster. Additionally the cluster was also running the 18 vCPU driver VM,  vCenter, vSAN, and NSX. This makes the 12 VM test case interesting, where there were 192 vCPUs for the DB VMs, plus 18 vCPUs for the driver. The hyperthreads clearly help out, allowing for performance to continue to scale, even though there are more vCPUs allocated than physical cores.

The performance itself represents scaling very similar to what we have seen with Oracle and other database workloads with vSphere in recent releases. The cluster was able to achieve over 370 thousand orders per minute with good scaling from 1 VM to 12 VMs. We also recently published similar tests with SQL Server on the same VMC on AWS cluster, but with a different workload and more, smaller VMs.

UPDATE (07/30/2018): The whitepaper detailing these results is now available here.

vCenter performance improvements from vSphere 6.5 to 6.7: What does 2x mean?

In a recent blog, the VMware vSphere team shared the following performance improvements in vSphere 6.7 vs. 6.5:

Moreover, with vSphere 6.7 vCSA delivers phenomenal performance improvements (all metrics compared at cluster scale limits, versus vSphere 6.5):
2X faster performance in vCenter operations per second
3X reduction in memory usage
3X faster DRS-related operations (e.g. power-on virtual machine)

As senior engineers within the VMware Performance and vSphere teams, we are writing this blog to provide more details regarding these numbers and to explain how we measured them. We also briefly explain some of the technical details behind these improvements.

Cluster Scale

Let us first explain what all metrics compared at cluster scale limits means. What is cluster scale? Here, it is an environment that includes a vCenter server that is configured for the largest vSphere cluster that VMware currently supports, namely 64 hosts and 8,000 powered-on VMs. This setup represents a high-consolidation environment with 125 VMs per host. Note that this is different from the setup used in our previous blog about vCenter 6.5 performance improvements. The setup in that blog was our datacenter scale environment, which used the largest number of supported hosts (2000) and VMs (25,000), so the numbers from that blog should not be compared to these numbers.

2x and 3x

Let us now explain some of the performance numbers quoted. We produced the numbers by measuring workload runs in our cluster scale setup.

2x vCenter Operations Per Second, vSphere 6.7 vs. 6.5, cluster scale limits. We measure operations per second using an internal benchmark called vcbench. We describe vcbench below under “Benchmark Details.” One of the outputs of this workload is management operations (for example, clone, powerOn, vMotion) performed per second.

  • In 6.5, vCenter was capable of performing approximately 8.3 vcbench operations per second (described below under “Benchmark Details”) in the cluster-scale testbed.
  • In 6.7, vCenter is now capable of performing approximately 16.7 vcbench operations per second.

3x reduction in memory usage. In addition to our vcbench workload, we also include a simplified workload that simply executes a standard workflow: create a VM, power it on, power it off, and delete it. The rapid powerOn and powerOff of VMs in this setup puts more load on the DRS subsystem than the typical vcbench test.

  • In 6.5, the core vCenter process (vpxd) used on average about 10 GB to complete the workflow benchmark (described below under “Benchmark Details”).
  • In 6.7, the core vCenter process used approximately 3 GB to complete this run, while also achieving higher churn (that is, more workflow ‘create/powerOn/powerOff/delete’ cycles completed within the same time period).

3x faster DRS-related operations. In our vcbench workload, we measure not just the overall operations per second, but also the average latencies of individual operations like powerOn (which exercises the majority of the DRS software stack). We issue many concurrent operations, and we measure latency under such load.

  • In 6.5, the average latency of an individual powerOn during a vcbench run was 9.5 seconds.
  • In 6.7, the average latency of an individual powerOn during a vcbench run was 2.8 seconds.

The latencies above reflect the fact that a cluster has 8,000 VMs and many operations in flight at once. As a result, individual operations are slower than if they were simply run on a single-host, single-VM environment.

What does this mean to you as a customer?

As a result of these improvements, customers in high-consolidation environments will see reduced resource consumption due to DRS and reduced latency to generate VMotions for load balancing. Customers will also see faster initial placement of VMs.

Brief Deep Dive into Improvements

Before we describe the improvements, let us first briefly explain how DRS works, at a very high level.

When powering on a VM, vCenter must determine where to place the VM. This is called initial placement. Many subsystems, including DRS and policy management, must be consulted to determine valid hosts on which this VM can run. This phase is called constraint check. Once DRS determines the host on which a VM should be powered on, it registers the VM onto that host and issues the powerOn. This initial placement requires a snapshot of the inventory: by snapshot, we mean that DRS records the current configuration of hosts and VMs in the cluster.

In addition to balancing during initial placement, every 5 minutes, DRS re-examines the load of the cluster and performs a series of computations to generate VMotions that help balance the load across hosts. This phase is called periodic rebalancing. Periodic rebalancing requires an examination of the historical utilization statistics for each host and VM (for example, over the previous hour) in order to determine proper placement.

Finally, as VMs get moved around, the used capacity in resource pools changes. The vCenter server periodically exchanges messages called SpecSyncs with each host to push down the most recent resource pool configuration. The SpecSync operation requires traversing a host’s resource pool structure and changing it to make sure it matches vCenter’s configuration.

With this understanding in mind, let us now give some technical details behind the improvements above. As with our previous blog about vCenter performance improvement, we describe changes in terms of rocks (that is, somewhat large changes to entire subsystems) and pebbles (smaller individual changes throughout the code base).

Rocks

The three main rocks that we address in 6.7 are simplified initial placement, faster resource divvying, and faster SpecSyncs.

Simplified initial placement. As mentioned above, initial placement relies on a snapshot of the current state of the inventory. Creating this snapshot can be a heavyweight operation, requiring a number of data copies and locking of host and cluster data structures to ensure a consistent view of the data. In 6.7, we moved to a lightweight online approach that keeps the state up-to-date in a continuous manner, avoiding snapshots altogether. With this approach, we significantly reduce locking demands and significantly reduce the number of times we need to copy data. In some highly-contended clusters, this reduced the initial placement time from seconds down to milliseconds.

Faster (and more frequent) resource divvying. Divvying is the act of determining the resource allocations for each VM. Every five minutes, the state of the cluster is examined and both divvying and then rebalancing (using VMotion) are performed. To make the divvying phase faster, we performed a number of optimizations.

  • We changed the approach to examining historical usage statistics. Instead of storing metrics for every VM and every host over an hour, we aggregated the data, which allowed us to store a smaller number of metrics per host. This dramatically reduced memory usage and simplified the computation of the desired load for each host.
  • We restructured the code to remove compatibility checks (for example, those that help to determine which VMs can run on which hosts) during this divvying phase. In 6.5 and earlier, divvying a load also involved various host/VM compatibility calculations. Now, we store the compatible matrix and update it when compatibility changes, which is typically infrequent.
  • We have also done significant code refactoring (described below under “Pebbles”) to this code path.

By implementing these changes and making divvying faster in 6.7, we are now able to perform divvying more frequently: once per minute instead of once every five minutes. As a result, resources flow more quickly between resource pools, and we are better able to enforce fairness guarantees in DRS clusters.

Note that periodic load balancing (through VMotion) still occurs every five minutes.

Faster SpecSyncs. To perform a SpecSync, vCenter sends a resource pool configuration to a host. The host must examine this configuration and create a list of changes required to bring that host in sync with vCenter. In 6.5 and earlier, depending on the number of VMs, creating this list of changes could result in hundreds of operations on a host, and the runtime was highly variable. In 6.7, we made this computation more deterministic, reducing the number of operations and lowering the latency appropriately.

Pebbles

In addition to the changes above, we also performed a number of optimizations throughout our code base.

Code Refactoring. In 6.5 and before, admission control decisions were made by multiple independent subsystems within vCenter (for example, DRS would be responsible for some decisions, and HA would make others). In 6.7, we simplified our code such that all admission control decisions are handled by a module within DRS. Reducing multiple copies of this code simplifies debugging and reduces resource usage.

Finer-grained locks. In 6.7, we continued to make strides in reducing the scope of our locks. We introduced finer-grained locks so that DRS would not have to lock an entire VM to examine certain pieces of state. We made similar improvements to both hosts and clusters.

Removal of unnecessary classes, maps, and sets. In refactoring our code, we were able to remove a number of classes and thereby reduce the number of copies of data in our system. The maps and sets that were needed to support these classes could also be removed.

Preferring integers over strings. In many situations, we replaced strings and string comparisons with integers and integer comparisons. This dramatically reduces memory allocation overhead and speeds up comparisons, reducing CPU.

Benchmark Details

We measure Operations Per Second (OPS) using a VMware benchmark called vcbench. (For more details about vcbench, see “Benchmarking Details” in vCenter 6.5 Performance: what does 6x mean?) Briefly, vcbench is a java-based application that takes as an input a runlist, which is a list of management operations to perform. These operations include powering on VMs, cloning VMs, reconfiguring VMs, and migrating VMs, among many other operations. We chose the operations based on an analysis of typical customer management scenarios. The vcbench application uses vSphere APIs to issue these operations to vCenter. The benchmark creates multiple threads and issues operations on those threads in parallel (up to 32). We measure the operations per second by taking the total number of operations performed in a given timeframe (say, 1 hour), and dividing it by the time interval.

Our workflow benchmark is very similar to vcbench. The main difference is that more operations are issued per host at a time. In vcbench, one operation is issued per host at a time, while in workflow, up to 8 operations are issued per host at a time.

In many cases, the size of the VM has an impact on operational latency. For example, a VM with a lot of memory (say, 32 GB) and large disks (say, 100 GB) will take longer to clone because more memory will need to be snapshotted, and more disk data will need to be copied. To minimize the impact of the disk subsystem in our measurements, we use small VMs (<4GB memory, < 8GB disk).

Because we limit ourselves to 32 threads per vCenter in this single-cluster setup, throughput numbers are smaller than for our datacenter-at-scale setups (2,000 hosts; 25,000 VMs), which use up to 256 concurrent threads per vCenter.

Summary

In this blog, we have described some of the performance improvements in vCenter from 6.5 to 6.7. A variety of improvements to DRS have led to improved throughput and reduced resource usage for our vcbench workload in a cluster scale setup of 64 hosts and 8,000 powered-on VMs. Many of these changes also apply to larger datacenter-scale setups, although the scope of improvement may not be as pronounced.

Acknowledgments

The vCenter improvements described in this blog are the results of thousands of person-hours from vCenter developers, performance engineers, and others throughout VMware. We are deeply grateful to them for making this happen.

Authors

Zhelong Pan is a senior staff engineer in the Distributed Resource Management Team at VMware. He works on cluster management, including shared resource allocation, VM placement, and load balancing. He is interested in performance optimizations, including virtualization performance and management software performance. He has been at VMware since 2006.

Ravi Soundararajan is a principal engineer in the Performance Group at VMware. He works on vCenter performance and scalability, from the UI to the server to the database to the hypervisor management agents. He has been at VMware since 2003, and he has presented on the topic of vCenter Performance at VMworld from 2013-2017. His Twitter handle is @vCenterPerfGuy.

SQL Server Performance of VMware Cloud on AWS

In the past, I’ve always benchmarked performance of SQL Server VMs on vSphere with “on-premises” infrastructure.  Given the skyrocketing interest in the cloud, I was very excited to get my hands on VMware Cloud on AWS – just in time for Amazon’s AWS Summit!

A key question our customers have is: how well do applications (like SQL Server) perform in our cloud?  Well, I’m happy to report that the answer is great!

VMware Cloud on AWS Environment

First, here is a screenshot of what my vSphere-powered Software-Defined Data Center (SDDC) looks like:vSphere Client - VMware Cloud on AWSThis screenshot shows several notable items:

  • The HTML5-based vSphere Client interface should be very familiar to vSphere administrators, making the move to the cloud extremely easy
  • This SDDC instance was auto-provisioned with 4 ESXi hosts and 2TB of memory, all of which were pre-configured with vSAN storage and NSX networking.
    • Each host is configured with two CPUs (Intel Xeon Processor E5-2686 v4); each socket contains 18 cores running at 2.3GHz, resulting in 144 physical cores in the cluster. For more information, see the VMware Cloud on AWS Technical Overview
  • Virtual machines are provisioned within the customer workload resource pool, and vSphere DRS automatically handles balancing the VMs across the compute cluster.

Benchmark Methodology

To measure SQL Server database performance, I used HammerDB, an open-source database load testing and benchmarking tool.  It implements a TPC-C like workload, and reports throughput in TPM (Transactions Per Minute).

To measure how well performance scaled in this cloud, I started with a single 8 vCPU, 32GB RAM VM for the SQL Server database.  To drive the workload, I created a 4 vCPU, 4GB RAM HammerDB driver VM.  I then cloned these VMs to measure 2 database VMs being driven simultaneously:HammerDB and SQL Server VMs in VMware Cloud on AWS

I then doubled the number of VMs again to 4, 8, and finally 16.  As with any benchmark, these VMs were completely driven up to saturation (100% load) – “pedal to the metal”!

Results

So, how did the results look?  Well, here is a graph of each VM count and the resulting database performance:

As you can see, database performance scaled great; when running 16 8-vCPU VMs, VMware Cloud on AWS was able to sustain 6.7 million database TPM!

I’ll be detailing these benchmarks more in an upcoming whitepaper, but wanted to share these results right away.  If you have any questions or feedback, please leave me a comment!

UPDATE (07/25/2018): The whitepaper detailing these results is now available here.