Home > Blogs > VMware VROOM! Blog > Category Archives: Web/Tech

Category Archives: Web/Tech

Weathervane, a benchmarking tool for virtualized infrastructure and the cloud, is now open source.

Weathervane is a performance benchmarking tool developed at VMware.  It lets you assess the performance of your virtualized or cloud environment by driving a load against a realistic application and capturing relevant performance metrics.  You might use it to compare the performance characteristics of two different environments, or to understand the performance impact of some change in an existing environment.

Weathervane is very flexible, allowing you to configure almost every aspect of a test, and yet is easy to use thanks to tools that help prepare your test environment and a powerful run harness that automates almost every aspect of your performance tests.  You can typically go from a fresh start to running performance tests with a large multi-tier application in a single day.

Weathervane supports a number of advanced capabilities, such as deploying multiple independent application instances, deploying application services in containers, driving variable loads, and allowing run-time configuration changes for measuring elasticity-related performance metrics.

Weathervane has been used extensively within VMware, and is now open source and available on GitHub at https://github.com/vmware/weathervane.

The rest of this blog gives an overview of the primary features of Weathervane.

Weathervane Overview

Weathervane is an application-level benchmarking tool.  It allows you to place a controlled load on a computing environment by deploying a realistic application, and then simulating users interacting with the application. In the case of Weathervane, the application is a scalable Java web application which implements a real-time auction web site, and the environment can be anything from a single bare-metal server to a large virtualized cluster of servers or a public or private cloud.   You can use collected metrics to evaluate the performance of the environment or to investigate the effect of changes in the environment.  For example, you could use Weathervane to compare the performance of different cloud environments, or to evaluate the impact of changing storage technologies on application-level performance.

Weathervane consists of three main components and a number of supporting tools:

  • The Auction application to be deployed on the environment under test.
  • A workload driver that can drive a realistic and repeatable load against the application.
  • A run harness which automates the process of executing runs and collecting results, logs, and relevant performance data.
  • Supporting tools include scripts to set up an operating system instance with all of the software needed to run Weathervane, to create Docker images for the application services, and to load and prepare the application data needed for a run of the benchmark.

Figure 1 shows the logical layout of the main components of a Weathervane deployment. Additional background about the components of Weathervane can be found at https://blogs.vmware.com/performance/2015/03/introducing-weathervane-benchmark.html.

Figure 1 Weathervane Deployment


The design goal for Weathervane has been to provide flexibility so that you can adapt it to suit the needs a wide range of performance evaluation tasks. You can customize almost every aspect of a Weathervane deployment. You can vary …

  • … the number of instances in each service tier. For example, there can be any number of load balancer, application server, or web server nodes.  This allows you to create very small or very large configurations as needed.
  • … the number of tiers. For example, it is possible to omit the load-balancer and/or web server tiers if only a small application deployment is needed.
  • … the implementation to be used for each service. For example, Weathervane currently supports PostgreSQL and MySQL as the transactional database, and Apache Httpd and Nginx as the web server.  Because Weathervane is open source, you can add additional implementations of a service type if desired.
  • … the tuning and configuration of the services. The most common performance and configuration tunings for each service tier can be set in the Weathervane configuration file.  The run harness will then apply the tunings to each service instance automatically when you perform a run of the benchmark.

The Weathervane run harness makes it easy for you to take advantage of this flexibility by managing the complexity involved in configuring and starting the application, running the workload, and collecting performance results.  In many cases, you can make complex changes in a Weathervane deployment with just a few simple changes in a configuration file.

Weathervane also provides advanced features that allow you to evaluate the performance impact of many important issues in large virtualized and cloud environments.

  • You can run the service instances of the Auction application directly on the OS, either in a virtual machine or on a bare-metal server, or deploy them in Docker containers. Weathervane comes with scripts to create Docker images for all of the application services.
  • You can run and drive load against multiple independent instances of the Auction application. This is useful if you want to investigate the interactions between multiple independent applications, or when it is necessary to drive loads larger than can be handled by a single application instance.  The configuration and load for each instance can be specified independently.
  • You can specify a user load for the application instances that varies over the course of run. This allows you to investigate performance issues related to bursty loads, cyclical usage patterns, and, as discussed next, the impact of application elasticity.
  • The Action application supports changing the number of instances of some service tiers at run-time to support application-level elasticity (https://en.wikipedia.org/wiki/Elasticity_(cloud_computing)), and the study of elasticity-related performance metrics. Weathervane currently includes a scheduled elasticity-service, which allows you to specify changes in the application configuration over time.  When used in combination with the time-varying loads, this enables the investigation of some elasticity-related performance issues.  Future implementations of the elasticity service will use real-time monitoring to make decisions about configuration changes.

Over the coming months, we will be publishing additional posts demonstrating the use of these features.

Future Direction

We intend to continue to grow and improve Weathervane’s applicability and ease of use.  We also plan to focus on improving its usefulness as a platform for examining the meaning of performance evaluation in the cloud.  As an open source project, we invite the wider performance community to not only use it, but to participate in extending it and shaping its future direction.  Enhancements to Weathervane may include adding new performance metrics, better metric reporting and real-time monitoring, support for additional services such as cloud-vendor specific databases and load balancers, and even adding additional applications to be deployed on the environment under test.  Visit the Weathervane GitHub repository at  https://github.com/vmware/weathervane for more information about getting involved.

The Weathervane team would like to thank the VMware Open-Source Program Office, https://blogs.vmware.com/opensource/, and VMware’s commitment to open-source software, for helping to make this release possible.

DRS Cluster Resource Management with Reservation and Shares

Resource management in vSphere is very robust and sophisticated. It provides settings like reservation, limits and shares (a.k.a. RLS), which can be applied on virtual machines (VMs) and resource pools. These settings can be used as knobs to isolate resources, prioritize them, or to guarantee availability of resources to VMs in a cluster. Our customers have often asked about how these settings impact availability of resources for a VM.


Recently we published a white paper based on our study, to explain how reservation and shares can impact resource availability for VMs when they are applied on VMs versus when they are applied on resource pools. In the paper, we also share some general guidelines on when and how these settings can be used.

vCenter 6.5 Performance: what does 6x mean?

At the VMworld 2016 Barcelona keynote, CTO Ray O’Farrell proudly presented the performance improvements in vCenter 6.5. He showed the following slide:


Slide from Ray O’Farrell’s keynote at VMworld 2016 Barcelona, showing 2x improvement in scale from 6.0 to 6.5 and 6x improvement in throughput from 5.5 to 6.5.

As a senior performance engineer who focuses on vCenter, and as one of the presenters of VMworld Session INF8108 (listed in the top-right corner of the slide above), I have received a number of questions regarding the “6x” and “2x scale” labels in the slide above. This blog is an attempt to explain these numbers by describing (at a high level) the performance improvements for vCenter in 6.5. I will focus specifically on the vCenter Appliance in this post.

6x and 2x

Let’s start by explaining the “6x” and the “2x” from the keynote slide.

  1. 6x:  We measure performance in operations per second, where operations include powering on VMs, clones, VMotions, etc. More details are presented below under “Benchmarking Details.” The “6x” refers to a sixfold increase in operations per second from vSphere 5.5 to 6.5:
    1. In 5.5, vCenter was capable of approximately 10 operations per second in our testbed.
    2. In 6.0, vCenter was capable of approximately 30 operations per second in our testbed.
    3. In 6.5 vCenter can now perform more than 60 operations per second in our testbed. With faster hardware, vCenter can achieve over 70 operations per second in our testbed.
  2. 2x: The 2x improvement refers to a change in the supported limits for vCenter. The number of hosts has doubled, and the number of VMs has more than doubled:
    1. The supported limits for a single instance of vCenter 6.0 are 1000 hosts, 15000 registered VMs, and 10000 powered-on VMs.
    2. The supported limits for a single instance of vCenter 6.5 are 2000 hosts, 35000 registered VMs, and 25000 powered-on VMs.

Not only are the supported limits higher in 6.5, but the resources required to support such a limit are dramatically reduced.

What does this mean to you as a customer?

The numbers above represent what we have measured in our labs. Clearly, configurations will vary from customer to customer, and observed improvements will differ. In this section, I will give some examples of what we have observed to illustrate the sorts of gains a customer may experience.

PowerOn VM

Before powering on a VM, DRS must collect some information and determine a host for the VM. In addition, both the vCenter server and the ESX host must exchange some information to confirm that the powerOn has succeed and must record the latest configuration of the VM. By a series of optimizations in DRS related to choosing hosts, and by a large number of code optimizations to reduce CPU usage and reduce critical section time, we have seen improvements of up to 3x for individual powerOns in a DRS cluster. We give an example in the figure below, in which we show the powerOn latency (normalized to the vSphere 6.0 latency, lower is better).


Example powerOn latency for 6.0 vs. 6.5, normalized to 6.0. Lower is better. 6.5 outperforms 6.0. The gains are due primarily to improvements in DRS and general code optimizations.

The benefits are likely to be most prominent in large clusters (i.e., 64 hosts and 8000 VMs in a cluster), although all cluster sizes will benefit from these optimizations.

Clone VM

Prior to cloning a VM, vCenter does a series of calls to check compatibility of the VM on the destination host, and it also validates the input parameters to the clone. The bulk of the latency for a clone is typically in the disk subsystem of the vSphere hosts. For our testing, we use small VMs (as described below) to allow us to focus on the vCenter portion of latency. In our tests, due to efficiency improvements in the compatibility checks and in the validation steps, we see up to 30% improvement in clone latency, as seen in the figure below, which depicts normalized clone latency for one of our tests.

Example clone VM latency for 6.0 vs. 6.5, normalized to 6.0. Lower is better. 6.5 outperforms 6.0. The gains are in part due to code improvements where we determine which VMs can run on which hosts.

These gains will be most pronounced when the inventory is large (several thousand VMs) or when the VMs to be cloned are small (i.e., < 16GB). For larger VMs, the latency to copy the VM over the network and the latency to write the VM to disk will dominate over the vCenter latency.

VMotion VM

For a VMotion of a large VM, the costs of pre-copying memory pages and then transferring dirty pages typically dominates. With small VMs (4GB or less), the costs imposed by vCenter are similar to those in the clone operation: checking compatibility of the VM with the new host, whether it be the datastore, the network, or the processor. In our tests, we see approximately 15% improvement in VMotion latency, as shown here:

Example VMotion latency for 6.0 vs. 6.5, normalized to 6.0. Lower is better. 6.5 is slightly better than 6.0. The gains are due in part to general code optimizations in the vCenter server.

As with clone, the bulk of these improvements is from a large number of code optimizations to improve CPU and memory efficiency in vCenter. Similar to clone, the improvements are most pronounced with large numbers of VMs or when the VMs are less than 4GB.

Reconfigure VM

Our reconfiguration operation changes the memory share settings for a VM. This requires a communication with the vSphere host followed by updates to the vCenter database to store new settings. While there have been improvements along each of these code paths, the overall latency is similar from 6.0 to 6.5, as shown in the figure below.


Example latency for VM reconfigure task for 6.0 vs. 6.5, normalized to 6.0. Lower is better. The performance is approximately the same from 6.0 to 6.5 (the difference is within experimental error).

Note that the slight increase in 6.5 is within the experimental error for our setup, so for this particular test, the reconfigure operation is basically the same from 6.0 to 6.5.

The previous data were for a sampling of operations, but our efficiency improvements should result in speedups for most operations, whether called through the UI or through our APIs.

Resource Usage and Restart Time of vCenter Server

In addition to the sorts of gains shown above, the improvements from 6.0 to 6.5 have also dramatically reduced the resource usage of the vCenter server. These improvements are described in more detail below, and we give one example here. For an environment in our lab consisting of a single vCenter server managing 64 Hosts and 8,000 VMs, the overall vCenter server resource usage dropped from 27GB down to 14GB. The drop is primarily due to removal of inventory service and optimizations in the core vpxd process of vCenter (especially with respect to DRS).

In our labs, the optimizations described below have also reduced the the restart time of vCenter (the time from when the machine hosting vCenter is booted until vCenter can accept API or UI requests). The impact depends on the extensions installed and the amount of data to be loaded at startup by the web client (in the case of accepting UI requests), but we have seen improvements greater than 3x in our labs, and anecdotal evidence from the field suggests larger improvements.

Brief Deep Dive into Improvements

The previous section has shown the types of improvements one might expect over different kinds of operations in vCenter. In this section, we briefly describe some of the code changes that resulted in these improvements.

 “Rocks” and “Pebbles”

The changes from 6.0 to 6.5 can be divided into large, architectural-type changes (so-called “Rocks” because of the large size of the changes) and a large number of smaller optimizations (so-called “Pebbles” because the changes themselves are smaller).


There are three main “Rocks” that have led to performance improvements from 6.0 to 6.5:

  1. Removal of Inventory Service
  2. Significant optimizations to CPU and memory usage in DRS, specifically with respect to snapshotting inventory for compatibility checks and initial placement of VMs upon powerOn.
  3. Change from SLES11 (6.0) to PhotonOS (6.5).

Inventory Service. The Inventory Service was originally added to vCenter in the 5.0 release in order to provide a caching layer for inventory data. Clients to vCenter (like the web client) could then retrieve data from the inventory service instead of going to the vpxd process within vCenter. Second- and Third-party solutions (e.g., vROps or other solutions) could store data in this inventory service so that the web client could easily retrieve such data This inventory service was implemented in Java and was backed by an embedded database. While this approach has some benefits with respect to reducing load to vCenter, the cost of maintaining this cache was far higher than its benefits. In particular, in the largest supported setups of vCenter, the memory cost of this service was nearly 16GB, and could be even larger in customer deployments. Maintaining the embedded database also required significant disk IO (nearly doubling the overall IO in vCenter) and CPU. In 6.5, we have removed this Inventory Service and instead have employed a new design that efficiently retrieves directly from vpxd. With the significant improvements to the vpxd process, this approach is much faster than using the Inventory Service. Moreover, it saves nearly 16GB from our largest setups. Finally, removing Inventory Service also leads to faster restart times for the vCenter server, since the Inventory Service no longer has to synchronize its data with the core vpxd process of vCenter server before vCenter has finished starting up. In our test labs, the restart times (the time from reboot until vCenter can accept client requests) improved by up to 3x, from a few minutes down to around one minute.

DRS. Our performance results had suggested the DRS adds some overhead when computing initial placement and ongoing placement of VMs. When doing this computation, DRS needs to retrieve the current state of the inventory. A significant effort was undertaken in 6.5 to reduce this overhead. The sizes of the snapshots were reduced, and the overhead of taking such a snapshot was dramatically reduced. One additional source of overhead is doing the compatibility checks required to determine if a VM is able to run on a given host. This code was dramatically simplified while still preserving the appropriate load-balancing capabilities of DRS.
The combination of simplifying DRS and removing Inventory Service resulted in significant resource usage reductions. To give a concrete example, in our labs, to support the maximum supported inventory of a 6.0 setup (1000 hosts and 15000 registered VMs) required approximately 27GB, while the same size inventory required only 14GB in 6.5.

PhotonOS. The final “Rock” that I will describe is the change from SLES11 to PhotonOS. PhotonOS uses a much more recent version of the Linux Kernel (4.4 vs. 3.0 for SLES11). With much newer libraries, and with a slimmer set of default modules installed in the base image, PhotonOS has proven to be a more efficient guest OS for the vCenter Server Appliance. In addition to these changes, we have also tuned some settings that have given us performance improvements in our labs (for example, changing some of the default NUMA parameters and ensuring that we are using the pre-emptive kernel).


The “Pebbles” are really an accumulation of thousands of smaller changes that together improve CPU usage, memory usage, and database utilization. Three examples of such “Pebbles” are as follows:

  1. Code optimizations
  2. Locking improvements
  3. Database improvements

Code optimizations. Some of the code optimizations above include low-level optimizations like replacing strings with integers or refactoring code to significantly reduce the number of mallocs and frees. The vast majority of cycles used by the vpxd process are typically spent in malloc or in string manipulations (for example, serializing data responses from hosts). By reducing these overheads, we have significantly reduced the CPU and memory resources used to manage our largest test setups.

Locking improvements. Some of the locking improvements include reducing the length of critical sections and also restructuring code to enable us to remove some coarse-grained locks. For example, we have isolated situations in which an operation may have required consistent state for a cluster, its hosts, and all of its VMs, and reduced the scope so that only VM-level or host-level locks are required. These optimizations require careful reasoning about the code, but ultimately significantly improve concurrency.  An additional set of improvements involved simplifying the locking primitives themselves so that they are faster to acquire and release. These sorts of changes also improve concurrency. Improving concurrency not only improves performance, but it better enables us to take advantage of newer hardware with more cores: without such improvements, software would be a bottleneck, and the extra cores would otherwise be idle.

Database improvements. The vCenter server stores configuration and statistics data in the database. Any changes to the VM, host, or cluster configuration that occur as a result of an operation (for example, powering on a VM) must be persisted to the database. We have made an active effort to reduce the amount of data that must be stored in the database (for example, storing it on the host instead). By reducing this data, we reduce the network traffic between vCenter server and the hosts, because less data is transferred, and we also reduce disk traffic by the database.

A side benefit of using the vCenter server appliance is that the database (Postgres) is embedded in the appliance. As a result, the latency between the vpxd service and the database is minimized, resulting in performance improvements relative to using a remote database (as is typically used in vCenter Windows installations). This improvement can be 10% or more in environments with lots of operations being performed.

Benchmarking Details

Our benchmark results are based on our vcbench workload generator. A more complete description of vcbench is given in VMware vCenter Server Performance and Best Practices, but briefly, vcbench consists of a Java client that sends management operations to vCenter server. Operations include (but are not limited to) powering on VMs, cloning VMs, migrating VMs, VMotioning VMs, reconfiguring VMs, registering VMs, and snapshotting VMs. The Java client opens up tens to hundreds of concurrent sessions to the vCenter server and issues tasks on each of these sessions. A graphical depiction is given in the “VCBench Architecture” slide, above.

The performance of vcbench is typically given in terms of throughput, for example, operations per second. This number represents the number of management operations that vCenter can complete per second. To compute this value, we run vcbench for a specified amount of time (for example, several hours) and then measure how many operations have completed. We then divide by the runtime of the test. For example, 70 operations per second is 4200 operations per minute, or over 25000 operations in an hour. We run anywhere from 32 concurrent sessions to 512 concurrent sessions connected to vCenter.

The throughput measured by vcbench is dependent on the types of operations in the workload mix. We have tried to model our workload mix based on the frequency of operations in customer setups. In such setups, often power operations and provisioning operations (e.g., cloning VMs) are prevalent.

Finally, the throughput measured by vcbench also depends on hardware and other vCenter settings. For example, in our “benchmarking” runs, we run with level 1 statistics. We also do performance measurements with higher statistics levels, but our baseline measurements use level 1. In addition, we use SSDs to ensure the the disk is not a bottleneck, and we also make sure to have sufficient CPU and memory to ensure that they are not resource-constrained. By removing hardware as a constraint, we are able to find and fix bottlenecks in our software. Our benchmarking runs also typically do not have extensions like vROps or NSX connected to vCenter. We do additional runs with these extensions installed so that we can understand their impact and provide guidance to customers, but they are not part of the base performance reports.


vCenter 6.5 can support 2x the inventory size as vCenter 6.0. Moreover, vCenter 6.5 provides dramatically higher throughput than 6.0, and can manage the same environment size with less CPU and memory. In this note, I have tried to give some details regarding the source of these improvements. The gains are due to a number of significant architectural improvements (like removing the Inventory Service caching layer) as well as a great deal of low-level code optimizations (for example, reducing memory allocations and shortening critical sections). I have also provided some details about our benchmarking methodology as well as the hardware and software configuration.


The vCenter improvements described in this blog are the results of thousands of person-hours from vCenter developers, performance engineers, and others throughout VMware. I am deeply grateful to them for making this happen.

SQL Server VM Performance with VMware vSphere 6.5

Achieving optimal SQL Server performance on vSphere has been a constant focus here at VMware; I’ve published past performance studies with vSphere 5.5 and 6.0 which showed excellent performance up to the maximum VM size supported at the time.

Since then, there have been quite a few changes!  While this study uses a similar test methodology, it features an updated hypervisor (vSphere 6.5), database engine (SQL Server 2016), OLTP benchmark (DVD Store 3), and CPUs (Intel Xeon v4 processors with 24 cores per socket, codenamed Broadwell-EX).

The new tests show large SQL Server databases continue to run extremely efficiently, achieving great performance on vSphere 6.5. Following our best practices was all that was necessary to achieve this scalability – which reminds me, don’t forget to check out Niran’s new SQL Server on vSphere best practices guide, which was also just updated.

In addition to performance, power consumption was measured on each ESXi host. This allowed for a comparison of Host Power Management (HPM) policies within vSphere, performance per watt of each host, and power draw under stress versus idle:

Generational SQL Server DB Host Power and Performance/watt

Generational SQL Server DB Host Power and Performance/watt

Additionally, this new study compares a virtual file-based disk (VMDK) on VMware’s native Virtual Machine File System (VMFS 5) to a physical Raw Device Mapping (RDM). I added this test for two reasons: first, it has been several years since they have been compared; and second, customer feedback from VMworld sessions indicates this is still a debate that comes up in IT shops, particularly with regard to deploying database workloads such as SQL Server and Oracle.

For more details and the test results, download the paper: Performance Characterization of Microsoft SQL Server on VMware vSphere 6.5

Performance of Storage I/O Control (SIOC) with SSD Datastores – vSphere 6.5

With Storage I/O Control (SIOC), vSphere 6.5 administrators can adjust the storage performance of VMs so that VMs with critical workloads will get the I/Os per second (IOPS) they need. Admins assign shares (the proportion of IOPS allocated to the VM), limits (the upper bound of VM IOPS), and reservations (the lower bound of VM IOPS) to the VMs whose IOPS need to be controlled.  After shares, limits, and reservations have been set, SIOC is automatically triggered to meet the desired policies for the VMs.

A recently published paper shows the performance of SIOC meets expectations and successfully controls the number of IOPS for VM workloads.

Continue reading

Machine Learning on vSphere 6 with Nvidia GPUs – Episode 2

by Hari Sivaraman, Uday Kurkure, and Lan Vu

In a previous blog [1], we looked at how machine learning workloads (MNIST and CIFAR-10) using TensorFlow running in vSphere 6 VMs in an NVIDIA GRID configuration reduced the training time from hours to minutes when compared to the same system running no virtual GPUs.

Here, we extend our study to multiple workloads—3D CAD and machine learning—run at the same time vs. run independently on a same vSphere server.

Performance Impact of Mixed Workloads

Many customers of the NVIDIA GRID vGPU solution on vSphere run 3D CAD workloads. The traditional approach to run 3D CAD and machine learning workloads, typically, is to run the CAD workloads during the day and the machine learning workloads at night, or to have separate infrastructures for each type of workload, but this solution is inflexible and can increase deployment cost. We show, in this section, that this kind of separation is entirely unnecessary. Both workloads can be run concurrently on the same server, and the performance impact on the 3D CAD workload as well as on the machine learning workload is negligible in three out of four vGPU profiles.

vSphere 6.5 supports four vGPU profiles, and the primary difference between them is the amount of VRAM available to each VM:

  • M60-1Q: 1GB VRAM
  • M60-2Q: 2GB VRAM
  • M60-4Q: 4GB VRAM
  • M60-8Q: 8GB VRAM

In this blog, we characterize the performance impact of running 3D CAD and machine learning workloads concurrently using two benchmarks. We chose the SPECapc for 3ds Max 2015 [2] benchmark as a representative for 3D CAD workloads and so did not comply with the benchmark reporting rules, nor do we use or make comparisons to the official SPECapc metrics. We chose MNIST [3] as a representative for machine learning workloads. The performance metric in this comparison is simply the run time for the benchmark.

Our results show that the performance impact on the 3D CAD workload due to sharing the server and GPUs with the machine learning workload is below 5% (in the M60-2Q, M60-4Q, and M60-8Q profiles) when compared to running only the 3D CAD workload on the same hardware. Correspondingly, the performance impact on the machine learning workload when sharing the hardware resources with the 3D CAD workload compared to running all by itself is under 15% in the M60-2Q, M60-4Q, and M60-8Q profiles. In other words, the run time for the 3D CAD benchmark increases by less than 5% when sharing the hardware with the machine learning workload when compared to when it does not share the hardware. The increase in run time for machine learning was under 15% when sharing compared to not sharing the hardware.

Experimental Configuration and Methodology

We installed the 3D CAD benchmark in a 64-bit Windows 7 (SP1) VM with 4 vCPUs and 16GB RAM. The benchmark uses Autodesk 3ds Max 2015 software. The NVIDIA vGPU driver (#369.4) was used in the VM. We configured the vGPU profiles as M60-1Q, M60-2Q, M60-4Q, or M60-8Q for different runs. We used this VM as the golden master from which we made linked clones so that we could run the 3D CAD benchmark at scale with 1, 2, 4, …, 24 VMs running 3D CAD simultaneously. The software configurations used for the 3D CAD workload are shown in Table 2, below.

ESXi 6.5   #4240417
Guest OS CentOS Linux release 7.2.1511 (Core)
CUDA Driver & Runtime 7.5
TensorFlow 0.1

Table 1. Configuration of VM used for machine learning benchmarks

vGPU profile Total # VMs running concurrently in the test
M60-8Q 3
M60-4Q 6
M60-2Q 12
M60-1Q 24

Table 2. Software configuration used to run the 3D CAD benchmark

Our experiments includes three sets of runs. In the first set, we ran only the 3D CAD benchmark for each of the four configurations listed in Table 2, above and measured the run time, the ESXi CPU utilization, and the GPU utilization. Once this set of runs was completed, we did a second set in which we ran the 3D CAD benchmark concurrently with the machine learning benchmark. To do this, we first installed the MNIST benchmark, CUDA, cuDNN and TensorFlow in a CentOS VM with the configuration shown in Table 1, above.  Since CUDA only works with an M60-8Q profile, we used it for the VM that runs the machine learning benchmark. For the runs in this second set, we used the configurations shown in Table 4 and measured the run time for the 3D CAD benchmark, the run time for MNIST, the total ESXi CPU utilization, and the total GPU utilization. The server configuration used in our experiments is shown in Table 3, below.

Model Dell PowerEdge R730
CPU Intel Xeon Processor E5-2680 v4 @ 2.40GHz
CPU cores 28 CPUs, each @ 2.40GHz
Processor sockets 2
Cores per socket 14
Logical processors 56
Hyperthreading Active
Memory 768GB
Storage Local SSD (1.5TB), storage arrays, local hard disks
GPUs 2x NVIDIA Tesla M60
ESXi 6.5   #4240417
NVIDIA vGPU driver in ESXi host 367.38
NVIDIA vGPU driver inside VM 369.04

Table 3. Server configuration

vGPU profile used for 3D CAD VMs only #VMs running 3D CAD # VMs running MNIST Total # VMs running concurrently in test
M60-8Q 3 1 4
M60-4Q 6 1 7
M60-2Q 12 1 13
M60-1Q 24 1 25

Table 4. Software configuration used to run the mixed workloads (please note that that the machine learning VM can be configured only using the M60-8Q profile; no other profiles are supported)

We did a third set of runs in which we ran only MNIST on the server. Since MNIST runs only in the M60-8Q profile, only one run was done in this third set.


We compared the run times of the first set of runs (3D CAD benchmark only) with the ones of the second set (3D CAD and MNIST are run concurrently) as well as computed the percentage increase in the run time for 3D CAD when it shares the server with MNIST compared to when it ran without MNIST. So specifically, say, in the M60-4Q profile,  we computed the percentage increase in run time for 3D+ML compared to 3D only in the M60-4Q profile. We also measured the run time of MNIST running concurrently with 3D CAD with the run time of MNIST running by itself on the server and computed the percentage increase in run time for MNIST. We call this increase in run time the performance drop or change. The computed values are shown in Figure 1 below.


Figure 1. Percentage increase in run time for 3D graphics (3D) and machine learning (ML) workloads due to running concurrently compared to running in isolation

From the figure we can see that in the M60-8Q, M60-4Q, and M60-2Q profiles, the run times for 3D CAD when it shares the server with machine learning compared to when 3D CAD runs by itself is less than 5%. For the MNIST machine learning workload, the performance penalty due to sharing compared to no sharing is under 15% in the M60-2Q, M60-4Q, and M60-8Q profiles. Only the M60-1Q profile that can support up to 24 VMs running 3D CAD and one VM running MNIST show any significant performance penalty due to sharing. Now, if the workloads were run sequentially, the total time to complete the tasks would be the sum of the run time for 3D CAD and the machine learning workloads.

A comparison of the total run time for ML and 3D CAD workloads is shown in Figure 2. From the figure, we can see that the total time to completion of the workloads is always less when run concurrently as opposed to when run sequentially.


⇑ Figure 2. It takes a longer time to sequentially run a 3D plus ML (machine learning) mixed workload when compared to the time to run a concurrent mixed workload. The original time was in seconds, but we have normalized the concurrent time to 1 so that the change in sequential time stands out.



Figure 3. CPU utilization on server for mixed workload configuration (3D+ML) and for 3D graphics only (3D)

Further, running the workloads concurrently results in higher server utilization, which could result in higher revenues for a cloud service provider. The M60-1Q profile does show a higher time to complete when workloads are run concurrently when compared to being run sequentially, but it does achieve very high consolidation (measured as number of VMs per core) and high server utilization. So, if in the M60-1Q profile, the higher time to complete the workload run can be tolerated, the configuration that runs the workloads concurrently would achieve higher revenues for a cloud service provider because of higher server utilization. The CPU utilization on the server for the M60-8Q, M60-4Q, M60-2Q, and M60-1Q profiles with only 3D CAD (3D) and with 3D CAD plus machine learning (3D+ML) are shown in Figure 3, above.


  • Simultaneously running 3D CAD and machine learning workloads reduces the total time to complete the runs with the M60-2Q, M60-4Q, and M60-8Q profiles compared to running the workloads sequentially. This is a radical departure from traditional approaches to scheduling machine learning and 3D CAD workloads.
  • Running 3D graphics and machine learning workloads concurrently increases server utilization, which could result in higher revenues for a cloud service provider.


[1] Machine Learning on VMware vSphere 6 with NVIDIA GPUs

[2] The MNIST Database of Handwritten Digits

[3] SPECapc for 3ds Max 2015



New Fling released – IOInsight

By Sankaran Sivathanu

VMware IOInsight is a tool to help people understand a VM’s storage I/O behavior. By understanding their VM’s I/O characteristics, customers can make better decisions about storage capacity planning and performance tuning. IOInsight ships as a virtual appliance that can be deployed in any vSphere environment and includes an intuitive web-based UI that allows users to choose VMDKs to monitor and view results.

Where does IOInsight help?

  • Customers may better tune and size their storage.
  • When contacting VMware Support for any vSphere storage issues, including a report from IOInsight can help VMware Support better understand the issues and can potentially lead to faster resolutions.
  • VMware Engineering can optimize products with a better understanding of various customers’ application behavior.

IOInsight captures I/O traces from ESXi and generates various aggregated metrics that represent the I/O behavior. The IOInsight report contains only these aggregated metrics and there is no sensitive information about the application itself. In addition to the built-in metrics computed by IOInsight, users can also write new analyzer plugins to IOInsight and visualize the results. A comprehensive SDK and development guide is included in the download bundle.

The fling works with vSphere 5.5 or above and can be downloaded at https://labs.vmware.com/flings/ioinsight.

vCenter Server 6.5 High Availability Performance and Best Practices

High availability (aka HA) services are important in any platform, and VMware vCenter Server® is no exception. As the main administrative and management tool of vSphere, it is a critical element that requires HA. vCenter Server HA (aka VCHA) delivers protection against software and hardware failures with excellent performance for common customer scenarios, as shown in this paper.

Much work has gone into the high availability feature of VMware vCenter Server® 6.5 to ensure that this service and its operations minimally affect the performance of your vCenter Server and vSphere hosts. We thoroughly tested VCHA with a benchmark that simulates common vCenter Server activities in both regular and worst case scenarios. The result is solid data and a comprehensive performance characterization in terms of:

  • Performance of VCHA failover/recovery time objective (RTO): In case of a failure, vCenter Server HA (VCHA) provides failover/RTO such that users can continue with their work in less than 2 minutes through API clients and less than 4 minutes through UI clients. While failover/RTO depends on the vCenter Server configuration and the inventory size, in our tests it is within the target limit, which is 5 minutes.
  • Performance of enabling VCHA: We observed that enabling VCHA would take around 4 – 9 minutes depending on the vCenter Server configuration and the inventory size.
  • VCHA overhead: When VCHA is enabled, there is no significant impact for vCenter Server under typical load conditions. We observed a noticeable but small impact of VCHA when the vCenter Server was under extreme load; however, it is unlikely for customers to generate that much load on the vCenter Server for extended time periods.
  • Performance impact of vCenter Server statistics level: With an increasing statistics level, vCenter Server produces less throughput, as expected. When VCHA is enabled for various statistics levels, we observe a noticeable but small impact of 3% to 9% on throughput.
  • Performance impact of a private network: VCHA is designed to support LAN networks with up to 10 ms latency between VCHA nodes. However, this comes with a performance penalty. We study the performance impact of the private network in detail and provide further guidelines about how to configure VCHA for the best performance.
  • External Platform Services Controller (PSC) vs Embedded PSC: We study VCHA performance comparing these two deployment modes and observe a minimal difference between them.

Throughout the paper, our findings show that vCenter Server HA performs well under a variety of circumstances. In addition to the performance study results, the paper describes the VCHA architecture and includes some useful performance best practices for getting the most from VCHA.

For the full paper, see VMware vCenter Server High Availability Performance and Best Practices.

vSphere 6.5 Update Manager Performance and Best Practices

vSphere Update Manager (VUM) is the patch management tool for VMware vSphere 6.5. IT administrators can use VUM to patch and upgrade ESXi hosts, VMware Tools, virtual hardware, and virtual appliances.

In the vSphere 6.5 release, VUM has been integrated into the vCenter Server appliance (VCSA) for the Linux platform. The integration eliminates remote data transfers between VUM and VCSA, and greatly simplifies the VUM deployment process. As a result, certain data-driven tasks achieve a considerable performance improvement over VUM for the Windows platform, as illustrated in the following figure:


To present the new performance characteristics for VUM in vSphere 6.5, a paper has been published. In particular, the paper describes the following topics:

  • VUM server deployment
  • VUM operations including scan host, scan VM, stage host, remediate host, and remediate VM
  • Remediation concurrency
  • Resource consumption
  • Running VUM operations with vCenter Server provisioning operations

The paper also offers a number of performance tips and best practices for using VUM during patch maintenance. For the full details, read vSphere Update Manager Performance and Best Practices.

Whitepaper on vSphere Virtual Machine Encryption Performance

vSphere 6.5 introduces a feature called vSphere VM encryption.  When this feature is enabled for a VM, vSphere protects the VM data by encrypting all its contents.  Encryption is done both for already existing data and for newly written data. Whenever the VM data is read, it is decrypted within ESXi before being served to the VM.  Because of this, vSphere VM encryption can have a performance impact on application I/O and the ESXi host CPU usage.

We have published a whitepaper, VMware vSphere Virtual Machine Encryption Performance, to quantify this performance impact.  We focus on synthetic I/O performance on VMs, as well as VM provisioning operations like clone, snapshot creation, and power on.  From analysis of our experiment results, we see that while VM encryption consumes more CPU resources for encryption and decryption, its impact on I/O performance is minimal when using enterprise-class SSD or VMware vSAN storage.  However, when using ultra-high performance storage like locally attached NVMe drives capable of handling up to 750,000 IOPS, the minor increase in per-I/O latency due to encryption or decryption adds up quickly to have an impact on IOPS.

For more detailed information and data, please refer to the whitepaper