VMware VROOM! Blog

Network Improvements in vSphere 6 Boost Performance for 40G NICs

vSphere 5.5 introduced a Linux-based driver to support 40GbE Mellanox adapters on ESXi. vSphere 6.0 now adds a native driver and Dynamic NetQueue for Mellanox, and these features significantly improve network performance. In addition to the device driver changes, vSphere 6.0 includes improvements to the vmxnet3 virtual NIC (vNIC) that allow a single vNIC to achieve line-rate performance with 40GbE physical NICs. Another performance feature introduced in 6.0 for high-bandwidth NICs is NUMA Aware I/O, which improves performance by colocating highly network-intensive workloads with the device NUMA node. In this blog, we highlight these features and the corresponding benefits achieved.

Test Configuration

We used two identical Dell PowerEdge R720 servers for our tests, each with Intel E5-2667 processors @ 2.90GHz and 64GB of memory. The NICs were Mellanox Technologies MT27500 Family [ConnectX-3] adapters and Intel Corporation 82599EB 10-Gigabit SFI/SFP+ adapters.

In the single VM test, we used 1 RHEL 6 VM with 4 vCPUs on each ESXi host with 4 netperf TCP streams running. We then measured the cumulative throughput for the test.

For the multi-VM test, we configured multiple RHEL VMs with 1 vCPU each and used an identical number of VMs on the receiver side. Each VM used 4 sessions of netperf for driving traffic, and we measured the cumulative throughput across the VMs.

Single vNIC Performance Improvements

To achieve line-rate performance for vmxnet3, we changed the virtual NIC adapter in vSphere 6.0 so that multiple hardware queues can push data to a vNIC simultaneously. This allows vmxnet3 to use multiple hardware queues from the physical NIC more effectively. It not only increases the throughput a single vNIC can achieve, but also improves overall CPU efficiency.

As Figure 1 below shows, one VM with one vNIC on vSphere 6.0 can achieve more than 35Gbps of throughput, compared to 20Gbps on vSphere 5.5 (indicated by the blue bars). The CPU used to receive 1Gbps of traffic, on the other hand, is reduced by 50% (indicated by the red line).


Figure 1. 1VM vmxnet3 Receive throughput
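To put the Figure 1 numbers in relative terms, the short calculation below works through the gain. It is only a sketch that uses the rounded figures quoted above (20Gbps, 35Gbps, and a 50% reduction in CPU per Gbps); the function names are ours, and because 35Gbps is a lower bound, the real throughput gain is at least this large.

```python
def percent_gain(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def traffic_per_cpu_factor(cpu_cost_reduction_pct: float) -> float:
    """How much more traffic the same CPU budget can receive when the
    CPU cost per Gbps drops by the given percentage."""
    return 1.0 / (1.0 - cpu_cost_reduction_pct / 100.0)


# Rounded values quoted in the text for a single vNIC.
throughput_gain = percent_gain(old=20.0, new=35.0)
cpu_factor = traffic_per_cpu_factor(50.0)

print(f"Throughput gain: at least {throughput_gain:.0f}%")
print(f"Traffic per unit of CPU: {cpu_factor:.1f}x")
```

In other words, a single vNIC moves at least 75% more traffic, and each unit of CPU receives about twice as much of it.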

By default, a single vNIC can receive packets from a single hardware queue. To achieve higher throughput, the vNIC has to request more queues. This can be done by setting ethernetX.pnicFeatures = "4" in the .vmx file. This option also requires RSS mode to be turned on for the physical NIC. For Mellanox adapters, the RSS feature can be turned on by reloading the driver with num_rings_per_rss_queue=4.
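The two settings above are easy to mistype, so here is a small sketch that renders them for a given vNIC. The helper names are ours. The esxcli syntax shown is the standard way to set a module parameter, but the module name passed in the example (nmlx4_en) is an assumption about how the native Mellanox driver is named on your host; please confirm it with `esxcli system module list` before using it.

```python
def vmx_pnic_features(vnic_index: int, value: int = 4) -> str:
    """Render the .vmx line that lets a vNIC request multiple queues."""
    return f'ethernet{vnic_index}.pnicFeatures = "{value}"'


def esxcli_rss_command(module: str, rings: int = 4) -> str:
    """Render the esxcli command that sets the RSS parameter on a driver
    module. The module name is supplied by the caller (assumption)."""
    return (
        "esxcli system module parameters set "
        f'-m {module} -p "num_rings_per_rss_queue={rings}"'
    )


# Example: first vNIC of the VM, assumed native Mellanox module name.
print(vmx_pnic_features(0))
print(esxcli_rss_command("nmlx4_en"))
```

The driver still has to be reloaded after the parameter is set, as described above.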

CPU Cost Improvements for Mellanox 40GbE NIC

In addition to scalability improvements for the vmxnet3 adapter, vSphere 6.0 features an improved version of the Mellanox 40GbE NIC driver. The updated driver uses native vSphere 6.0 APIs and, as a result, performs better than the earlier Linux-based driver. Native APIs remove the extra CPU overhead of data structure conversion that was present in the Linux-based driver. The driver also has new features like Dynamic NetQueue, which improves CPU utilization even further. Dynamic NetQueue in vSphere 6.0 intelligently chooses the optimal number of active hardware queues according to the network workload and per-NUMA-node CPU utilization.


Figure 2: Multi VM CPU usage for 40G traffic

As seen in Figure 2 above, the new driver can improve CPU efficiency by up to 22%. For all these test cases, the Mellanox NIC achieved line-rate throughput on both vSphere 6.0 and vSphere 5.5. Please note that for the multi-VM tests, we used 1-vCPU VMs and vmxnet3 used a single queue. The RSS feature on the Mellanox adapter was also turned off.

NUMA Aware I/O

To get the best performance out of 40GbE NICs, it is advisable to place the throughput-intensive workload on the same NUMA node to which the adapter is attached. vSphere 6.0 features a new system-wide configuration option that tries to do this automatically. The configuration packs all kernel networking threads on the NUMA node to which the device is connected. The scheduler then tries to place the VMs that use these networking threads the most on the same NUMA node. By default, the configuration is turned off because it may cause uneven workload distribution between NUMA nodes, especially when all NICs are connected to the same NUMA node.


Figure 3: NUMA I/O benefit.

As seen in Figure 3 above, NUMA I/O can result in about 20% reduced CPU consumption and about 20% higher throughput with a 1-vCPU VM for 40GbE NICs. There is no throughput improvement for Intel NICs because we achieve line rate irrespective of where the workloads are placed. We do however see an increase in CPU efficiency of about 7%.

To enable this option, set the value of Net.NetNetqNumaIOCpuPinThreshold in the Advanced System Settings tab for the host. The value is configurable and can vary between 0 and 200. For example, if you set the value to 100, this results in using NUMA I/O as long as the networking load is less than 100% (that is, the networking threads do not use more than 1 core). Once the load increases to 100%, vSphere 6.0 will follow default scheduling behavior and will schedule VMs and networking threads across different NUMA nodes.
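The threshold rule described above can be sketched as a simple predicate. This is only our reading of the behavior as stated in this post, not the scheduler's actual implementation; the function name and the way load is expressed (percent of one core, so 100 means one full core) are assumptions for illustration.

```python
def numa_io_active(net_load_pct: float, threshold: int) -> bool:
    """Return True while NUMA I/O placement would stay in effect.

    net_load_pct: CPU used by the networking threads, in percent of one
                  core (100 means one full core).
    threshold:    value of the advanced setting, allowed range 0 to 200.
    """
    if not 0 <= threshold <= 200:
        raise ValueError("threshold must be between 0 and 200")
    # NUMA I/O applies only while the load is below the threshold; at or
    # above it, default scheduling across NUMA nodes takes over.
    return net_load_pct < threshold


for load in (60, 100, 150):
    print(load, numa_io_active(load, threshold=100))
```

With a threshold of 100, a load of 60% keeps the networking threads packed on the device's NUMA node, while loads of 100% or more fall back to default scheduling. In this sketch a threshold of 0 never enables NUMA I/O, because no load is below it.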


vSphere 6.0 includes some great new improvements in network performance. In this blog, we show:

  • Vmxnet3 can now achieve near line-rate performance with a 40GbE NIC.
  • Significant performance improvements were made to the Mellanox driver, which is now up to 25% more efficient.
  • vSphere also features a new option to turn on NUMA I/O that could improve application performance by up to 15%.


Scaling Web 2.0 Applications using Docker containers on vSphere 6.0

by Qasim Ali

In a previous VROOM post, we showed that running Redis inside Docker containers on vSphere adds little to no overhead, and we observed sizeable performance improvements when scaling out the application compared to running containers directly on the native hardware. This post analyzes scaling Web 2.0 applications using Docker containers on vSphere and compares the performance of Docker containers running natively and on vSphere. This study shows that Docker containers add negligible overhead when run on vSphere, and also that the performance using virtual machines is very close to native and, in certain cases, slightly better due to better vSphere scheduling and isolation.

Web 2.0 applications are an integral part of enterprise and small business IT offerings. For our study, we use the CloudStone benchmark, which simulates typical Web 2.0 technology use in the workplace [1] [2]. It includes a Web 2.0 social-events application (Olio) and a client implemented using the Faban workload generator [3]. It is an open source benchmark that simulates activities related to social events. The benchmark consists of three main components: a Web server, a database backend, and a client to emulate real-world accesses to the Web server. The overall architecture of CloudStone is depicted in Figure 1.

Figure 1: CloudStone architecture

The benchmark reports latency for various user actions. These metrics were compared against a fixed threshold. Studies indicate that users are less likely to visit a Web site if the response time is greater than 250 milliseconds [4]. This number can be used as an upper bound for latency for frequent operations (HomePage, TagSearch, EventDetail, and Login). For the less frequent operations (AddEvent, AddPerson, and PersonDetail), a less restrictive threshold of 500 milliseconds can be used. Table 1 shows the exact mix/frequency of various operations.
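The two-tier threshold described above can be expressed as a small lookup. This is a sketch for illustration only; the function names are ours, and the benchmark itself does not necessarily apply the check this way.

```python
# Operation groups and limits as described in the text.
FREQUENT_OPS = {"HomePage", "TagSearch", "EventDetail", "Login"}
LESS_FREQUENT_OPS = {"AddEvent", "AddPerson", "PersonDetail"}

FREQUENT_LIMIT_MS = 250
LESS_FREQUENT_LIMIT_MS = 500


def latency_limit_ms(operation: str) -> int:
    """Return the latency upper bound that applies to an operation."""
    if operation in FREQUENT_OPS:
        return FREQUENT_LIMIT_MS
    if operation in LESS_FREQUENT_OPS:
        return LESS_FREQUENT_LIMIT_MS
    raise KeyError(f"unknown operation: {operation}")


def meets_threshold(operation: str, latency_ms: float) -> bool:
    """True when the measured latency does not exceed the limit."""
    return latency_ms <= latency_limit_ms(operation)


print(meets_threshold("Login", 120))     # frequent operation
print(meets_threshold("AddEvent", 420))  # less frequent operation
```

The latency values in the usage lines are made up for the example.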

Operation       Number of Operations    Mix
HomePage        141908                  26.14%
Login           55473                   10.22%
TagSearch       181126                  33.37%
EventDetail     134144                  24.71%
PersonDetail    14393                   2.65%
AddPerson       4662                    0.86%
AddEvent        11087                   2.04%

Table 1: CloudStone operations frequency for 1500 users
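As a quick consistency check, the Mix column in Table 1 can be recomputed from the operation counts. The sketch below uses only the counts from the table; the variable names are ours.

```python
# Operation counts from Table 1 (1500 users).
counts = {
    "HomePage": 141908,
    "Login": 55473,
    "TagSearch": 181126,
    "EventDetail": 134144,
    "PersonDetail": 14393,
    "AddPerson": 4662,
    "AddEvent": 11087,
}

total = sum(counts.values())

# Share of each operation in percent, rounded to two decimals.
mix = {op: round(100.0 * n / total, 2) for op, n in counts.items()}

for op, share in mix.items():
    print(f"{op:<13}{counts[op]:>8}{share:>8.2f}%")
print(f"{'Total':<13}{total:>8}")
```

The recomputed shares match the table, and the four frequent operations account for roughly 94% of all requests.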

Benchmark Components and Experimental Set up

The test system was installed with a CloudStone implementation of a MySQL database, NGINX Web server with PHP scripts, and a Tomcat application server provided by the Faban harness. The default configuration was used for the workload generator. All components of the application ran on a single host, and the client ran in a separate virtual machine on a separate host. Both hosts were connected using a direct link between a pair of 10Gbps NICs. One client-server pair provided a single, independent CloudStone instance. Scaling was achieved by running additional instances of CloudStone.

Deployment Scenarios

We used the following three deployment scenarios for this study:

  • Native-Docker: One or more CloudStone instances were run inside Docker containers (2 containers per CloudStone instance: one for the Web server and another for the database backend) running on the native OS.
  • VM: CloudStone instances were run inside one or more virtual machines running on vSphere 6.0; the guest OS is the same as the native scenario.
  • VM-Docker: CloudStone instances were run inside Docker containers that were running inside one or more virtual machines.

Hardware/Software/Workload Configuration

The following are the details about the hardware and software used in the various experiments discussed in the next section:

Server Host:

  • Dell PowerEdge R820
  • CPU: 4 x Intel® Xeon® CPU E5-4650 @ 2.30GHz (32 cores, 64 hyper-threads)
  • Memory: 512GB
  • Hardware configuration: Hyper-Threading (HT) ON, Turbo-boost ON, Power policy: Static High (that is, no power management)
  • Network: 10Gbps
  • Storage: 7 x 250GB 15K RPM 4Gb SAS  Disks

Client Host:

  • Dell PowerEdge R710
  • CPU: 2 x Intel® Xeon® CPU X5680 @ 3.33GHz (12 cores, 24 hyper-threads)
  • Memory: 144GB
  • Hardware configuration: HT ON, Turbo-boost ON, Power policy: Static High (that is, no power management)
  • Network: 10Gbps
  • Client VM: 2-vCPU 4GB vRAM

Host OS:

  • Ubuntu 14.04.1
  • Kernel 3.13

Docker Configuration:

  • Docker 1.2
  • Ubuntu 14.04.1 base image
  • Host volumes for database and images
  • Configured with host networking to avoid Docker NAT overhead
  • Device mapper as the storage backend driver


Hypervisor:

  • VMware vSphere 6.0 (pre-release build)

VM Configurations:

  • Single VM: An 8-vCPU 4GB VM (Web server and database running in a single VM)
  • Two VMs: One 6-vCPU 2GB Web server VM and one 2-vCPU 2GB database VM (CloudStone instance running in two VMs)
  • Scale-out: Eight 8-vCPU 4GB VMs

Workload Configurations:

  • The NGINX Web server was configured with 4 worker processes and 4096 connections per worker.
  • PHP was configured with a maximum number of 16 child processes.
  • The Web server and the database were preconfigured with 1500 users per CloudStone instance.
  • A runtime of 30 minutes with 5-minute ramp-up and ramp-down periods was used (run-to-run variation was less than 1%).


First, we ran a single instance of CloudStone in the various configurations mentioned above. This was meant to determine the raw overhead of Docker containers running on vSphere vs. the native configuration, eliminating scheduling differences. Second, we picked the configuration that performed best in a single instance and scaled it out to run multiple instances.

Figure 2 shows the mean latency of the most frequent operations and Figure 3 shows the mean latency of less frequent operations.

Figure 2: Results of single instance CloudStone experiments for frequently used operations


Figure 3: Results of single instance CloudStone experiments for less frequently used operations

We configured the benchmark to use a single VM and deployed the Web server and database applications in it (configuration labelled VM-1VM in Figures 2 and 3). We then ran the same workload in Docker containers in a single VM (VMDocker-1VM). The latencies are slightly higher than native, which is expected due to some virtualization overhead. However, running Docker containers on a VM showed no additional overhead. In fact, it seems to be slightly better. We believe this might be due to the device mapper using twice as much page cache as the VM (device mapper uses a loopback device to mount the file system and, hence, data ends up being cached twice in the buffer cache). We also tried AuFS as a storage backend for our container images, but that seemed to add some CPU and latency overhead, and, for this reason, we switched to device mapper.

We then configured the VM to use the vSphere Latency Sensitivity feature [5] (VM-1VM-lat and VMDocker-1VM-lat labels). As expected, this configuration reduced the latencies even further because each vCPU got exclusive access to a core, which reduced scheduling overhead. However, this feature cannot be used when the VM (or VMs) has more vCPUs than the number of cores available on the system (that is, the physical CPUs are over-committed) because each vCPU needs exclusive access to a core.

Next, we configured the workload to use two VMs, one for the Web server and the other for the database application. This configuration ended up giving slightly higher latencies because the network packets have to traverse the virtualization layer from one VM to the other, while in the prior experiments they were confined within the same VM.

Finally, we scaled out the CloudStone workload to 12,000 users by using eight 8-vCPU VMs with 1500 users per instance. The VM configurations were the same as the VM-1VM and VMDocker-1VM cases above. The average system CPU core utilization was around 70-75%, which is typical for latency-sensitive workloads because it leaves headroom to absorb traffic bursts. Figure 4 reports the mean latencies of all operations (latencies were averaged across all eight instances of CloudStone for each operation), while Figure 5 reports the 90th percentile latencies (the benchmark reports these latencies at 20-millisecond granularity, as is evident from Figure 5).
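For readers who want to reproduce this kind of aggregation, the sketch below shows one way to average a per-operation latency across instances and to report a 90th percentile at 20-millisecond granularity. The sample numbers are invented for the example, and both the nearest-rank percentile method and the choice to round up to the next 20 ms bucket are our assumptions; the post does not say how the benchmark buckets its values.

```python
import math

INSTANCES = 8
USERS_PER_INSTANCE = 1500
total_users = INSTANCES * USERS_PER_INSTANCE


def mean_across_instances(per_instance_ms):
    """Average one operation's mean latency over all instances."""
    return sum(per_instance_ms) / len(per_instance_ms)


def p90_bucketed(samples_ms, bucket_ms=20):
    """Nearest-rank 90th percentile, rounded up to the next bucket
    boundary (assumed bucketing rule)."""
    ordered = sorted(samples_ms)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    value = ordered[rank - 1]
    return math.ceil(value / bucket_ms) * bucket_ms


# Hypothetical values, for illustration only.
login_means_ms = [42, 45, 39, 41, 44, 40, 43, 42]
login_samples_ms = [12, 15, 18, 22, 25, 31, 38, 41, 55, 97]

print(total_users)
print(mean_across_instances(login_means_ms))
print(p90_bucketed(login_samples_ms))
```

Bucketing explains why percentile charts of this kind show values only in 20 ms steps.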


Figure 4: Scale-out experiments using eight instances of CloudStone (mean latency)


Figure 5: Scale-out experiments using eight instances of CloudStone (90th percentile latency)

The latencies shown in Figures 4 and 5 are well below the 250 millisecond threshold. We observed that the latencies on vSphere are very close to native or, in certain cases, slightly better than native (for example, the Login, AddPerson, and AddEvent operations). The latencies were better than native due to better vSphere scheduling and isolation, resulting in better cache/memory locality. We verified this by pinning container instances to specific sockets, which made the native scheduler behave similarly to vSphere. After doing that, we observed that latencies in the native case improved and were similar to or slightly better than vSphere.

Note: Introducing artificial affinity between processes and cores is not a recommended practice because it is error-prone and can, in general, lead to unexpected or suboptimal results.


VMs and Docker containers are truly “better together.” The CloudStone scale-out system, using out-of-the-box VM and VM-Docker configurations, clearly achieves very close to, or slightly better than, native performance.


[1] W. Sobel, S. Subramanyam, A. Sucharitakul, J. Nguyen, H. Wong, A. Klepchukov, S. Patil, A. Fox and D. Patterson, “CloudStone: Multi-Platform, Multi-Language Benchmark and Measurement Tools for Web 2.0,” 2008.
[2] N. Grozev, “Automated CloudStone Setup in Ubuntu VMs / Advanced Automated CloudStone Setup in Ubuntu VMs [Part 2],” 2 June 2014. https://nikolaygrozev.wordpress.com/tag/cloudstone/.
[3] java.net, “Faban Harness and Benchmark Framework,” 11 May 2014. http://java.net/projects/faban/.
[4] S. Lohr, “For Impatient Web Users, an Eye Blink Is Just Too Long to Wait,” 29 February 2012. http://www.nytimes.com/2012/03/01/technology/impatient-web-usersflee-slow-loading-sites.html.
[5] J. Heo, “Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5,” 18 September 2013. http://blogs.vmware.com/performance/2013/09/deploying-extremely-latency-sensitive-applications-in-vmware-vsphere-5-5.html.




VMware Horizon 6 with View: Performance Testing and Best Practices

We just published the updated white paper VMware Horizon 6 with View Performance and Best Practices, which describes the performance gains achieved with the latest Horizon 6 enhancements. The paper details the system architecture used for testing the features and recommends best practices for configuring your system.

The white paper talks about the following performance results:

  • RDSH sizing
  • Display protocol performance
  • PCoIP default settings changes
  • VDI characterization on Virtual SAN

Finally, some of the best practices are presented:

  • RDSH virtual machine sizing
  • RDSH session sizing
  • RDSH server virtual machine optimization
  • Guest best practices for bandwidth and storage
  • PCoIP settings
  • 3D graphics settings
  • Virtual SAN configurations

Please check out the VMware Horizon 6 with View Performance and Best Practices paper for detailed descriptions of the tests, the results of those tests, and the best practices.


Virtual SAN 6.0 Performance with VMware VMmark

Virtual SAN is a storage solution that is fully integrated with VMware vSphere. Virtual SAN leverages flash technology to cache data and improve its access time to and from the disks. We used VMware’s VMmark 2.5 benchmark to evaluate the performance of running a variety of tier-1 application workloads together on Virtual SAN 6.0.

VMmark is a multi-host virtualization benchmark that uses varied application workloads and common datacenter operations to model the demands of the datacenter. Each VMmark tile contains a set of virtual machines running diverse application workloads as a unit of load. For more details, see the VMmark 2.5 overview.


Testing Methodology

VMmark 2.5 requires two datastores for its Storage vMotion workload, but Virtual SAN creates only a single datastore. A Red Hat Enterprise Linux 7 virtual machine was created on a separate host to act as an iSCSI target to serve as the secondary datastore. Linux-IO Target (LIO) was used for this.



Systems Under Test            8x Supermicro SuperStorage SSG-2027R-AR24 servers
CPUs (per server)             2x Intel Xeon E5-2670 v2 @ 2.50 GHz
Memory (per server)           256 GiB
Hypervisor                    VMware vSphere 5.5 U2 and vSphere 6.0
Local Storage (per server)    3x 400GB Intel SSDSC2BA40
                              12x 900GB 10,000 RPM WD Xe SAS drives
Benchmarking Software         VMware VMmark 2.5.2


Workload Characteristics

Storage performance is often measured in IOPS, or I/Os per second. Virtual SAN is a storage technology, so it is worthwhile to look at how many IOPS VMmark is generating.  The most disk-intensive workloads within VMmark are DVD Store 2 (also known as DS2), an E-Commerce workload, and the Microsoft Exchange 2007 mail server workload. The graphs below show the I/O profiles for these workloads, which would be identical regardless of storage type.


The DS2 database virtual machine shows a fairly balanced I/O profile of approximately 55% reads and 45% writes.

Microsoft Exchange, on the other hand, has a very write-intensive load, as shown below.


Exchange sees nearly 95% writes, so the main benefit the SSDs provide is to serve as a write buffer.

The remaining application workloads have minimal disk I/Os, but do exert CPU and networking loads on the system.



VMmark measures both the total throughput and the response time of each workload. The application workloads consist of Exchange, Olio (a Java workload that simulates Web 2.0 applications and measures their performance), and DVD Store 2. All workloads are driven at a fixed throughput level. A set of workloads is considered a tile. The load is increased by running multiple tiles. With Virtual SAN 6.0, we could run up to 40 tiles with acceptable quality of service (QoS). Let’s look at how each workload performed as the number of tiles increased.

DVD Store

There are 3 webserver frontends per DVD Store tile in VMmark.  Each webserver is loaded with a different profile.  One is a steady-state workload, which runs at a set request rate throughout the test, while the other two are bursty in nature and run a 3-minute and 4-minute load profile every 5 minutes.  DVD Store throughput, measured in orders per minute, varies depending on the load of the server. The throughput will decrease once the server becomes saturated.


For this configuration, maximum throughput was achieved at 34 tiles, as shown by the graph above.  As the hosts become saturated, the throughput of each DVD Store tile falls, resulting in a total throughput decrease of 4% at 36 tiles. However, the benchmark still passes QoS at 40 tiles.

Olio and Exchange

Unlike DVD Store, the Olio and Exchange workloads operate at a constant throughput regardless of server load, shown in the table below:

Workload    Simulated Users    Load per Tile
Exchange    1000               320-330 Sendmail actions per minute
Olio        400                4500-4600 operations per minute


At 40 tiles, the VMmark clients send over 12,000 mail messages per minute and the Olio webservers serve ~180,000 requests per minute.
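The cluster-wide rates follow directly from the per-tile loads in the table above. The sketch below multiplies the per-tile ranges by the tile count; the names are ours and the inputs are only the values quoted in this post.

```python
TILES = 40

# Per-tile load ranges (low, high) from the table, in actions per minute.
EXCHANGE_PER_TILE = (320, 330)
OLIO_PER_TILE = (4500, 4600)


def cluster_load(per_tile_range, tiles):
    """Scale a per-tile (low, high) range to the whole cluster."""
    low, high = per_tile_range
    return low * tiles, high * tiles


exchange_total = cluster_load(EXCHANGE_PER_TILE, TILES)
olio_total = cluster_load(OLIO_PER_TILE, TILES)

print("Exchange mails/min:", exchange_total)
print("Olio operations/min:", olio_total)
```

This gives 12,800 to 13,200 mail messages and 180,000 to 184,000 Olio operations per minute at 40 tiles, consistent with the rounded figures above.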

As the load increases, the response time of Exchange and Olio increases, which makes them a good demonstration of the end-user experience at various load levels. A response time of over 500 milliseconds is considered to be an unacceptable user experience.


As we saw with DVD Store, performance begins to dramatically change after 34 tiles as the cluster becomes saturated.  This is mostly seen in the Exchange response time.  At 40 tiles, the response time is over 300 milliseconds for the mailserver workload, which is still within the 500 millisecond threshold for a good user experience. Olio has a smaller increase in response time, since it is more processor intensive.  Exchange has a dependence on both CPU and disk performance.

Looking at Virtual SAN performance, we can get a picture of how much I/O is served by the storage at these load levels. We can see that reads average around 2,000 I/Os per second:


The Read Cache hit rate is 98-99% on all the hosts, so most of these reads are being serviced by the SSDs. Write performance is a bit more varied.


We see a range of 5,000-10,000 write IOPS per node due to the write-intensive Exchange workload. Storage is nowhere close to saturation at these load levels. The magnetic disks are not seeing much more than 100 I/Os per second, while the SSDs are seeing about 3,000 – 6,000 I/Os per second. These disks should be able to handle at least 10x this load level. The real bottleneck is in CPU usage.

Looking at the CPU usage of the cluster, we can see that the usage levels out at 36 tiles at about 84% used.  There is still some headroom, which explains why the Olio response times are still very acceptable.


As mentioned above, Exchange performance is dependent on both CPU and storage. The additional CPU requirements that Virtual SAN imposes on disk I/O cause Exchange to be more sensitive to server load.


Performance Improvements in Virtual SAN 6.0 (vs. Virtual SAN 5.5)

The Virtual SAN 6.0 release incorporates many improvements to CPU efficiency, as well as other improvements. This translates to increased performance for VMmark.

VMmark performance increased substantially when we ran the tests with Virtual SAN 6.0 as opposed to Virtual SAN 5.5. The Virtual SAN 5.5 tests failed to pass QoS beyond 30 tiles, meaning that at least one workload failed to meet the application latency requirement.  During the Virtual SAN 5.5 32-tile tests, one or more Exchange clients would report a Sendmail latency of over 500ms, which is determined to be a QoS failure.  Version 6.0 was able to achieve passing QoS at up to 40 tiles.


Not only did Virtual SAN 6.0 support more virtual machines, but the throughput of the workloads increased as well. By comparing the VMmark score (normalized to 20-tile Virtual SAN 5.5 results), we can see the performance improvement of Virtual SAN 6.0.


Virtual SAN 6.0 achieved a performance improvement of 24% while supporting 33% more virtual machines.
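The 33% figure follows from the tile counts: Virtual SAN 5.5 passed QoS at up to 30 tiles and Virtual SAN 6.0 at up to 40, and each tile contains a fixed set of virtual machines. The sketch below shows that arithmetic along with the normalization used for the score chart. The function names are ours, and the raw VMmark scores are not given in this post, so the normalization helper is shown without real inputs.

```python
def percent_more(baseline: float, new: float) -> float:
    """How much larger `new` is than `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def normalized_score(score: float, baseline_score: float) -> float:
    """Express a score relative to a baseline run (for example, the
    20-tile Virtual SAN 5.5 result)."""
    return score / baseline_score


# Highest tile counts that passed QoS, as stated in the text.
vm_capacity_gain = percent_more(baseline=30, new=40)
print(f"{vm_capacity_gain:.0f}% more tiles, and therefore more VMs")
```

A normalized score of 1.0 means the run matched the baseline; values above 1.0 indicate an improvement.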



Using VMmark, we are able to run a variety of workloads to simulate applications in a production environment. We were able to demonstrate that Virtual SAN is capable of achieving good performance running heterogeneous real-world applications. The cluster of 8 hosts presented here shows good performance in VMmark through 40 tiles. This is ~12,000 mail messages per minute sent through Exchange, ~180,000 requests per minute served by the Olio webservers, and over 200,000 orders per minute processed on the DVD Store database. Additionally, we were able to measure substantial performance improvements over Virtual SAN 5.5 using Virtual SAN 6.0.


VMware View Planner 3.5 and Use Cases

by Banit Agrawal and Nachiket Karmarkar

VMware View Planner 3.5 was recently released, introducing a slew of new features along with enhancements in user experience and scalability. In this blog, we present some of these new features and use cases. More details can be found in the View Planner 3.5 white paper.

In addition to retaining all the features available in VMware View Planner 3.0, View Planner 3.5 addresses the following new use cases:

  • Support for VMware Horizon 6  (support of RDSH session and application publishing)
  • Support for Windows 8.1 as desktops
  • Improved user experience
  • Audio-Video sync (AVBench)
  • Drag and Scroll workload (UEBench)
  • Support for Windows 7 as clients

In View Planner 3.5, we augment the capability of View Planner to quantify user experience for user sessions and application remoting provided through remote desktop session hosts (RDSH) as a server farm. Starting with this release, we support Windows 8.1 as a guest OS for desktops and Windows 7 as a guest OS for clients.

New Interactive Workloads

We also introduced two advanced workloads: (1) Audio-Video sync (AVBench) and (2) Drag and Scroll workload (UEBench). AVBench determines audio fidelity in a distributed environment where audio and video streams are not tethered. The “Drag and Scroll” workload determines spatial and temporal variance by emulating user events like mouse click, scroll, and drag.


Fig. 1. Mouse-click and drag (UEBench)

As seen in Figure 1, a mouse event is sent to the desktop and the red and black image is dragged across and down the screen.


Fig. 2. Mouse-click and scroll (UEBench)

Similarly, Figure 2 depicts a mouse event sent to the scroll bar of an image that is scrolled up and down.

Better Run Status Reporting

As part of improving the user experience, the UI can track the current stage the View Planner run is in and notifies the user through a color-coded box. The text inside the box is a clickable link that provides a pop-up giving deeper insight about that particular stage.


Fig. 3. View Planner run status shows the intermediate status of the run

Pre-Check Run Profile for Errors

A “check” button provides users a way to verify the correctness of their run-profile parameters.


Fig. 4. ‘Check’ button in Run & Reports tab in View Planner UI

In the past, users needed to optimize the parent VMs used for deploying clients and desktop pools. View Planner 3.5 has automated these optimizations as part of installing the View Planner agent service. The agent installer also comes with a UI that tracks the current stage the installer is in and highlights the results of various installer stages.

Sample Use Cases

Single Host VDI Scaling

Through this release, we have re-affirmed the use case of View Planner as an ideal tool for platform characterization for VDI scenarios.  On a Cisco UCS C240 server, we started with a small number of desktops running the “standard benchmark profile” and increased them until the Group A and Group B numbers exceeded the threshold. The results below demonstrate the scalability of a single UCS C240 server as a platform for VDI deployments.


Fig. 5. Single server characterization with hosted desktops for CISCO UCS C240
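The scaling procedure described above (start small, add desktops, stop when either group exceeds its threshold) can be sketched as a simple loop. This is not View Planner code. The function names, the fake measurement function, and the 1000 ms and 6000 ms limits used in the example are all placeholders we chose for illustration; in a real study the measurement would come from an actual View Planner run and the limits from your QoS criteria.

```python
from typing import Callable, Tuple


def max_passing_count(
    measure: Callable[[int], Tuple[int, int]],
    group_a_limit_ms: int,
    group_b_limit_ms: int,
    start: int,
    step: int,
    ceiling: int,
) -> int:
    """Increase the desktop (or session) count until Group A or Group B
    response time exceeds its limit; return the last passing count."""
    best = 0
    count = start
    while count <= ceiling:
        group_a_ms, group_b_ms = measure(count)
        if group_a_ms > group_a_limit_ms or group_b_ms > group_b_limit_ms:
            break
        best = count
        count += step
    return best


def fake_measure(count: int) -> Tuple[int, int]:
    """Hypothetical stand-in for a benchmark run: response times grow
    linearly with the number of desktops."""
    return 400 + 5 * count, 3000 + 20 * count


best = max_passing_count(fake_measure, 1000, 6000, start=20, step=20, ceiling=400)
print("Largest passing count:", best)
```

The same loop applies to the RDSH case by counting user sessions instead of desktops.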

Single Host RDSH Scaling

We followed the best practices prescribed in the VMware Horizon 6 RDSH Performance & Best Practices whitepaper and set up a number of remote desktop session (RDS) servers that would fully consolidate a single UCS C240 vSphere server. We started with a small number of user sessions per core and then increased them until the Group A and Group B numbers exceeded the threshold level. The results below demonstrate how View Planner can accurately gauge the scalability of a platform (Cisco UCS in this case) when deployed in an RDS scenario.


Fig. 6. Single server characterization with RDS sessions for CISCO UCS C240

Storage Characterization

View Planner can also be used to characterize storage array performance. The ability of View Planner 3.5 to drive a workload on thousands of virtual desktops and process the results afterward makes it an ideal candidate to validate storage array performance at scale. The results below demonstrate the scalability of VDI desktops on a Pure Storage FA-420 all-flash array. View Planner 3.5 could easily scale to 3000 desktops, as highlighted in the results below.


Fig. 7. 3000 Desktops QoS results on Pure Storage FA-420 storage array

Custom Applications Scaling

In addition to characterizing platform and storage arrays, the custom app framework can achieve targeted VDI characterization that is application specific. The following results show Visio as an example of a custom app scale study on an RDS deployment with a 4-vCPU, 10GB vRAM Windows 2008 Server.


Fig. 8. Visio operation response times with View Planner 3.5 when scaling up application sessions

Other Use Cases

With a plethora of features, supported guest OSes, and configurations, it is no wonder that View Planner is capable of characterizing multiple VMware solutions and offerings that work in tandem with VMware Horizon. View Planner 3.5 can also be used to evaluate the following features, which are described in more detail in the whitepaper:

  • VMware Virtual SAN
  • VMware Horizon Mirage
  • VMware App Volumes

For more details about new features, use cases, test environment, and results, please refer to the View Planner 3.5 white paper.

SQL Server VM Performance on VMware vSphere 6

Last October, I blogged about SQL Server performance with vSphere 5.5 using a four-socket Intel Xeon processor E7 based host.  Now that vSphere 6 is available, I’ve run an updated set of tests using this new release, on an even more powerful host, with Xeon E7 v2 processors.  A variety of virtual CPU (vCPU) and virtual machine (VM) quantities were tested to show that vSphere can handle hundreds of thousands of online transaction processing (OLTP) database operations per minute.

DVD Store 2.1, an open-source OLTP database stress tool, was the workload used to stress the VMs.  The first experiment in the paper was a generational performance comparison between the old and new setups; as you can see, there is a dramatic increase in throughput, even though the size of each VM has doubled from 8 vCPUs per VM to 16:

Generational performance improvement from old study to new study

There are also tests that use CPU affinity to show the performance differences between physical cores and logical processors (Hyper-Threads), the benefit of “right-sizing” virtual machines, and the impact of the advanced Latency Sensitivity setting.

For more details and the test results, please download the whitepaper: Performance Characterization of Microsoft SQL Server on VMware vSphere 6.

Introducing the Zephyr Benchmark

The ways in which we use, design, deploy, and evaluate the performance of large-scale web applications have changed significantly in recent years.  These changes have been driven by the increase in computing capacity and flexibility provided by virtualized and cloud-based computing infrastructures. The majority of these changes are not reflected in current web-application benchmarks.

Zephyr is a new web-application benchmark we have been developing as part of our work on optimizing the performance of VMware products for the next generation of cloud-scale applications. The goal of the Zephyr project has been to develop an application-level benchmark that captures the key characteristics of the workloads, design paradigms, deployment architectures, and performance metrics of the next generation of large-scale web applications. As we approach the initial release of Zephyr, we are starting to use it to understand performance across our product range.  In this post, we will give an overview of Zephyr that will provide context for the performance results that we will be writing about over the coming months.

Zephyr Motivation

There have been many changes in usage patterns and development practices for large-scale web applications.  The design and development of Zephyr has been driven by the goal of capturing these changes in a highly scalable benchmark that includes these key aspects:

  • The effect of increased user interactivity and rich web interfaces on workload patterns
  • New design patterns for decoupled and asynchronous services
  • The use of multiple data sources for data with varying scalability and consistency requirements
  • Flexible architectures that allow for deployment on a wide range of virtual and cloud-based infrastructures

The effect of increased user interactivity and rich web interfaces is one of the most important of these aspects. In current benchmarks, a user is represented by a single thread operating independently from other users. Contrast that to the way we interact with applications as diverse as social media and stock trading. Many user interactions, such as responding to a status update or selling shares of stock, are in direct response to the actions of other users.  In addition, the current generation of script-rich web interfaces performs many operations asynchronously without any action from, or even awareness by, the user.  Examples include web pages and rich client interfaces that update active friend lists, check for messages, or maintain stock tickers.  This leads to a very different model of user behavior than the traditional single-threaded, click-and-think design used by existing benchmarks.  As a result, one of the key design goals for Zephyr was to develop both a benchmark application and a workload generator that would allow us to capture the effect of these new workload patterns.

Zephyr Overview

An application-level benchmark typically consists of two main parts: the benchmark application and the workload driver.  The application is selected and designed to represent characteristics and technology choices that are typical of a certain class of applications.  The workload driver interacts with the benchmark application to simulate the behavior of typical users of the application.   It also captures the performance metrics that are used to quantify the performance of the application/infrastructure combination. Some benchmarks, including Zephyr, also provide a run harness that assists in the set-up and automation of benchmark runs.

Zephyr’s benchmark application is LiveAuction, which is a web application for managing and hosting real-time auctions. An auction hosted by LiveAuction consists of a number of items that will be placed up for bid in a set order.  Users are given only a limited time to bid before an item is sold and the next item is placed up for bid.  When an item is up for bid, all users attending the auction are presented with a description and image of the item.  Users see and respond to bids placed by other users. LiveAuction can support thousands of simultaneous auctions with large numbers of active users, with each user possibly attending multiple, simultaneous auctions.   The figure below shows the browser application used to interact with the LiveAuction application.  This figure shows the bidding screen for a user who is attending two auctions.  The current item, bid, and bid status for each auction are updated in real-time in response to bids placed by other users.

Figure 1. LiveAuction bidding screen

In addition to managing live auctions, LiveAuction provides auction and item search, profile management, historical data queries, image management, auction management, and other services that would be required by a user of the application.

LiveAuction uses a scalable architecture that allows deployments to be easily sized for a large range of user loads.  A full deployment of LiveAuction includes a wide variety of support services, such as load-balancing, caching, and messaging servers, as well as relational, NoSQL, and filesystem-based data stores supporting scalability for data with a variety of consistency requirements.  The figure below shows a full deployment of LiveAuction and the Zephyr workload driver.

Figure 2. Logical layout for full Zephyr deployment

The following is a brief description of the role played by each tier.

Infrastructure Services

TCP Load Balancers: The simulated users on the workload driver address the application through a set of IP addresses mapped to the application’s external hostname.  The TCP load balancers jointly manage these IP addresses to ensure that all IP addresses remain available in the event of a failure. The TCP load balancers distribute the load across the web servers while maintaining SSL/TLS session affinity.

Messaging Servers: The application nodes use the messaging backbone to distribute work and state-change information regarding active auctions.

Application Services

Web Servers: The web servers terminate SSL, serve static content, act as load-balancing reverse proxies for the application servers, and provide a proxy cache for application content, such as images returned by the application servers.

Application Servers: The application servers run Java servlet containers in which the application services are deployed.  The LiveAuction application services use a stateless implementation with a RESTful interface that simplifies scaling.

Data Services

Relational Database: The relational database is used for all data that is involved in transactions.  This includes user account information, as well as auction, item, and high-bid data.

NoSQL Data Server:  The NoSQL Document Store is used to store image metadata as well as activity data such as auction attendance information and bid records. It can also be used to store uploaded images. Using the NoSQL store as an image store allows the application to take advantage of its sharding capabilities to easily scale the I/O capacity for image storage.

File Server: The file server is used exclusively to store item images uploaded by users.  Note that the file server is optional, as the images can be stored and served from the NoSQL document store.

Zephyr currently includes configuration support for deploying LiveAuction using the following services:

  • Virtual IP Address Management: Keepalived
  • TCP Load Balancer: HAProxy
  • Web Server: Apache Httpd and Nginx
  • Application Server:  Apache Tomcat with EHcache for in-memory caching
  • Messaging Server: RabbitMQ
  • Relational Database: MySQL and PostgreSQL
  • NoSQL Data Store: MongoDB
  • Network Filesystem: NFS

Additional implementations will be supported in future releases.

Zephyr can be deployed with different subsets of the infrastructure and application services.  For example, the figure below shows a minimal deployment of Zephyr with a single application server and the supporting data services.  In this configuration, the application server performs the tasks handled by the web server in a larger deployment.

Figure 3. Logical layout for a minimal Zephyr deployment

The Zephyr workload driver has been developed to drive HTTP-based loads for modern scalable web applications.  It can simulate workloads for applications that incorporate asynchronous behaviors using embedded JavaScript, and those requiring complex data-driven behaviors, as in web applications with significant inter-user interaction.  The Zephyr workload driver uses an asynchronous design with a small number of threads supporting a large number of simulated users. Simulated users may have multiple active asynchronous activities which share state information, and complex workload patterns can be specified with control-flow decisions made based on retrieved state and operation history. These features allow us to efficiently simulate workloads that would be presented to web applications by rich web clients using asynchronous JavaScript operations.
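To make the asynchronous design more concrete, here is a minimal sketch in Python using asyncio. It is not the Zephyr driver's code: the names `simulated_user`, `run_workload`, `auction_state`, and `ops_log` are hypothetical, and `asyncio.sleep(0)` stands in for a non-blocking HTTP exchange. It only illustrates the idea of one thread serving many simulated users, each with several activities that share state.

```python
import asyncio
import random


async def simulated_user(user_id, auction_state, ops_log):
    """One simulated user with two concurrent activities sharing state."""
    user_state = {"last_seen_bid": 0}

    async def poll_bids():
        # Stands in for an embedded script that refreshes the current bid.
        for _ in range(3):
            await asyncio.sleep(0)  # placeholder for a non-blocking HTTP call
            user_state["last_seen_bid"] = auction_state["high_bid"]
            ops_log.append((user_id, "poll"))

    async def maybe_bid():
        # Control flow depends on retrieved state: outbid the highest known bid.
        for _ in range(3):
            await asyncio.sleep(0)
            if random.random() < 0.5:
                known = max(auction_state["high_bid"], user_state["last_seen_bid"])
                auction_state["high_bid"] = known + 1
                ops_log.append((user_id, "bid"))

    await asyncio.gather(poll_bids(), maybe_bid())


async def run_workload(num_users):
    """Run all simulated users on a single event-loop thread."""
    auction_state = {"high_bid": 0}
    ops_log = []
    users = (simulated_user(u, auction_state, ops_log) for u in range(num_users))
    await asyncio.gather(*users)
    return auction_state, ops_log


if __name__ == "__main__":
    state, log = asyncio.run(run_workload(1000))
    print(len(log), "operations; final high bid:", state["high_bid"])
```

Because every user is a coroutine rather than a thread, the number of simulated users can grow far beyond the number of threads, which is the property the paragraph above describes.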

The Zephyr workload driver also monitors quality-of-service (QoS) metrics for both the LiveAuction application and the overall workload. The application-level QoS requirements are based on the 99th-percentile response times for the individual operations.  An operation represents a single action performed by a user or embedded script, and may consist of multiple HTTP exchanges.  The workload-level QoS requirements define the required mix of operations that must be performed by the users during the workload’s steady state.  This mix must be consistent from run to run in order for the results to be comparable.  In order for a run of the benchmark to pass, all QoS requirements must be satisfied.
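The pass/fail rule can be sketched roughly as follows. This is an illustration only, not Zephyr's implementation: the function names, the nearest-rank percentile method, and the `mix_tolerance` parameter are all assumptions made for the example.

```python
def percentile_99(samples):
    """Nearest-rank 99th percentile of a non-empty list of response times."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def run_passes(response_times, p99_limits, required_mix, mix_tolerance):
    """Return True only if every QoS requirement is satisfied.

    response_times: {operation: [seconds, ...]} collected during steady state
    p99_limits:     {operation: maximum allowed 99th-percentile seconds}
    required_mix:   {operation: required fraction of all operations}
    mix_tolerance:  allowed absolute deviation from each required fraction
    """
    total = sum(len(times) for times in response_times.values())

    # Application-level QoS: per-operation 99th-percentile response time.
    for op, limit in p99_limits.items():
        if percentile_99(response_times[op]) > limit:
            return False

    # Workload-level QoS: the operation mix must match the required mix.
    for op, fraction in required_mix.items():
        actual = len(response_times.get(op, [])) / total
        if abs(actual - fraction) > mix_tolerance:
            return False

    return True
```

A single failed check fails the whole run, which mirrors the requirement that all QoS requirements be satisfied.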

Zephyr also includes a run harness that automates most of the steps involved in configuring and running the benchmark.  The harness takes as input a configuration file that describes the deployment configuration, the user load, and many service-specific tuning parameters.  The harness is then able to power on virtual machines, configure and start the various software services, deploy the software components of LiveAuction, run the workload, and collect the results, as well as the log, configuration, and statistics files from all of the virtual machines and services.  The harness also manages the tasks involved in loading and preparing the data in the data services before each run.


Scaling to large deployments is a key goal of Zephyr.  Therefore, it will be useful to conclude with some initial scalability data to show how we are doing in achieving that goal. There are many possible ways to scale up a deployment of LiveAuction.  For the sake of providing a straightforward comparison, we will focus on scaling out the number of application server instances in an otherwise fixed deployment configuration.  The CPU utilization of the application server is typically the performance bottleneck in a well-balanced LiveAuction deployment.

The figure below shows the logical layout of the VMs and services in this deployment.  Physically, all VMs reside on the same network subnet on the vSphere hosts, which are connected by a 10Gb Ethernet switch.

Figure 4. Deployment configuration for scaling results

The VMs in the LiveAuction deployment were distributed across three VMware vSphere 6 hosts.  Table 1 gives the hardware details of the hosts.

Host Name | Host Vendor/Model | Processors | Memory
Host1 | Dell PowerEdge R720, 2-Socket Server | Intel® Xeon® CPU E5-2690 @ 2.90GHz, 8 Core, 16 Thread |
Host2 | Dell PowerEdge R720, 2-Socket Server | Intel® Xeon® CPU E5-2690 @ 2.90GHz, 8 Core, 16 Thread |
Host3 | Dell PowerEdge R720, 2-Socket Server | Intel® Xeon® CPU E5-2680 @ 2.70GHz, 8 Core, 16 Thread |

Table 1. vSphere 6 hosts for LiveAuction deployment

Table 2 shows the configuration of the VMs, and their assignment to vSphere hosts.  As the goal of these tests was to examine the scalability of the LiveAuction application, and not the characteristics of vSphere 6, we chose the VM sizing and assignment in part to avoid using more virtual CPUs than physical cores. While we did some tuning of the overall configuration, we did not necessarily obtain the optimal tuning for each of the service configurations.  The configuration was chosen so that the application server was the bottleneck as far as possible within the restrictions of the available physical servers.  In future posts, we will examine the tuning of the individual services, tradeoffs in deployment configurations, and best practices for deploying LiveAuction-like applications on vSphere.

Service | Host | VM vCPUs (each) | VM Memory
HAProxy 1 | Host1 | 2 | 8GB
HAProxy 2 | Host2 | 2 | 8GB
HAProxy 3 | Host3 | 2 | 8GB
Nginx 1, 2, and 3 | Host3 | 2 | 8GB
RabbitMQ 1 | Host2 | 1 | 2GB
RabbitMQ 2 | Host1 | 1 | 2GB
Tomcat 1, 3, 5, 7, and 9 | Host1 | 2 | 8GB
Tomcat 2, 4, 6, 8, and 10 | Host2 | 2 | 8GB
MongoDB 1 and 3 | Host2 | 1 | 32GB
MongoDB 2 and 4 | Host1 | 1 | 32GB
PostgreSQL | Host3 | 6 | 32GB

Table 2. Virtual machine configuration
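As a quick sanity check of the sizing rule, the short Python sketch below totals the vCPUs assigned to each host in Table 2 and compares them with the physical core count. It assumes that "8 Core" in Table 1 is per socket (16 physical cores per two-socket host) and that all ten Tomcat VMs are powered on; the variable and function names are ours.

```python
PHYSICAL_CORES_PER_HOST = 2 * 8  # assumed: two sockets with 8 cores each

# (service, host, number of VMs, vCPUs per VM), transcribed from Table 2
VMS = [
    ("HAProxy 1", "Host1", 1, 2),
    ("HAProxy 2", "Host2", 1, 2),
    ("HAProxy 3", "Host3", 1, 2),
    ("Nginx 1-3", "Host3", 3, 2),
    ("RabbitMQ 1", "Host2", 1, 1),
    ("RabbitMQ 2", "Host1", 1, 1),
    ("Tomcat 1,3,5,7,9", "Host1", 5, 2),
    ("Tomcat 2,4,6,8,10", "Host2", 5, 2),
    ("MongoDB 1,3", "Host2", 2, 1),
    ("MongoDB 2,4", "Host1", 2, 1),
    ("PostgreSQL", "Host3", 1, 6),
]


def vcpus_per_host(vms):
    """Sum the vCPUs of all VMs placed on each host."""
    totals = {}
    for _service, host, count, vcpus in vms:
        totals[host] = totals.get(host, 0) + count * vcpus
    return totals


if __name__ == "__main__":
    for host, total in sorted(vcpus_per_host(VMS).items()):
        status = "ok" if total <= PHYSICAL_CORES_PER_HOST else "overcommitted"
        print(f"{host}: {total} vCPUs of {PHYSICAL_CORES_PER_HOST} cores ({status})")
```

Under those assumptions the totals are 15, 15, and 14 vCPUs for Host1, Host2, and Host3, so no host has more vCPUs than physical cores.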

Figure 5 shows the peak load that can be supported by this deployment configuration as the number of application servers is scaled from one to ten.  The peak load supported by a configuration is the maximum load at which the configuration can satisfy all of the QoS requirements of the workload.  The dotted line shows linear scaling of the maximum load extrapolated from the single application server result.  The actual scaling is essentially linear up to six application-server VMs.  At that point, the overall utilization of the physical servers starts to affect the ability to maintain linear scaling.  With seven application servers, the web-server tier becomes a scalability bottleneck, but there are not sufficient CPU cores available to add additional web servers.

It would require additional infrastructure to determine how far the linear scaling could be extended.  However, the current results provide strong evidence that with sufficient resources, Zephyr will be able to scale to support very large loads representing large numbers of users.

Figure 5. Maximum supported users for increasing number of application servers
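The two quantities discussed above, peak load and the linear-scaling reference line, can be expressed in a few lines of Python. This is a hedged sketch: `find_peak_load`, `scaling_efficiency`, and the `passes_qos` callback are hypothetical names, and no measured values from Figure 5 are reproduced here.

```python
def find_peak_load(passes_qos, candidate_loads):
    """Largest candidate load (in users) whose run passes all QoS checks."""
    passing = [load for load in candidate_loads if passes_qos(load)]
    return max(passing) if passing else None


def scaling_efficiency(peaks):
    """Ratio of each measured peak to the linear extrapolation.

    peaks[i] is the peak load with i + 1 application servers; the linear
    reference is the single-server peak multiplied by the server count.
    """
    base = peaks[0]
    return [peak / (base * n) for n, peak in enumerate(peaks, start=1)]
```

An efficiency near 1.0 corresponds to a point on the dotted line; values that drift below 1.0 indicate the sub-linear region described for the larger configurations.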


The discussion in this post has focused on the use of Zephyr as a traditional single-application benchmark with a focus on throughput and response-time performance metrics.  However, that only scratches the surface of our future plans for Zephyr.  We are currently working on extending Zephyr to capture more cloud-centric performance metrics.  These fall into two broad categories that we call multi-tenancy metrics and elasticity metrics.  Multi-tenancy metrics capture the performance characteristics of a cloud-deployed application in the presence of other applications co-located on the same physical resources.  The relevant performance metrics include isolation and fairness along with the traditional throughput and response-time metrics.  Elasticity metrics capture the performance characteristics of self-scaling applications in the presence of changing loads.  It is also possible to study elasticity metrics in the context of multi-tenancy environments, thus examining the impact of shared resources on the ability of an application to scale in a timely manner to satisfy user demands.  These are all exciting new areas of application performance, and we will have more to say about these subjects as we approach Zephyr 1.0.

Virtual SAN 6.0 Performance: Scalability and Best Practices

A technical white paper about Virtual SAN performance has been published. This paper provides guidelines on how to get the best performance for applications deployed on a Virtual SAN cluster.

We used Iometer to generate several workloads that simulate various I/O encountered in Virtual SAN production environments. These are shown in the following table.

Type of I/O workload | Size (1KiB = 1024 bytes) | Mixed Ratio | Shows / Simulates
All Read | 4KiB | | Maximum random read IOPS that a storage solution can deliver
Mixed Read/Write | 4KiB | 70% / 30% | Typical commercial applications deployed in a VSAN cluster
Sequential Read | 256KiB | | Video streaming from storage
Sequential Write | 256KiB | | Copying bulk data to storage
Sequential Mixed R/W | 256KiB | 70% / 30% | Simultaneous read/write copy from/to storage
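For readers who want to script against these definitions, the table can be captured as data. The Python sketch below is illustrative only: the `Workload` class is ours, it assumes that "70% / 30%" means 70% reads and 30% writes, and it assumes the 4KiB mixed workload is random access (the table states this only for the All Read case).

```python
from dataclasses import dataclass

KIB = 1024  # bytes, as defined in the table header


@dataclass(frozen=True)
class Workload:
    name: str
    block_kib: int
    read_pct: int      # percentage of I/Os that are reads
    sequential: bool

    @property
    def block_bytes(self):
        return self.block_kib * KIB

    def throughput_mib_s(self, iops):
        """Convert an IOPS figure to MiB/s for this block size."""
        return iops * self.block_bytes / (1024 * 1024)


WORKLOADS = [
    Workload("All Read", 4, 100, sequential=False),
    Workload("Mixed Read/Write", 4, 70, sequential=False),  # assumed random
    Workload("Sequential Read", 256, 100, sequential=True),
    Workload("Sequential Write", 256, 0, sequential=True),
    Workload("Sequential Mixed R/W", 256, 70, sequential=True),
]
```

The conversion helper shows why the two block sizes stress different things: at the same IOPS, a 256KiB workload moves 64 times as much data as a 4KiB one.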

In addition to these workloads, we studied Virtual SAN caching tier designs and the effect of Virtual SAN configuration parameters on the Virtual SAN test bed.

Virtual SAN 6.0 can be configured in two ways: Hybrid and All-Flash. Hybrid uses a combination of hard disks (HDDs) to provide storage and a flash tier (SSDs) to provide caching. The All-Flash solution uses all SSDs for storage and caching.

Tests show that the Hybrid Virtual SAN cluster performs extremely well when the working set is fully cached for random access workloads, and also for all sequential access workloads. The All-Flash Virtual SAN cluster, which performs well for random access workloads with large working sets, may be deployed in cases where the working set is too large to fit in a cache. All workloads scale linearly in both types of Virtual SAN clusters—as more hosts and more disk groups per host are added, Virtual SAN sees a corresponding increase in its ability to handle larger workloads. Virtual SAN offers an excellent way to scale up the cluster as performance requirements increase.

You can download Virtual SAN 6.0 Performance: Scalability and Best Practices from the VMware Performance & VMmark Community.

VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines

vSphere 6 introduces the ability to run virtual machines (VMs) with up to 128 virtual CPUs (vCPUs) and 4TB of RAM. This doubles the number of vCPUs supported in the previous version and quadruples the amount of RAM. This new capability provides the potential for customers to run larger workloads than ever before in a virtual machine.

A series of tests were run with a virtual machine hosting Oracle 12c database instances. The DVD Store 2.1 open-source transactional workload was used to measure the performance of a large “Monster” VM on vSphere 6. The Oracle 12c database VM was scaled from 15 vCPUs all the way up to 120 vCPUs, and the maximum achieved throughput was measured. The full results and test details have been published in a white paper – VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines.

A four-socket Intel Xeon E7-4890 v2 processor based server with 1TB of memory was used to host the virtual machine for the tests.  Each Xeon E7-4890 v2 processor has 15 cores / 30 threads with Hyper-Threading enabled, for a total of 60 cores / 120 threads for the system. The diagram below shows the basic test configuration.



In all tests Hyper-Threading was enabled on the server, but in configurations where 60 or fewer vCPUs are assigned to the VM, Hyper-Threads are not used by the VM. This is a result of the default scheduling policy, where the preference is for vCPUs to be scheduled on one thread per core before using the second thread of any core. This first set of results, shown below, is focused on the tests that scale up to 60 vCPUs. These tests show the scaling for the virtual machine without the use of Hyper-Threads.
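The effect of that preference can be illustrated with a deliberately simplified placement model in Python. This is not the ESXi scheduler, which makes dynamic decisions at run time; `place_vcpus` and `uses_second_thread` are hypothetical helpers that only show why a VM with up to 60 vCPUs never needs a second thread on this 60-core server.

```python
def place_vcpus(num_vcpus, num_cores, threads_per_core=2):
    """Assign vCPUs to (core, thread) slots, one thread per core first.

    Thread 0 of every core is filled before thread 1 of any core is used.
    """
    if num_vcpus > num_cores * threads_per_core:
        raise ValueError("more vCPUs than hardware threads")
    return [(i % num_cores, i // num_cores) for i in range(num_vcpus)]


def uses_second_thread(placement):
    """True if any vCPU landed on the second thread of a core."""
    return any(thread > 0 for _core, thread in placement)
```

In this model, `place_vcpus(60, 60)` uses only thread 0 of each core, while `place_vcpus(120, 60)` occupies both threads of every core.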


While vSphere 6 supports up to 128 vCPUs per VM, these tests were limited to 120 vCPUs due to the number of threads available on the server. The largest VM configuration used both hardware execution threads (Hyper-Threads) on all the processor cores in order to reach 120 vCPUs. In this case, there is one vCPU per execution thread.

Hyper-Threading doubles the number of execution threads, but it does not double performance. In order to measure the scale-up performance of the 120-vCPU VM, a 60-vCPU VM was configured with CPU affinity so that it was limited to only two of the server’s four sockets. In this configuration the 60-vCPU VM has one vCPU per execution thread, which is the same as the 120-vCPU VM.  Configuring a 60-vCPU VM in this way makes it easy to see the scale-up performance at 120 vCPUs on this server with Hyper-Threading enabled.
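The arithmetic behind this comparison is easy to verify. The Python sketch below uses the server topology given earlier (4 sockets, 15 cores per socket, 2 threads per core); the helper names and the `scale_up_factor` function are ours, and no throughput numbers from the tests are assumed.

```python
CORES_PER_SOCKET = 15
THREADS_PER_CORE = 2  # Hyper-Threading enabled


def hardware_threads(sockets):
    """Execution threads available on the given number of sockets."""
    return sockets * CORES_PER_SOCKET * THREADS_PER_CORE


def vcpus_per_thread(vcpus, sockets):
    """How many vCPUs share each execution thread in a configuration."""
    return vcpus / hardware_threads(sockets)


def scale_up_factor(throughput_small, throughput_large):
    """Measured speedup; 2.0 would be perfectly linear for 60 -> 120 vCPUs."""
    return throughput_large / throughput_small
```

Both configurations come out at exactly one vCPU per execution thread (60 vCPUs on 60 threads, 120 vCPUs on 120 threads), so the comparison isolates the effect of doubling the hardware rather than the effect of Hyper-Threading.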

The results of the scale-up testing using the 60-vCPU VM configured with CPU affinity to only 2 sockets and the 120-vCPU VM using all four sockets showed approximately linear scaling, as shown in the graph below.


For full test details and more test results, please see the white paper that was recently published.

The new larger “Monster” VM support in vSphere 6 allows for virtual machines that can support larger workloads than ever before with excellent performance. These tests show that large virtual machines running on vSphere 6 can scale up as needed to meet extreme performance demands.


Improvements in Network I/O Control for vSphere 6

Network I/O Control (NetIOC) in VMware vSphere 6 has been enhanced to support a number of exciting new features such as bandwidth reservations. A new paper published by the Performance Engineering team shows the performance of these new features. The paper also explores the performance impact of the new NetIOC algorithm. Later tests show that NetIOC offers a powerful way to achieve network resource isolation at minimal cost, in terms of latency and CPU utilization.

You can read the paper here.