
Tag Archives: Performance

ESX IP Storage Troubleshooting Best Practice White Paper

We have published an ESX IP Storage Troubleshooting Best Practice white paper, in which we recommend that vSphere customers deploying ESX IP storage over 10G networks include 10G packet capture systems as a best practice to ensure network visibility.

The white paper explores the challenges and alternatives for packet capture in a vSphere environment with IP storage (NFS, iSCSI) datastores over a 10G network, and explains why traditional techniques for capturing packet traces on 1G networks suffer severe limitations (capture drops and inaccurate timestamps) when applied to 10G networks. Although commercial 10G packet capture systems are readily available, they may be beyond the budget of some vSphere customers. We present the design of a self-assembled 10G packet capture solution that can be built relatively inexpensively from commercial components. The self-assembled solution is optimized for common troubleshooting scenarios in which short-duration packet captures satisfy most analysis requirements.

Our experience troubleshooting a large number of IP storage issues has shown that the ability to capture and analyze packet traces in an ESX IP storage environment can significantly reduce the mean time to resolution for serious functional and performance issues. When reporting an IP storage problem to VMware or to a storage array vendor, an accompanying packet trace file is a great piece of evidence that can substantially cut the time the responsible engineering teams need to identify the problem.
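Once a trace file is in hand, even a quick scripted pass can surface problems before a full analysis. Below is a minimal post-capture sketch in Python, assuming the scapy package and a hypothetical capture file named trace.pcap; it flags suspected TCP retransmissions on the NFS port, one common symptom in IP storage issues.

    # Minimal post-capture sanity check (assumes scapy; trace.pcap is a placeholder).
    from scapy.all import rdpcap, IP, TCP

    packets = rdpcap("trace.pcap")
    seen = set()
    retransmissions = 0
    for pkt in packets:
        if IP in pkt and TCP in pkt and 2049 in (pkt[TCP].sport, pkt[TCP].dport):
            payload_len = len(bytes(pkt[TCP].payload))
            if payload_len == 0:
                continue  # skip pure ACKs
            key = (pkt[IP].src, pkt[IP].dst, pkt[TCP].sport, pkt[TCP].seq, payload_len)
            if key in seen:
                retransmissions += 1  # same segment seen twice: likely retransmitted
            seen.add(key)
    print(f"suspected NFS retransmissions: {retransmissions}")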

Performance Comparison of Containerized Machine Learning Applications Running Natively with Nvidia vGPUs vs. in a VM – Episode 4

This article is by Hari Sivaraman, Uday Kurkure, and Lan Vu from the Performance Engineering team at VMware.

Performance Comparison of Containerized Machine Learning Applications

Docker containers [6] are rapidly becoming a popular environment in which to run different applications, including those in machine learning [1, 2, 3]. NVIDIA supports Docker containers with its own Docker engine utility, nvidia-docker [7], which is specialized to run applications that use NVIDIA GPUs.

The nvidia-docker container for machine learning includes the application and the machine learning framework (for example, TensorFlow [5]) but, importantly, it does not include the GPU driver or the CUDA toolkit.

Docker containers are hardware agnostic, so when an application uses specialized hardware like an NVIDIA GPU, which needs kernel modules and user-level libraries, the container cannot include the required drivers; they live outside the container.

One workaround is to install the driver inside the container and map its device nodes upon launch. This workaround is not portable, since the driver version inside the container must match the version in the native operating system.

The nvidia-docker engine utility provides an alternate mechanism that mounts the user-mode driver components into the container at launch, but this requires you to install the driver and CUDA in the native operating system beforehand. Both approaches have drawbacks, but the latter is clearly preferable.
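To make the contrast concrete, here is a hedged sketch of the two launch styles, driven from Python; the image names are hypothetical, and the device-node list varies by host:

    # Sketch of the two container launch styles discussed above.
    import subprocess

    # Workaround: map the NVIDIA device nodes explicitly. The user-level driver
    # libraries must already be baked into the image, and their version must
    # match the kernel module on the host.
    subprocess.run([
        "docker", "run", "--rm",
        "--device=/dev/nvidia0",
        "--device=/dev/nvidiactl",
        "--device=/dev/nvidia-uvm",
        "mnist-with-bundled-driver:latest",
    ], check=True)

    # nvidia-docker: the wrapper injects the host's user-mode driver components
    # at launch, so the image itself stays driver-agnostic.
    subprocess.run(["nvidia-docker", "run", "--rm", "mnist-trainer:latest"], check=True)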

In this episode of our series of blogs [8, 9, 10] on machine learning in vSphere using GPUs, we compare the performance of MNIST [4] running in a container on CentOS executing natively with that of MNIST running in a container inside a CentOS VM on vSphere. Based on our experiments, we demonstrate that running containers in a virtualized environment, like a CentOS VM on vSphere, suffers no performance penalty, while benefiting from the tremendous management capabilities offered by the VMware vSphere platform.
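For readers unfamiliar with the workload, a minimal MNIST training job using TensorFlow's Keras API looks roughly like the sketch below. It illustrates the class of job we ran, not our exact benchmark code:

    # Minimal MNIST training sketch (illustrative only; assumes TensorFlow is installed).
    import tensorflow as tf

    # Load and normalize the 28x28 handwritten-digit images.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # A small fully connected classifier; on a GPU-enabled build, the heavy math
    # runs on the (v)GPU whether the process is in a container, a VM, or both.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5)
    print(model.evaluate(x_test, y_test))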

Experiment Configuration and Methodology

We used MNIST [4] to compare the performance of containers running natively with containers running inside a VM. The configuration of the VM and the vSphere server we used for the “virtualized container” is shown in Table 1. The configuration of the physical machine used to run the container natively is shown in Table 2.

vSphere: 6.0.0, build 3500742
NVIDIA vGPU driver: 367.53
Guest OS: CentOS Linux release 7.4.1708 (Core)
CUDA driver: 8.0
CUDA runtime: 7.5
Docker: 17.09-ce-rc2

Table 1. Configuration of the VM used to run the nvidia-docker container

NVIDIA driver: 384.98
Operating system: CentOS Linux release 7.4.1708 (Core)
CUDA driver: 8.0
CUDA runtime: 7.5
Docker: 17.09-ce-rc2

Table 2. Configuration of the physical machine used to run the nvidia-docker container

The server configuration we used is shown in Table 3 below. In our experiments, we used the NVIDIA M60 GPU in vGPU mode only; we did not use DirectPath I/O mode. In the scenario in which we ran the container inside the VM, we first installed the NVIDIA vGPU drivers in vSphere and inside the VM, then installed CUDA (driver 8.0 with runtime version 7.5), followed by Docker and nvidia-docker [7]. In the scenario in which we ran the container natively, we installed the NVIDIA driver in CentOS running natively, followed by CUDA (driver 8.0 with runtime version 7.5), Docker, and finally nvidia-docker [7]. In both scenarios we ran MNIST and measured the wall-clock run time for training.
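The timing itself needs nothing elaborate; a sketch of the kind of wall-clock harness that suffices (the container image name is a placeholder):

    # Wall-clock timing of a containerized training run; image name is hypothetical.
    import subprocess
    import time

    start = time.perf_counter()
    subprocess.run(["nvidia-docker", "run", "--rm", "mnist-trainer:latest"], check=True)
    elapsed = time.perf_counter() - start
    print(f"wall-clock run time: {int(elapsed // 60)} min {elapsed % 60:.0f} s")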

 Figure 1. Testbed configuration for comparison of the performance of containers running natively vs. running in a VM

Model: Dell PowerEdge R730
Processor type: Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz
CPU cores: 24 CPUs, each @ 2.5GHz
Processor sockets: 2
Cores per socket: 12
Logical processors: 48
Hyperthreading: Active
Memory: 768GB
Storage: Local SSD (1.5TB), storage arrays, local hard disks
GPUs: 2x NVIDIA Tesla M60

Table 3. Server configuration


The measured wall-clock run times for MNIST are shown in Table 4 for the two scenarios we tested:

  1. Running in an nvidia-docker container in CentOS running natively.
  2. Running in an nvidia-docker container inside a CentOS VM on vSphere.

From the data, we can clearly see that there is no measurable performance penalty for running a container inside a VM as compared to running it natively.

Run time for MNIST as measured by a wall clock:
  • Nvidia-docker container in CentOS running natively: 44 minutes 53 seconds
  • Nvidia-docker container running in a CentOS VM on vSphere: 44 minutes 57 seconds

Table 4. Comparison of the run time for MNIST running in a container on native CentOS vs. in a container in virtualized CentOS


  • Based on the results shown in Table 4, there is no measurable performance impact from running a containerized application in a virtual environment as opposed to running it natively. From a performance perspective, there is no penalty for using a virtualized environment.
  • It is important to note that since containers include neither the GPU driver nor the CUDA environment, both of these components need to be installed separately. It is in this respect that a virtualized environment offers a superior user experience: an nvidia-docker container in CentOS running natively requires that any existing GPU and CUDA drivers be removed if their versions do not match those required by the container, and uninstalling and reinstalling the correct drivers is often a challenging and time-consuming task. In a virtualized environment, however, you can create in advance, and store in a repository, a number of CentOS VMs with different vGPU and CUDA drivers. When you need to run an application in an nvidia-docker container, you simply clone the VM with the correct drivers, load the container, and run it with no performance penalty (see the sketch after this list). Running in a virtualized environment thus does not require you to uninstall and reinstall drivers, which saves both time and considerable frustration. The driver reinstallation issue becomes considerably more difficult in a native environment with multiple container users on the system: either all the containers must be migrated to the new drivers, or the user who needs a new driver must wait until all the other users are done before a system administrator can upgrade the GPU drivers on the native CentOS.
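As an illustration of that clone-and-run workflow, here is a hedged sketch using pyVmomi, the vSphere Python SDK; the vCenter host, credentials, and VM names are placeholders, and task-completion handling is omitted:

    # Hedged sketch: clone a prebuilt CentOS VM whose vGPU + CUDA driver stack
    # matches the container's requirements, using pyVmomi (the vSphere Python SDK).
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab setup only; use real certs in production
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="secret", sslContext=ctx)
    content = si.RetrieveContent()

    def find_vm(name):
        """Return the first VM with the given name, or None."""
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        try:
            return next((vm for vm in view.view if vm.name == name), None)
        finally:
            view.Destroy()

    # Pick the stored VM whose drivers match the container we want to run.
    template = find_vm("centos74-vgpu367-cuda8")
    spec = vim.vm.CloneSpec(location=vim.vm.RelocateSpec(), powerOn=True)
    task = template.Clone(folder=template.parent, name="mnist-run-01", spec=spec)
    # ...wait for the task to complete, then launch the nvidia-docker container
    # inside the clone.
    Disconnect(si)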

Future Work

In this blog, we presented the performance results of running MNIST in a single container. We plan to run MNIST in multiple containers running concurrently in both a virtualized environment and on CentOS executing natively, and report the measured run times. This will provide a comparison of the performance as we scale up the number of containers.


References

  1. Google Cloud Platform: Cloud AI. https://cloud.google.com/products/machine-learning/
  2. Wikipedia: Deep Learning. https://en.wikipedia.org/wiki/Deep_learning
  3. NVIDIA GPUs – The Engine of Deep Learning. https://developer.nvidia.com/deep-learning
  4. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
  5. TensorFlow: An Open-Source Software Library for Machine Intelligence. https://www.tensorflow.org
  6. Wikipedia: Operating-System-Level Virtualization. https://en.wikipedia.org/wiki/Operating-system-level_virtualization
  7. NVIDIA Docker: GPU Server Application Deployment Made Easy. https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/
  8. Episode 1: Performance Results of Machine Learning with DirectPath I/O and GRID vGPU. https://blogs.vmware.com/performance/2016/10/machine-learning-vsphere-nvidia-gpus.html
  9. Episode 2: Machine Learning on vSphere 6 with NVIDIA GPUs. https://blogs.vmware.com/performance/2017/03/machine-learning-vsphere-6-5-nvidia-gpus-episode-2.html
  10. Episode 3: Performance Comparison of Native GPU to Virtualized GPU and Scalability of Virtualized GPUs for Machine Learning. https://blogs.vmware.com/performance/2017/10/episode-3-performance-comparison-native-gpu-virtualized-gpu-scalability-virtualized-gpus-machine-learning.html 

Performance of SQL Server 2017 for Linux VMs on vSphere 6.5

Microsoft SQL Server has long been one of the most popular applications to run on vSphere virtual machines. Last year there was quite a bit of excitement when Microsoft announced it was bringing SQL Server to Linux, and interest has remained strong since; at Microsoft Ignite last month, Microsoft announced that SQL Server 2017 for Linux is officially launched and generally available.

VMware and Microsoft have collaborated to validate and support the functionality and performance scalability of SQL Server 2017 on vSphere-based Linux VMs. The results of that work show that SQL Server 2017 for Linux installs easily and performs well in VMware vSphere virtual machines, which makes vSphere a great environment in which to try out the new Linux version of SQL Server.
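As an example of how quickly you can kick the tires, a first connectivity check from Python might look like the hedged sketch below; it assumes the Microsoft ODBC driver and the pyodbc package are installed, and the server name and credentials are placeholders:

    # Quick connectivity check against SQL Server 2017 for Linux in a vSphere VM.
    # Driver name, server, and credentials are placeholders for your environment.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sql-linux-vm.example.com;DATABASE=master;"
        "UID=sa;PWD=YourStrong!Passw0rd"
    )
    row = conn.cursor().execute("SELECT @@VERSION").fetchone()
    print(row[0])  # should report SQL Server 2017 ... on Linux
    conn.close()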

Using CDB, a cloud database benchmark developed by the Microsoft SQL Server team, we were able to verify that the performance of SQL Server for Linux in a vSphere virtual machine was similar to that of other operating systems and platforms, both virtualized and non-virtualized.

Our initial reference test size was relatively small, so we wanted to test larger sizes to see how well SQL Server 2017 for Linux performed as the VM size was scaled up. For the test, we used a four-socket Intel Xeon E7-8890 v4 (Broadwell)-based server with 96 cores (24 cores per socket). The initial test began with a 24-virtual-CPU VM to match the number of physical cores of a single socket. Additional tests were run by increasing the size of the VM by 24 vCPUs for each test until, in the final test, the VM had 96 total vCPUs. We configured the virtual machine with 512 GB of RAM and separate log and data disks on an SSD-based Fibre Channel SAN. We used the same best practices for SQL Server for Linux as we normally use for the Windows version, as documented in our published best practices guide for SQL Server on vSphere.

The results showed that SQL Server 2017 for Linux scaled very well as the additional vCPUs were added to the virtual machine. SQL Server 2017 for Linux is capable of scaling up to handle very large databases on VMware vSphere 6.5 Linux virtual machines.

Skylake Update – Oracle Database Performance on vSphere 6.5 Monster Virtual Machines

We were able to get one of the new four-socket Intel Skylake-based servers and run some more tests. Specifically, we used the Xeon Platinum 8180 processors with 28 cores each. The new data has been added to the Oracle Monster Virtual Machine Performance on VMware vSphere 6.5 white paper. Please check out the paper for the full details and context of these updates.

The generational testing in the paper now includes a fifth generation, with a 112-vCPU virtual machine running on the Skylake-based server. The performance gain from the initial 40-vCPU VM on Westmere-EX to the Skylake-based 112-vCPU VM is almost 4x.

The Hyper-Threading results were also updated and show a 27% performance gain from the use of Hyper-Threads. The test was conducted by running two 112-vCPU VMs at the same time, so that all 224 logical threads were active; the total throughput from the two VMs was then compared with the throughput from a single VM.
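In formula form, with T denoting the measured benchmark throughput in each case:

    \[
      \mathrm{gain} \;=\; \frac{T_{\mathrm{VM1}} + T_{\mathrm{VM2}}}{T_{\mathrm{single\ VM}}} \;-\; 1 \;\approx\; 0.27
    \]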

My colleague David Morse has also updated his SQL Server monster virtual machine white paper with Skylake data.

Updated – SQL Server VM Performance with vSphere 6.5, October 2017

Back in March, I published a study of SQL Server performance with vSphere 6.5 across multiple processor generations. Since then, Intel has released a brand-new processor architecture: the Xeon Scalable platform, formerly codenamed Skylake.

Our team was fortunate enough to get early access to a server with these new processors inside – just in time for generating data that we presented to customers at VMworld 2017.

Each Xeon Platinum 8180 processor has 28 physical cores (pCores), and with four processors in the server, there were a whopping 112 pCores in one physical host! As you can see in the figure below, that extra horsepower provides nice database server performance scaling:

Figure: Generational SQL Server VM Database Performance

For more details and the test results, take a look at the updated paper:
Performance Characterization of Microsoft SQL Server on VMware vSphere 6.5

Highlights from: The Extreme Performance Series at VMworld 2017

Thank you to everyone who attended VMworld 2017, and especially to those who participated in this year’s Extreme Performance Series. While the conference has wrapped up, the content created and presented there is now available to everyone! So grab a chair and advance your performance skill set.


Performance of Enterprise Web Applications in Docker Containers on VMware vSphere 6.5

Docker containers are growing in popularity as a deployment platform for enterprise applications, yet the performance impact of running these applications in Docker containers on virtualized infrastructure is not well understood. A new white paper uses the open-source Weathervane performance benchmark to investigate the performance of an enterprise web application running in Docker containers in VMware vSphere 6.5 virtual machines (VMs). The results show that an enterprise web application can run in Docker on a VMware vSphere environment with no degradation of performance, and can in fact perform better than a Docker installation on bare metal.

Weathervane is used to evaluate the performance of virtualized and cloud infrastructures by deploying an enterprise web application on the infrastructure and then driving a load against the application. The tests discussed in the paper use three different deployment configurations for the Weathervane application:

  • VMs without Docker containers: The application runs directly in the guest operating systems in vSphere 6.5 VMs, with no Docker containers.
  • VMs with Docker containers: The application runs in Docker containers, which run in guest operating systems in vSphere 6.5 VMs.
  • Bare-metal with Docker containers: The application runs in Docker containers, but the containers run in an operating system that is installed on a bare-metal server.

The figure below shows the peak results achieved when running the Weathervane benchmark in the three configurations.  The results using Docker containers include the impact of tuning options that are discussed in detail in the paper.

Some important things to note in these results:

  • The performance of the application using Docker containers in vSphere 6.5 VMs is almost identical to that of the same application running in VMs without Docker.
  • The application running in Docker containers in VMs outperforms the same application running in Docker containers on bare metal by about 5%. Most of this advantage can be attributed to the sophisticated algorithms employed by the vSphere 6.5 scheduler.

The results discussed in the paper, along with the results of previous investigations of Docker performance on vSphere, show that vSphere 6.5 is an ideal platform for deploying applications in Docker containers.

DRS Lens – A new UI dashboard for DRS

DRS Lens provides an alternative UI for a DRS-enabled cluster. It gives a simple yet powerful interface to monitor the cluster in real time and provides useful analyses to users. The UI comprises different dashboards, in the form of tabs, for each cluster being monitored.


The Extreme Performance Series at VMworld 2017

I’m excited to announce that the “Extreme Performance Series” is back for its 5th year, and with 7 additional sessions, it’s our largest year ever! These sessions are created and presented by VMware’s best and most distinguished performance engineers, principals, architects and gurus. You do not want to miss these advanced sessions.


Introducing VMmark3: A highly flexible and easily deployed benchmark for vSphere environments

VMmark 3.0, VMware’s multi-host virtualization benchmark, is generally available here. VMmark3 is a free cluster-level benchmark that measures the performance, scalability, and power of virtualization platforms.

VMmark3 leverages much of the technology and design of previous VMmark generations. It continues to use a unique tile-based heterogeneous workload application design, and it again deploys the platform-level workloads found in VMmark2, such as vMotion, Storage vMotion, and Clone & Deploy. In addition to incorporating new and updated application workloads and infrastructure operations, VMmark3 introduces a new, fully automated provisioning service that greatly reduces deployment complexity and time.
