Data scientists may use GPUs on vSphere that are dedicated to use by one virtual machine only for their modeling work, if they need to. Certain heavier machine learning workloads may well require that dedicated approach. However, there are also many ML workloads and user types that do not use a dedicated GPU continuously to its maximum capacity. This presents an opportunity for shared use of a physical GPU by more than one virtual machine/user. This article explores the performance of a shared-GPU setup like this, supported by the NVIDIA GRID product on vSphere, and presents performance test results that show that sharing is a feasible approach. The other technical reasons for sharing a GPU among multiple VMs are also described here. The article also gives best practices for determining how the sharing of a GPU may be done.
VMware vSphere supports NVIDIA GRID technology for multiple types of workloads. This technology virtualizes GPUs via a mediated passthrough mechanism. Initially, NVIDIA GRID supported GPU virtualization for graphics workloads only. But, since the introduction of Pascal GPU, NVIDIA GRID has supported GPU virtualization for both graphics and CUDA/machine learning workloads. With this support, multiple VMs running GPU-accelerated workloads like machine learning/deep learning (ML/DL) based on TensorFlow, Keras, Caffe, Theano, Torch, and others can share a single GPU by using a vGPU provided by GRID. This brings benefits in multiple use cases that we discuss on this post.
With the release of vSphere 6.7, VMware added iSER (iSCSI Extensions for RDMA) as a native supported storage protocol to ESXi. With iSER run over iSCSI, users can boost their vSphere performance just by replacing the regular NICs with RDMA-capable NICs. RDMA (Remote Direct Memory Access) allows the transfer of memory from one computer to another. This is a direct transfer and minimizes CPU/kernel involvement. By bypassing the kernel, we get extremely high I/O bandwidth and low latency. (To use RDMA, you must have an HCA/Host Channel Adapter device on both the source and destination.) In this blog, we compare standard iSCSI performance vs. iSER performance to see how iSER can release the full potential of your iSCSI storage.
A new white paper is available comparing Spark machine learning performance on an 8-server on-premises cluster vs. a similarly configured VMware Cloud on AWS cluster.
Here is what the VMware Cloud on AWS cluster looked like:
VMware Cloud on AWS configuration for performance tests
Three standard analytic programs from the Spark machine learning library (MLlib), K-means clustering, Logistic Regression classification, and Random Forest decision trees, were driven using spark-perf. In addition, a new, VMware-developed benchmark, IoT Analytics Benchmark, which models real-time machine learning on Internet-of-Things data streams, was used in the comparison. The benchmark is available from GitHub.
We published a paper that shows how VMware is helping advance PMEM technology by driving the virtualization enhancements in vSphere 6.7. The paper gives a detailed performance analysis of using PMEM technology on vSphere using various workloads and scenarios.
These are the key points that we cover in this white paper:
We explain how PMEM can be configured and used in a vSphere environment.
We show how applications with different characteristics can take advantage of PMEM in vSphere. Below are some of the use-cases:
How PMEM device limits can be achieved under vSphere with little to no overhead of virtualization. We show virtual-to-native ratio along with raw bandwidth and latency numbers from fio, an I/O microbenchmark.
How traditional relational databases like Oracle can benefit from using PMEM in vSphere.
How scaling-out VMs in vSphere can benefit from PMEM. We used Sysbench with MySQL to show such benefits.
How modifying applications (PMEM-aware) can get the best performance out of PMEM. We show performance data from such applications, e.g., an OLTP database like SQL Server and an in-memory database like Redis.
Using vMotion to migrate VMs with PMEM which is a host-local device just like NVMe SSDs. We also characterize in detail, vMotion performance of VMs with PMEM.
We outline some best practices on how to get the most out of PMEM in vSphere.
You’ve probably already heard about VMware Cloud on Amazon Web Services (VMC on AWS). It’s the same vSphere platform that has been running business critical applications for years, but now it’s available on Amazon’s cloud infrastructure. Following up on the many tests that we have done with Oracle databases on vSphere, I was able to get some time on a VMC on AWS setup to see how Oracle databases perform in this new environment.
It is important to note that VMC on AWS is vSphere running on bare metal servers in Amazon’s infrastructure. The expectation is that performance will be very similar to “regular” onsite vSphere, with the added advantage that the hardware provisioning, software installation, and configuration is already done and the environment is ready to go when you login. The vCenter interface is the same, except that it references the Amazon instance type for the server.
In a recent blog, the VMware vSphere team shared the following performance improvements in vSphere 6.7 vs. 6.5:
Moreover, with vSphere 6.7 vCSA delivers phenomenal performance improvements (all metrics compared at cluster scale limits, versus vSphere 6.5): 2X faster performance in vCenter operations per second 3X reduction in memory usage 3X faster DRS-related operations (e.g. power-on virtual machine)
As senior engineers within the VMware Performance and vSphere teams, we are writing this blog to provide more details regarding these numbers and to explain how we measured them. We also briefly explain some of the technical details behind these improvements.
This article is by Hari Sivaraman, Uday Kurkure, and Lan Vu from the Performance Engineering team at VMware.
Performance Comparison of Containerized Machine Learning Applications
Docker containers  are rapidly becoming a popular environment in which to run different applications, including those in machine learning [1, 2, 3]. NVIDIA supports Docker containers with their own Docker engine utility, nvidia-docker , which is specialized to run applications that use NVIDIA GPUs.
The nvidia-docker container for machine learning includes the application and the machine learning framework (for example, TensorFlow ) but, importantly, it does not include the GPU driver or the CUDA toolkit.
Docker containers are hardware agnostic so, when an application uses specialized hardware like an NVIDIA GPU that needs kernel modules and user-level libraries, the container cannot include the required drivers. They live outside the container.
In our third episode of machine learning performance with vSphere 6.x, we look at the virtual GPU vs. the physical GPU. In addition, we extend the performance results of machine learning workloads using VMware DirectPath I/O (passthrough) vs. NVIDIA GRID vGPU that have been partially addressed in previous episodes:
Performance is one of the biggest concerns that keeps high performance computing (HPC) users from choosing virtualization as the solution for deploying HPC applications despite virtualization benefits such as reduced administration costs, resource utilization efficiency, energy saving, and security. However, with the constant evolution of virtualization technologies, the performance gaps between bare metal and virtualization have almost disappeared, and, in some use cases, virtualized applications can achieve better performance than running on bare metal because of the intelligent and highly optimized resource utilization of hypervisors. For example, a prior study  shows that vector machine applications running on a virtualized cluster of 10 servers have a better execution time than running on bare metal.
A new white paper is available showing how to best deploy and configure vSphere 6.5 for Big Data applications such as Hadoop and Spark running on a cluster with fast processors, large memory, and all-flash storage (Non-Volatile Memory Express storage and solid state disks). Hardware, software, and vSphere configuration parameters are documented, as well as tuning parameters for the operating system, Hadoop, and Spark.
The best practices were tested on a 13-server cluster, with Hadoop installed on vSphere as well as on bare metal. Workloads for both Hadoop (TeraSort and TestDFSIO) and Spark Machine Learning Library routines (K-means clustering, Logistic Regression classification, and Random Forest decision trees) were run on the cluster. Configurations with 1, 2, and 4 VMs per host were tested as well as bare metal. Among the 4 virtualized configurations, 4 VMs per host ran fastest due to the best utilization of storage as well as the highest percentage of data transfer within a server. The 4 VMs per host configuration also ran faster than bare metal on all Hadoop and Spark tests but one.
VMmark 3.0, VMware’s multi-host virtualization benchmark is generally available here. VMmark3 is a free cluster-level benchmark that measures the performance, scalability, and power of virtualization platforms.
VMmark3 leverages much of previous VMmark generations’ technologies and design. It continues to utilize a unique tile-based heterogeneous workload application design. It also deploys the platform-level workloads found in VMmark2 such as vMotion, Storage vMotion, and Clone & Deploy. In addition to incorporating new and updated application workloads and infrastructure operations, VMmark3 also introduces a new fully automated provisioning service that greatly reduces deployment complexity and time.