vSGA, or Virtual Shared Graphics Acceleration, allows multiple VMware vSphere virtual machines to share hardware GPUs. In previous blog articles we have advocated the use of NVIDIA GRID vGPU technology, which is a good solution for many use cases. In this blog, we compare the performance of vGPU and vSGA, limiting our testing to a workload generated by VMware Horizon 7 VDI desktops. Based on our measurements (some of which we present in this blog), vSGA provides performance very close to vGPU across a variety of software applications, including Microsoft Office, Adobe Acrobat, CAD viewers, YouTube video, and viewing or working with WebGL-based images.
Leaving CPU Hot Add at its default setting of disabled is one of the performance best practices that we have for large VMs. From the Performance Best Practices Guide for vSphere 6.7 U2:
CPU Hot Add is a feature that allows the addition of vCPUs to a running virtual machine. Enabling this feature, however, disables vNUMA for that virtual machine, resulting in the guest OS seeing a single vNUMA node. Without vNUMA support, the guest OS has no knowledge of the CPU and memory virtual topology of the ESXi host. This in turn could result in the guest OS making sub-optimal scheduling decisions, leading to reduced performance for applications running in large virtual machines. For this reason, enable CPU Hot Add only if you expect to use it. Alternatively, plan to power down the virtual machine before adding vCPUs, or configure the virtual machine with the maximum number of vCPUs that might be needed by the workload. If choosing the latter option, note that unused vCPUs incur a small amount of unnecessary overhead. Unused vCPUs could also cause the guest OS to make poor scheduling decisions within the virtual machine, again with the potential for reduced performance. For additional information see VMware KB article 2040375.
The reason is that enabling CPU Hot Add disables virtual NUMA. The VM is then unaware of which of its vCPUs are on the same NUMA node, which might increase remote memory access. The guest OS and applications lose the ability to optimize for NUMA, with a possible reduction in performance.
Virtual NUMA (vNUMA) exposes NUMA topology to the guest operating system, allowing NUMA-aware guest operating systems and applications to make the most efficient use of the underlying hardware’s NUMA architecture. (For more information about NUMA, see page 27 in the Performance Best Practices Guide for vSphere 6.7 U2.)
To get an idea of the performance impact of enabling CPU Hot Add, we ran a simple test in our lab environment. With CPU Hot Add left at its default setting of disabled, performance was 2% to 8% better than with CPU Hot Add enabled.
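One way to audit this setting is to look for the CPU Hot Add key in a VM's .vmx file. The following is a minimal sketch, assuming the key is `vcpu.hotadd` and that a missing key means the default (disabled). The helper name `cpu_hot_add_enabled` and the sample VMX fragment are hypothetical and only serve as illustration.

```python
import re

# Assumed VMX key for CPU Hot Add; verify it against your own VM's .vmx file.
HOT_ADD_KEY = "vcpu.hotadd"

# Matches lines of the form: key = "value"
VMX_LINE = re.compile(r'^\s*([\w.:]+)\s*=\s*"(.*)"\s*$')


def cpu_hot_add_enabled(vmx_text: str) -> bool:
    """Return True if the VMX text explicitly enables CPU Hot Add.

    A missing key is treated as disabled, matching the default setting.
    """
    for line in vmx_text.splitlines():
        match = VMX_LINE.match(line)
        if match and match.group(1).lower() == HOT_ADD_KEY:
            return match.group(2).strip().lower() == "true"
    return False


# Hypothetical VMX fragment for a large VM (values are made up).
sample_vmx = '''
numvcpus = "24"
memSize = "262144"
vcpu.hotadd = "TRUE"
'''

if cpu_hot_add_enabled(sample_vmx):
    print("CPU Hot Add is enabled: vNUMA will not be exposed to the guest.")
else:
    print("CPU Hot Add is disabled: vNUMA can be exposed to the guest.")
```

A check like this could be run over exported VM configurations to flag large VMs where CPU Hot Add was enabled without a clear need.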
By Dave Jaffe, VMware Performance Engineering
A new white paper is available showing the advantages of running virtualized Spark Deep Learning workloads on Kubernetes.
Recent versions of Spark include support for Kubernetes. For Spark on Kubernetes, the Kubernetes scheduler provides the cluster manager capability provided by Yet Another Resource Negotiator (YARN) in typical Spark on Hadoop clusters. Upon receiving a spark-submit command to start an application, Kubernetes instantiates the requested number of Spark executor pods, each with one or more Spark executors.
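As a rough illustration of what such a submission looks like, the sketch below assembles a `spark-submit` argument list that targets a Kubernetes API server. The `--master k8s://...` form and the `spark.executor.instances` and `spark.kubernetes.container.image` settings come from Spark's Kubernetes support. The API server address, image name, application path, and the helper `build_spark_submit` are hypothetical placeholders, not values from the white paper.

```python
import shlex
from typing import List


def build_spark_submit(api_server: str, image: str, executors: int,
                       app_path: str, app_name: str = "spark-app") -> List[str]:
    """Assemble a spark-submit argument list targeting a Kubernetes cluster."""
    if executors < 1:
        raise ValueError("executors must be at least 1")
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # Kubernetes takes the cluster manager role
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_path,
    ]


# All values below are placeholders for illustration only.
cmd = build_spark_submit(
    api_server="https://k8s.example.local:6443",
    image="registry.example.local/spark:latest",
    executors=16,
    app_path="local:///opt/spark/app/classifier.py",
)
print(shlex.join(cmd))
```

Building the command as a list keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.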
The benefits of running Spark on Kubernetes are many: ease of deployment, resource sharing, simpler coordination between developer and cluster administrator, and enhanced security. We compared a standalone Spark cluster on vSphere virtual machines with a Kubernetes-managed Spark cluster on identically configured vSphere virtual machines using a heavy workload, and found the difference imposed by Kubernetes to be insignificant.
Spark applications running in Standalone mode require that every Spark worker node be installed with the correct version of Spark, Python, Java, etc. This puts a burden on the IT administrator, who may be managing many Spark applications with different requirements, and it requires coordination between the administrator and the application developer. With Kubernetes, the developer only needs to create a container with the correct software, and the IT administrator just needs to manage the cluster using the fine-grained resource management tools to enable the different Spark workloads.
To compare Spark Standalone performance to Spark on Kubernetes performance, a Deep Learning workload, the Maximum Throughput Spark BigDL ResNet50 image classifier from VMware IoT Analytics Benchmark, was run on the same 16 worker nodes, first while configured as Spark worker nodes, then while configured as Kubernetes nodes. Then the number of nodes was reduced by four (by removing the four workers on host 4), and the same comparison was made using 12 nodes, then 8, then 4.
The relative results are shown below. Spark Standalone and Spark on Kubernetes throughput, measured in images classified per second, were within ~1% of each other for all configurations. Performance scaled well for the Spark tests as the number of VMs increased from 4 (1 server) to 16 (4 servers).
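For readers who want to apply the same two comparisons to their own runs, here is a small sketch of the arithmetic: the relative difference between two throughputs, and scaling efficiency relative to the smallest configuration. The throughput numbers are placeholders chosen for illustration. They are not measurements from the paper, and the function names are hypothetical.

```python
def relative_difference(a: float, b: float) -> float:
    """Return |a - b| as a fraction of the larger of the two values."""
    return abs(a - b) / max(a, b)


def scaling_efficiency(base_rate: float, base_nodes: int,
                       rate: float, nodes: int) -> float:
    """Return measured speedup divided by ideal linear speedup."""
    return (rate / base_rate) / (nodes / base_nodes)


# Placeholder throughputs in images/sec, keyed by node count.
# These are NOT results from the white paper.
standalone = {4: 100.0, 16: 380.0}
kubernetes = {4: 99.5, 16: 377.0}

for nodes in sorted(standalone):
    diff = relative_difference(standalone[nodes], kubernetes[nodes])
    print(f"{nodes} nodes: relative difference {diff:.2%}")

eff = scaling_efficiency(standalone[4], 4, standalone[16], 16)
print(f"Standalone scaling efficiency from 4 to 16 nodes: {eff:.0%}")
```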
All details are in the paper.
The VMware Performance team has published an updated paper detailing vCenter Server 6.7 performance in a remote offices and branch offices (ROBO) environment.
Many organizations today have a ROBO environment with local IT infrastructure. These remote locations usually have anywhere from a few servers running a few workloads to support local needs, to numerous servers spanning a large-scale datacenter. The distributed and remote nature of this infrastructure makes it hard to manage, difficult to protect, and costly to maintain. Further, the remote nature of servers makes it more challenging to perform important VM/host-related operations.
vSphere is designed to address these ROBO use cases, including IT infrastructure located in remote, distributed sites. VMware vCenter Server provides a centralized way to control and monitor the virtual infrastructure, including ESXi hosts, virtual machines, storage, and networking resources. It is widely deployed in ROBO environments to manage ESXi hosts that are distributed over large geographical distances and connected by networks with widely varying characteristics, including low or high bandwidth, network latency, and packet error rates. In the paper, we test:
- LAN with high-bandwidth and low-latency links.
- WAN with low-bandwidth and high-latency links.
- Various networks in between; for example, DSL, T1, 4G, 5G, …
We demonstrate that vCenter Server performs well in the ROBO environment in terms of both network bandwidth use and virtual machine and ESXi host task execution times. We observe that network latency, rather than bandwidth, has the bigger impact on overall performance. As the network latency between vCenter Server and ESXi hosts increases, the average operation latency also increases. The experimental results also show how efficiently vCenter Server executes VM operations in high-latency networks: when network latency increases severalfold, the average VM operation execution time increases much more slowly.
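A simple back-of-the-envelope model helps show why latency can matter more than bandwidth for management operations, and why operation time can still grow more slowly than latency itself. The sketch below is our own toy model, not the methodology of the paper. It assumes an operation costs a fixed number of network round trips, a small payload transfer, and some latency-independent processing time. All parameter values are hypothetical.

```python
def operation_time_s(round_trips: int, rtt_ms: float,
                     payload_kb: float, bandwidth_mbps: float,
                     processing_ms: float = 0.0) -> float:
    """Estimate the wall-clock time of one management operation, in seconds."""
    latency_s = round_trips * rtt_ms / 1000.0
    # Kilobytes -> kilobits, divided by kilobits per second.
    transfer_s = (payload_kb * 8.0) / (bandwidth_mbps * 1000.0)
    return latency_s + transfer_s + processing_ms / 1000.0


# Hypothetical operation: 10 round trips, 50 KB payload, 200 ms of processing.
OP = dict(round_trips=10, payload_kb=50.0, processing_ms=200.0)

scenarios = {
    "LAN (1 ms RTT, 1000 Mbps)":
        operation_time_s(rtt_ms=1.0, bandwidth_mbps=1000.0, **OP),
    "High latency (100 ms RTT, 1000 Mbps)":
        operation_time_s(rtt_ms=100.0, bandwidth_mbps=1000.0, **OP),
    "Low bandwidth (1 ms RTT, 1.5 Mbps)":
        operation_time_s(rtt_ms=1.0, bandwidth_mbps=1.5, **OP),
}

for name, seconds in scenarios.items():
    print(f"{name}: {seconds:.3f} s")
```

In this model, the round-trip term dominates when payloads are small, while the fixed processing term keeps total time from growing in proportion to latency. Real operations differ in how many round trips and how much data they need, so the numbers are only indicative.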
By Lan Vu, Uday Kurkure, and Hari Sivaraman
If they need to, data scientists can do their modeling work on a vSphere GPU that is dedicated to a single virtual machine. Certain heavier machine learning workloads may well require that dedicated approach. However, there are also many ML workloads and user types that do not use a dedicated GPU continuously to its maximum capacity. This presents an opportunity for shared use of a physical GPU by more than one virtual machine or user. This article explores the performance of such a shared-GPU setup, supported by the NVIDIA GRID product on vSphere, and presents performance test results showing that sharing is a feasible approach. It also describes the other technical reasons for sharing a GPU among multiple VMs and gives best practices for determining how a GPU may be shared.
VMware vSphere supports NVIDIA GRID technology for multiple types of workloads. This technology virtualizes GPUs via a mediated passthrough mechanism. Initially, NVIDIA GRID supported GPU virtualization for graphics workloads only. Since the introduction of the Pascal GPU architecture, however, NVIDIA GRID has supported GPU virtualization for both graphics and CUDA/machine learning workloads. With this support, multiple VMs running GPU-accelerated workloads like machine learning/deep learning (ML/DL) based on TensorFlow, Keras, Caffe, Theano, Torch, and others can share a single GPU by using a vGPU provided by GRID. This brings benefits in multiple use cases that we discuss in this post.
By Mark Ma
With the release of vSphere 6.7, VMware added iSER (iSCSI Extensions for RDMA) as a natively supported storage protocol in ESXi. Because iSER extends iSCSI to use RDMA, users can boost their vSphere performance just by replacing their regular NICs with RDMA-capable NICs. RDMA (Remote Direct Memory Access) transfers data directly from the memory of one computer to that of another, minimizing CPU/kernel involvement. By bypassing the kernel, we get extremely high I/O bandwidth and low latency. (To use RDMA, you must have an HCA/Host Channel Adapter device on both the source and destination.) In this blog, we compare standard iSCSI performance vs. iSER performance to see how iSER can release the full potential of your iSCSI storage.
By Dave Jaffe
A new white paper is available comparing Spark machine learning performance on an 8-server on-premises cluster vs. a similarly configured VMware Cloud on AWS cluster.
Here is what the VMware Cloud on AWS cluster looked like:
Three standard analytic programs from the Spark machine learning library (MLlib), K-means clustering, Logistic Regression classification, and Random Forest decision trees, were driven using spark-perf. In addition, a new, VMware-developed benchmark, IoT Analytics Benchmark, which models real-time machine learning on Internet-of-Things data streams, was used in the comparison. The benchmark is available from GitHub.
We published a paper that shows how VMware is helping advance PMEM technology by driving the virtualization enhancements in vSphere 6.7. The paper gives a detailed performance analysis of using PMEM technology on vSphere using various workloads and scenarios.
These are the key points that we cover in this white paper:
- We explain how PMEM can be configured and used in a vSphere environment.
- We show how applications with different characteristics can take advantage of PMEM in vSphere. Below are some of the use cases:
  - How PMEM device limits can be achieved under vSphere with little to no virtualization overhead. We show the virtual-to-native ratio along with raw bandwidth and latency numbers from fio, an I/O microbenchmark.
  - How traditional relational databases like Oracle can benefit from using PMEM in vSphere.
  - How scaling out VMs in vSphere can benefit from PMEM. We used Sysbench with MySQL to show such benefits.
  - How modifying applications to be PMEM-aware can get the best performance out of PMEM. We show performance data from such applications, e.g., an OLTP database like SQL Server and an in-memory database like Redis.
  - Using vMotion to migrate VMs with PMEM, which is a host-local device just like NVMe SSDs. We also characterize in detail the vMotion performance of VMs with PMEM.
- We outline some best practices on how to get the most out of PMEM in vSphere.
You’ve probably already heard about VMware Cloud on Amazon Web Services (VMC on AWS). It’s the same vSphere platform that has been running business-critical applications for years, but now it’s available on Amazon’s cloud infrastructure. Following up on the many tests that we have done with Oracle databases on vSphere, I was able to get some time on a VMC on AWS setup to see how Oracle databases perform in this new environment.
It is important to note that VMC on AWS is vSphere running on bare metal servers in Amazon’s infrastructure. The expectation is that performance will be very similar to “regular” onsite vSphere, with the added advantage that the hardware provisioning, software installation, and configuration are already done and the environment is ready to go when you log in. The vCenter interface is the same, except that it references the Amazon instance type for the server.
In a recent blog, the VMware vSphere team shared the following performance improvements in vSphere 6.7 vs. 6.5:
Moreover, with vSphere 6.7 vCSA delivers phenomenal performance improvements (all metrics compared at cluster scale limits, versus vSphere 6.5):
- 2X faster performance in vCenter operations per second
- 3X reduction in memory usage
- 3X faster DRS-related operations (e.g. power-on virtual machine)
As senior engineers within the VMware Performance and vSphere teams, we are writing this blog to provide more details regarding these numbers and to explain how we measured them. We also briefly explain some of the technical details behind these improvements.