vSGA, or Virtual Shared Graphics Acceleration, allows multiple VMware vSphere virtual machines to share hardware GPUs. We have advocated in previous blog articles the use of NVIDIA GRID vGPU technology, and this is a good solution for many use cases. In this blog, we look at the performance of vGPU technology vs. vSGA while limiting our testing to a workload generated by VMware Horizon 7 VDI desktops. Based on our measurements (we present some of that data in this blog) vSGA provides performance very close to vGPU when using a variety of software applications, including Microsoft Office, Adobe Acrobat, CAD viewers, YouTube video, and viewing or working with WebGL-based images.
Leaving CPU Hot Add at its default setting of disabled is one of the performance best practices that we have for large VMs. From the Performance Best Practices Guide for vSphere 6.7 U2:
CPU Hot Add is a feature that allows the addition of vCPUs to a running virtual machine. Enabling this feature, however, disables vNUMA for that virtual machine, resulting in the guest OS seeing a single vNUMA node. Without vNUMA support, the guest OS has no knowledge of the CPU and memory virtual topology of the ESXi host. This in turn could result in the guest OS making sub-optimal scheduling decisions, leading to reduced performance for applications running in large virtual machines. For this reason, enable CPU Hot Add only if you expect to use it. Alternatively, plan to power down the virtual machine before adding vCPUs, or configure the virtual machine with the maximum number of vCPUs that might be needed by the workload. If choosing the latter option, note that unused vCPUs incur a small amount of unnecessary overhead. Unused vCPUs could also cause the guest OS to make poor scheduling decisions within the virtual machine, again with the potential for reduced performance. For additional information see VMware KB article 2040375.
The reason is that enabling CPU Hot Add disables virtual NUMA. The VM is then unaware of which of its vCPUs are on the same NUMA node, which can increase remote memory access. The guest OS and applications lose the ability to optimize based on NUMA, and performance may drop as a result.
Virtual NUMA (vNUMA) exposes NUMA topology to the guest operating system, allowing NUMA-aware guest operating systems and applications to make the most efficient use of the underlying hardware’s NUMA architecture. (For more information about NUMA, see page 27 in the Performance Best Practices Guide for vSphere 6.7 U2.)
To get an idea of the performance impact of enabling CPU Hot Add, we ran a simple test in our lab environment. With CPU Hot Add left at its default setting of disabled, performance was 2% to 8% better than with CPU Hot Add enabled.
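As a rough illustration of how you might audit for this, the sketch below scans the text of a VM's .vmx file for CPU Hot Add on a VM large enough to benefit from vNUMA. It assumes the `numvcpus` and `vcpu.hotadd` keys appear in the file. The function names and the `vnuma_min_vcpus` threshold are our own illustrative choices, so adjust the threshold to match how vNUMA is configured in your environment.

```python
def parse_vmx(text):
    """Parse .vmx-style `key = "value"` lines into a dict with lowercase keys."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def hot_add_defeats_vnuma(settings, vnuma_min_vcpus=9):
    """Return True when CPU Hot Add is enabled on a VM at or above the vCPU threshold.

    vnuma_min_vcpus is an assumed threshold for illustration, not a value
    read from the host.
    """
    hot_add = settings.get("vcpu.hotadd", "FALSE").upper() == "TRUE"
    vcpus = int(settings.get("numvcpus", "1"))
    return hot_add and vcpus >= vnuma_min_vcpus


# Hypothetical .vmx fragment for a 16-vCPU VM with CPU Hot Add enabled.
sample = '''
numvcpus = "16"
vcpu.hotadd = "TRUE"
'''
print(hot_add_defeats_vnuma(parse_vmx(sample)))  # True
```

A VM flagged this way is a candidate for disabling CPU Hot Add, provided you do not actually expect to add vCPUs while it is running.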
By Dave Jaffe, VMware Performance Engineering
A new white paper is available showing the advantages of running virtualized Spark Deep Learning workloads on Kubernetes.
Recent versions of Spark include support for Kubernetes. For Spark on Kubernetes, the Kubernetes scheduler provides the cluster manager capability provided by Yet Another Resource Negotiator (YARN) in typical Spark on Hadoop clusters. Upon receiving a spark-submit command to start an application, Kubernetes instantiates the requested number of Spark executor pods, each with one or more Spark executors.
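To make the spark-submit step concrete, here is a minimal sketch that assembles a command line targeting a Kubernetes cluster. The `--master k8s://...`, `--deploy-mode`, `spark.executor.instances`, and `spark.kubernetes.container.image` options come from Spark's Kubernetes support. The helper name, API server URL, image name, and application path are hypothetical placeholders, not values from the white paper.

```python
def build_spark_submit(api_server, image, executors, app_path, name="spark-app"):
    """Return a spark-submit argument list for a Kubernetes cluster.

    All argument values are supplied by the caller; nothing here is taken
    from the white paper's actual test configuration.
    """
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_path,
    ]


# Hypothetical placeholder values for illustration only.
cmd = build_spark_submit(
    api_server="https://k8s.example.internal:6443",
    image="registry.example.internal/spark:latest",
    executors=16,
    app_path="local:///opt/spark/app/main.py",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` on a machine where spark-submit is installed and has credentials for the cluster.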
The benefits of running Spark on Kubernetes are many: ease of deployment, resource sharing, simplified coordination between developer and cluster administrator, and enhanced security. We compared a standalone Spark cluster on vSphere virtual machines with a Kubernetes-managed Spark cluster running in the same configuration on vSphere virtual machines, using a heavy workload, and found the difference imposed by Kubernetes to be insignificant.
Spark applications running in Standalone mode require that every Spark worker node be installed with the correct version of Spark, Python, Java, etc. This puts a burden on the IT administrator, who may be managing many Spark applications with different requirements, and it requires coordination between the administrator and the application developer. With Kubernetes, the developer only needs to create a container with the correct software, and the IT administrator just needs to manage the cluster using the fine-grained resource management tools to enable the different Spark workloads.
To compare Spark Standalone performance to Spark on Kubernetes performance, a Deep Learning workload, the Maximum Throughput Spark BigDL ResNet50 image classifier from VMware IoT Analytics Benchmark, was run on the same 16 worker nodes, first while configured as Spark worker nodes, then while configured as Kubernetes nodes. Then the number of nodes was reduced by four (by removing the four workers on host 4), and the same comparison was made using 12 nodes, then 8, then 4.
The relative results are shown below. The Spark Standalone and Spark on Kubernetes results, measured in images classified per second, were within ~1% of each other for all configurations. Performance scaled well for the Spark tests as the number of VMs increased from 4 (1 server) to 16 (4 servers).
All details are in the paper.
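If you want to run the same kind of comparison on your own measurements, the two quantities discussed above can be computed as in this small sketch. The function names are ours, and the throughput numbers in the example are made-up placeholders, not results from the paper.

```python
def relative_difference(measured, baseline):
    """Fractional difference between two throughput measurements."""
    return abs(measured - baseline) / baseline


def scaling_efficiency(small_throughput, small_nodes, large_throughput, large_nodes):
    """Ratio of observed speedup to ideal (linear) speedup; 1.0 means perfect scaling."""
    observed_speedup = large_throughput / small_throughput
    ideal_speedup = large_nodes / small_nodes
    return observed_speedup / ideal_speedup


# Placeholder numbers for illustration only (images per second).
standalone_16 = 1000.0
kubernetes_16 = 990.0
standalone_4 = 250.0

diff = relative_difference(kubernetes_16, standalone_16)
efficiency = scaling_efficiency(standalone_4, 4, standalone_16, 16)
print(f"difference: {diff:.1%}, scaling efficiency: {efficiency:.2f}")
```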
By Karthik Ganesan and Jared Rosoff
At VMworld US 2019, VMware announced Project Pacific, an evolution of vSphere into a Kubernetes-native platform. Project Pacific (among other things) introduces a vSphere Supervisor Cluster, which enables you to run Kubernetes pods natively on ESXi (called vSphere Native Pods) with the same level of isolation as virtual machines. At VMworld, we claimed that vSphere Native Pods running on Project Pacific, isolated by the hypervisor, can achieve up to 8% better performance than pods on a bare-metal Linux Kubernetes node. While it may sound a bit counter-intuitive that virtualized performance is better than bare metal, let’s take a deeper look to understand how this is possible.
Why are vSphere Native Pods faster?
This benefit comes primarily from ESXi doing a better job of scheduling the natively run pods on the right CPUs, which improves locality and dramatically reduces the number of remote memory accesses. The ESXi CPU scheduler knows that these pods are independent entities and goes to great lengths to ensure their memory accesses stay within their respective local NUMA (non-uniform memory access) domains. This results in better performance for the workloads running inside these pods and higher overall CPU efficiency. On the other hand, the process scheduler in Linux may not provide the same level of isolation across NUMA domains.
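To see why remote memory accesses matter, consider a toy model in which average memory latency is a weighted mix of local and remote access latencies. This is only a sketch of the reasoning. The function name and the latency values are hypothetical, not measurements of any particular server.

```python
def effective_latency_ns(remote_fraction, local_ns=80.0, remote_ns=140.0):
    """Average memory latency for a given fraction of remote NUMA accesses.

    local_ns and remote_ns are assumed placeholder latencies, not measured values.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Compare a well-localized workload with one that often crosses NUMA nodes.
for fraction in (0.05, 0.5):
    latency = effective_latency_ns(fraction)
    print(f"{fraction:.0%} remote accesses: {latency:.1f} ns average")
```

In this model, every remote access that the scheduler turns into a local one lowers the average directly, which is the effect described above.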
The VMware Performance team has published an updated paper detailing vCenter Server 6.7 performance in a remote offices and branch offices (ROBO) environment.
Many organizations today have a ROBO environment with local IT infrastructure. These remote locations usually have anywhere from a few servers running a few workloads to support local needs, to numerous servers spanning a large-scale datacenter. The distributed and remote nature of this infrastructure makes it hard to manage, difficult to protect, and costly to maintain. Further, the remote nature of servers makes it more challenging to perform important VM/host-related operations.
vSphere is designed to address these ROBO use cases, including IT infrastructure located in remote, distributed sites. VMware vCenter Server provides a centralized way to control and monitor the virtual infrastructure, including ESXi hosts, virtual machines, storage, and networking resources. It has been widely deployed in ROBO environments to manage ESXi hosts that are distributed across large geographical distances and connected over a wide range of networks with different characteristics, including low or high bandwidth, network latency, and packet error rates. In the paper, we test:
- LAN with high-bandwidth and low-latency links.
- WAN with low-bandwidth and high-latency links.
- Various networks in between; for example, DSL, T1, 4G, 5G, …
We demonstrate that vCenter Server performs well in the ROBO environment in terms of both network bandwidth use and virtual machine and ESXi host task execution times. We observe that network latency, rather than bandwidth, has the bigger impact on overall performance. As the network latency between vCenter Server and ESXi hosts increases, the average operation latency also increases. The experimental results also show how efficiently vCenter Server executes VM operations in high-latency networks: the average VM operation execution time grows much more slowly than the network latency itself, even when latency increases several-fold.
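One way to reason about these observations is a toy cost model in which an operation's time is a fixed processing cost, plus a number of network round trips, plus the time to transfer its payload. This is our own simplification and not a description of how vCenter Server communicates with ESXi hosts. The function name and all parameter values are hypothetical.

```python
def operation_time_s(rtt_s, bandwidth_bps, round_trips=10,
                     payload_bytes=50_000, base_s=2.0):
    """Toy model: fixed processing time + round trips + payload transfer time.

    round_trips, payload_bytes, and base_s are assumed placeholder values.
    """
    transfer_s = payload_bytes * 8 / bandwidth_bps
    return base_s + round_trips * rtt_s + transfer_s


low_latency = operation_time_s(rtt_s=0.01, bandwidth_bps=10e6)
high_latency = operation_time_s(rtt_s=0.10, bandwidth_bps=10e6)
half_bandwidth = operation_time_s(rtt_s=0.10, bandwidth_bps=5e6)

print(f"10x latency -> {high_latency / low_latency:.2f}x operation time")
print(f"half bandwidth -> {half_bandwidth / high_latency:.2f}x operation time")
```

With a small payload and a meaningful fixed cost, the model shows both effects: latency matters more than bandwidth, yet the operation time grows far less than the latency does.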
VMmark was originally developed to fill the need for a server consolidation benchmark for a rapidly changing datacenter that was becoming increasingly dominated by virtualization. The design of VMmark, which is a collection of workloads, gives us the ability to quickly change workload parameters to modify the behavior of the entire benchmark. This allows us to use VMmark to exercise technologies that were not available at the time the benchmark was designed. The VMmark 3 run rules provide for academic or research results publication using a modified version of the benchmark.
VMmark 3 was designed in 2015, when the memory size of a typical high-end two-socket server was 768 GB. Each VMmark 3 tile was configured to use 156 GB of memory, allowing multiple tiles to be run on each server. A new technology, Intel Optane DC Persistent Memory, now allows up to 3 TB of memory in a two-socket server, with plans to increase that even further. Testing the performance of this technology with an unmodified version of VMmark 3 wouldn’t be easy, as we’d saturate CPU resources long before we could fully exercise this large amount of memory. Thankfully, the flexible nature of VMmark allows us to modify it to consume significantly more memory with minimal changes in CPU usage.
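The arithmetic behind this is simple. The sketch below counts how many 156 GB tiles fit in a given amount of host memory, treating 3 TB as 3,072 GB and ignoring hypervisor overhead and CPU limits (both simplifying assumptions on our part).

```python
def tiles_that_fit(host_memory_gb, tile_memory_gb=156):
    """Number of whole tiles whose configured memory fits in host memory."""
    return host_memory_gb // tile_memory_gb


print(tiles_that_fit(768))       # 4
print(tiles_that_fit(3 * 1024))  # 19
```

By memory alone, a 3 TB server could hold nearly five times as many default tiles as a 768 GB server, which is why CPU saturates first unless each tile is changed to use more memory.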
The two primary VMmark workloads are Weathervane and DVD Store. Each can be modified to consume more memory. Weathervane, as configured for VMmark 3, uses 14 VMs. Thus, while it would be possible to modify this application, doing so would be a time-consuming process. We therefore decided to look at DVD Store, which uses only four VMs. Most of the work is done in the DVD Store database VM, which was our target for modification.
Determining the best configuration for DVD Store to utilize a larger amount of memory required multiple iterations of testing. We modified one test parameter of the DVD Store workload, and then examined the results to determine the effect on the VMmark tile. We were looking for larger memory usage with a minimal increase in CPU usage so that we could exercise the larger memory configuration without requiring additional CPUs. The following table lists the default configuration and the variables we changed:
| Parameter | Default | Values Tested |
| --- | --- | --- |
| VM Memory Size | 32 GB | 128, 250, and 385 GB |
| Think Time | 1 second | 0.5, 0.9, 1.25, and 1.5 seconds |
| Number of Threads | 24 | 36 and 48 |
| Number of Searches | 3 | 5, 7, and 9 |
| Batch Search Size | 3 | 5, 7, and 9 |
| Database Size | 100 GB | 300 and 500 GB |
The final configuration, which gave the largest increase in memory usage while keeping CPU usage moderate, used a 250 GB DS3DB VM memory size, a 1.5-second think time, and a 300 GB database size. All other parameters were kept at their defaults.
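For readers planning a similar study, here is a minimal sketch of a one-parameter-at-a-time sweep over the values listed in the table above. The dictionary keys and function name are our own labels, not actual DVD Store or VMmark parameter names, and the sketch only generates configurations; it does not run any workload.

```python
# Default values and tested values, taken from the table above.
DEFAULTS = {
    "vm_memory_gb": 32,
    "think_time_s": 1.0,
    "threads": 24,
    "searches": 3,
    "batch_search_size": 3,
    "database_gb": 100,
}
TESTED = {
    "vm_memory_gb": [128, 250, 385],
    "think_time_s": [0.5, 0.9, 1.25, 1.5],
    "threads": [36, 48],
    "searches": [5, 7, 9],
    "batch_search_size": [5, 7, 9],
    "database_gb": [300, 500],
}


def one_at_a_time(defaults, tested):
    """Yield configurations that change exactly one parameter from the defaults."""
    for name, values in tested.items():
        for value in values:
            yield {**defaults, name: value}


configs = list(one_at_a_time(DEFAULTS, TESTED))
print(len(configs))  # 17

# The final "increased memory" configuration described above.
final_config = {**DEFAULTS, "vm_memory_gb": 250, "think_time_s": 1.5, "database_gb": 300}
```

Changing one parameter per run keeps each result attributable to a single cause, at the cost of not exploring interactions between parameters.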
The following table lists the CPU and memory utilization of the default configuration and the “increased memory” configuration.
| Configuration | CPU Utilization | Memory Utilization |
| --- | --- | --- |
| Increased Memory | 24.1 | 350 GB |
We were able to almost triple the memory consumption of a single VMmark tile without increasing the CPU usage. Using this “increased memory” configuration for VMmark we can now see the effect of the additional memory provided by Intel Optane DC Persistent Memory in Memory Mode.
More detailed information about this configuration and the methodology used to refine it can be found in the Intel Optane DC Persistent Memory whitepaper. Detailed instructions to configure VMmark 3 to increase the memory footprint can be obtained by emailing the VMmark team at firstname.lastname@example.org. We encourage you to experiment with VMmark under academic rules for your own studies and to let us know if you have any questions.
Two leadership VMmark benchmark results have been published with AMD EPYC™ Generation 2 processors running VMware vSphere 6.7 Update 3 on a two-node two-socket cluster and a four-node cluster. VMware worked closely with AMD to enable support for AMD EPYC™ Generation 2 in the VMware vSphere 6.7 U3 release.
The VMmark benchmark is a free tool used by hardware vendors and others to measure the performance, scalability, and power consumption of virtualization platforms and has become the standard by which the performance of virtualization platforms is evaluated.
VMmark has been the go-to virtualization benchmark for over 12 years. It’s been used by partners, customers, and internally in a wide variety of technical applications. VMmark1, released in 2007, was the de facto virtualization consolidation benchmark in a time when the overhead and feasibility of virtualization were still largely in question. In 2010, as server consolidation became less of an “if” and more of a “when,” VMmark2 introduced more of the rich vSphere feature set by incorporating infrastructure workloads (VMotion, Storage VMotion, and Clone & Deploy) alongside complex application workloads like DVD Store. Fast forward to 2017, and we released VMmark3, which builds on the previous versions by integrating an easy automation deployment service alongside complex multi-tier modern application workloads like Weathervane. To date, across all generations, we’ve had nearly 300 VMmark result publications (297 at the time of this writing) and countless internal performance studies.
Unsurprisingly, tech industry environments have continued to evolve, and so must the benchmarks we use to measure them. It’s in this vein that the VMware VMmark performance team has begun experimenting with other use cases that don’t quite fit the “traditional” VMmark benchmark. One example of a non-traditional use is Machine Learning and its execution within Kubernetes clusters. At the time of this writing, nearly 9% of the VMworld 2019 US sessions are about ML and Kubernetes. As such, we thought this might be a good time to provide an early teaser to VMmark ML and even point you at a couple of other performance-centric Machine Learning opportunities at VMworld 2019 US.
Although it’s very early in the VMmark ML development cycle, we understand that there’s a need for push-button-easy, vSphere-based Machine Learning performance analysis. As an added bonus, our prototype runs within Kubernetes, which we believe to be well-suited for this type of performance analysis.
Our internal-only VMmark ML prototype is currently streamlined to efficiently perform a limited number of operations very well as we work with partners, customers, and internal teams on how VMmark ML should be exercised. It is able to:
- Rapidly deploy Kubernetes within a vSphere environment.
- Deploy a variety of containerized ML workloads within our newly created VMmark ML Kubernetes cluster.
- Execute these ML workloads either in isolation or concurrently to determine the performance impact of architectural, hardware, and software design decisions.
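The last capability, comparing isolated and concurrent execution, can be sketched in a few lines. This is not the VMmark ML implementation. The function names are ours, and the two workloads are trivial placeholders standing in for containerized ML jobs, which would run in separate pods rather than in Python threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(workload):
    """Run a zero-argument callable and return its elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    workload()
    return time.perf_counter() - start


def run_isolated(workloads):
    """Run each workload on its own, one after another."""
    return {name: timed(fn) for name, fn in workloads.items()}


def run_concurrent(workloads):
    """Start all workloads at once and time each one while the others are running."""
    with ThreadPoolExecutor(max_workers=len(workloads)) as pool:
        futures = {name: pool.submit(timed, fn) for name, fn in workloads.items()}
        return {name: future.result() for name, future in futures.items()}


# Placeholder CPU-bound tasks; real ML workloads would replace these.
workloads = {
    "placeholder_a": lambda: sum(i * i for i in range(200_000)),
    "placeholder_b": lambda: sorted(range(200_000, 0, -1)),
}

isolated = run_isolated(workloads)
concurrent = run_concurrent(workloads)
for name in workloads:
    print(f"{name}: isolated {isolated[name]:.3f}s, concurrent {concurrent[name]:.3f}s")
```

Comparing the two timings per workload gives a first-order view of how much the workloads interfere with each other when they share the same hardware.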
VMmark ML development is still very fluid right now, but we decided to test some of these concepts/assumptions in a “real-world” situation. I’m fortunate to work alongside long-time DVD Store author and Big Data guru Dave Jaffe on VMmark ML. As he and Sr. Technical Marketing Architect Justin Murray were preparing for their VMworld US talk, “High-Performance Virtualized Spark Clusters on Kubernetes for Deep Learning [BCA1563BU]”, we thought this would be a good opportunity to experiment with VMmark ML. Dave was able to use the VMmark ML prototype to deploy a 4-node Kubernetes cluster onto a single vSphere host with a 2nd-Generation Intel® Xeon® Scalable processor (“Cascade Lake”) CPU. VMmark ML then pulled a previously stored Docker container with several MLPerf workloads contained within it. Finally, as a concurrent execution exercise, these workloads were run simultaneously, pushing the CPU utilization of the server above 80%. Additionally, Dave is speaking about vSphere Deep Learning performance in his talk “Optimize Virtualized Deep Learning Performance with New Intel Architectures [MLA1594BU],” where he and Intel Principal Engineer Padma Apparao explore the benefits of Vector Neural Network Instructions (VNNI). I definitely recommend either of these talks if you want a deep dive into the details of VNNI or Spark analysis.
Another great opportunity to learn about VMware Performance team efforts within the Machine Learning space is to attend the Hands-on-Lab Expert-Led Workshop, “Launch Your Machine Learning Workloads in Minutes on VMware vSphere [ELW-2048-01-EMT_U],” or take the accompanying lab. This is being led by another VMmark ML team member, Uday Kurkure, along with Staff Global Solutions Consultant Kenyon Hensler. (Sign up for the Expert-Led Workshop using the VMworld 2019 mobile application or on my.vmworld.com.)
Our goal after VMworld 2019 US is to continue discussions with partners, customers, and internal teams about how a benchmark like VMmark ML would be most useful. We also hope to complete our integration of Spark within Kubernetes on vSphere and reproduce some of the performance analysis done to date. Stay tuned to the performance blog for additional posts and details as they become available.
I’m excited to announce that the “Extreme Performance Series” is back for its 7th year with 14 sessions created and presented by VMware’s best and most distinguished performance engineers, principals, architects, and gurus. You do not want to miss this year’s program, as it’s chock full of advanced content, practical advice, and exciting technical details!
Spread across 4 different VMworld tracks, you’ll find these sessions full of performance details that you won’t get anywhere else at VMworld. They’ll also be recorded, so if the sessions you want to see aren’t being hosted in your region, you’ll still get access to them.
Along with the recent release of VMware vSphere 6.7 U2, we published a new whitepaper that shows the performance of a new scheduler option that was included in the 6.7 U2 update. We referred to this new scheduler option internally as the “sibling” scheduler, but the official name is the side-channel aware scheduler version 2, or SCAv2. The whitepaper includes full details about SCAv1 and SCAv2, the L1TF security vulnerability that made them necessary, and the performance implications with several different workload types. This blog is a brief overview of the key points, but we recommend that you check out the full document.
In August of 2018, a security vulnerability known as L1TF, affecting systems using Intel processors, was revealed, and patches and remediations were made available at the same time. Intel provided microcode updates for its processors, operating system patches were made available, and VMware provided an update for vSphere. The full details of the vCenter and ESXi patches are in a VMware security advisory that links to individual KB articles.