vSGA, or Virtual Shared Graphics Acceleration, allows multiple VMware vSphere virtual machines to share hardware GPUs. In previous blog articles we have advocated the use of NVIDIA GRID vGPU technology, which is a good solution for many use cases. In this blog, we look at the performance of vGPU technology vs. vSGA, limiting our testing to a workload generated by VMware Horizon 7 VDI desktops. Based on our measurements (we present some of that data in this blog), vSGA provides performance very close to vGPU across a variety of software applications, including Microsoft Office, Adobe Acrobat, CAD viewers, YouTube video, and viewing or working with WebGL-based images.
Leaving CPU Hot Add at its default setting of disabled is one of our performance best practices for large VMs. From the Performance Best Practices Guide for vSphere 6.7 U2:
CPU Hot Add is a feature that allows the addition of vCPUs to a running virtual machine. Enabling this feature, however, disables vNUMA for that virtual machine, resulting in the guest OS seeing a single vNUMA node. Without vNUMA support, the guest OS has no knowledge of the CPU and memory virtual topology of the ESXi host. This in turn could result in the guest OS making sub-optimal scheduling decisions, leading to reduced performance for applications running in large virtual machines. For this reason, enable CPU Hot Add only if you expect to use it. Alternatively, plan to power down the virtual machine before adding vCPUs, or configure the virtual machine with the maximum number of vCPUs that might be needed by the workload. If choosing the latter option, note that unused vCPUs incur a small amount of unnecessary overhead. Unused vCPUs could also cause the guest OS to make poor scheduling decisions within the virtual machine, again with the potential for reduced performance. For additional information see VMware KB article 2040375.
The reason for this is that enabling CPU Hot Add disables virtual NUMA. The VM is then unaware of which of its vCPUs share a NUMA node, which can increase remote memory accesses. This removes the ability of the guest OS and applications to optimize based on NUMA and can result in reduced performance.
Virtual NUMA (vNUMA) exposes NUMA topology to the guest operating system, allowing NUMA-aware guest operating systems and applications to make the most efficient use of the underlying hardware’s NUMA architecture. (For more information about NUMA, see page 27 in the Performance Best Practices Guide for vSphere 6.7 U2.)
To get an idea of the performance impact of enabling CPU Hot Add, we ran a simple test in our lab environment. With CPU Hot Add left at its default setting of disabled, performance was 2% to 8% better than with CPU Hot Add enabled.
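If you want to check this setting on an existing VM, PowerCLI makes it easy. Below is a minimal sketch, assuming a hypothetical VM named "BigVM"; note that the VM must be powered off before the flag can be changed:

$vm = Get-VM -Name "BigVM"
# $true means CPU Hot Add is enabled (and vNUMA is therefore disabled)
$vm.ExtensionData.Config.CpuHotAddEnabled
# With the VM powered off, clear the flag:
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.CpuHotAddEnabled = $false
$vm.ExtensionData.ReconfigVM($spec)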
By Dave Jaffe, VMware Performance Engineering
A new white paper is available showing the advantages of running virtualized Spark Deep Learning workloads on Kubernetes.
Recent versions of Spark include support for Kubernetes. For Spark on Kubernetes, the Kubernetes scheduler takes on the cluster manager role that Yet Another Resource Negotiator (YARN) plays in typical Spark on Hadoop clusters. Upon receiving a spark-submit command to start an application, Kubernetes instantiates the requested number of Spark executor pods, each with one or more Spark executors.
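For example, a spark-submit command targeting Kubernetes looks like the following minimal sketch (the API server address, container image, class name, and jar path here are hypothetical):

spark-submit \
  --master k8s://https://k8s-api.example.com:6443 \
  --deploy-mode cluster \
  --conf spark.executor.instances=16 \
  --conf spark.kubernetes.container.image=example/spark-app:latest \
  --class com.example.SparkApp \
  local:///opt/spark/jars/spark-app.jar

The spark.executor.instances setting controls how many executor pods Kubernetes creates for the application.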
The benefits of running Spark on Kubernetes are many: ease of deployment, resource sharing, simpler coordination between developer and cluster administrator, and enhanced security. A standalone Spark cluster on vSphere virtual machines was compared for performance against a Kubernetes-managed Spark cluster on vSphere virtual machines in the same configuration, using a heavy workload, and the difference imposed by Kubernetes was found to be insignificant.
Spark applications running in Standalone mode require that every Spark worker node be installed with the correct version of Spark, Python, Java, etc. This puts a burden on the IT administrator, who may be managing many Spark applications with different requirements, and it requires coordination between the administrator and the application developer. With Kubernetes, the developer only needs to create a container with the correct software, and the IT administrator just needs to manage the cluster, using its fine-grained resource management tools to enable the different Spark workloads.
To compare Spark Standalone performance to Spark on Kubernetes performance, a Deep Learning workload, the Maximum Throughput Spark BigDL ResNet50 image classifier from VMware IoT Analytics Benchmark, was run on the same 16 worker nodes, first while configured as Spark worker nodes, then while configured as Kubernetes nodes. Then the number of nodes was reduced by four (by removing the four workers on host 4), and the same comparison was made using 12 nodes, then 8, then 4.
The relative results are shown below. The Spark Standalone and Spark on Kubernetes performance in terms of images per second classified was within ~1% of each other for all configurations. Performance scaled well for the Spark tests as the number of VMs increased from 4 (1 server) to 16 (4 servers).
All details are in the paper.
By Karthik Ganesan and Jared Rosoff
At VMworld US 2019, VMware announced Project Pacific, an evolution of vSphere into a Kubernetes-native platform. Project Pacific (among other things) introduces a vSphere Supervisor Cluster, which enables you to run Kubernetes pods natively on ESXi (called vSphere Native Pods) with the same level of isolation as virtual machines. At VMworld, we claimed that vSphere Native Pods running on Project Pacific, isolated by the hypervisor, can achieve up to 8% better performance than pods on a bare-metal Linux Kubernetes node. While it may sound a bit counter-intuitive that virtualized performance is better than bare metal, let’s take a deeper look to understand how this is possible.
Why are vSphere Native Pods faster?
This benefit primarily comes from ESXi doing a better job of scheduling the natively run pods on the right CPUs, providing better locality and dramatically reducing the number of remote memory accesses. The ESXi CPU scheduler knows that these pods are independent entities and takes great care to ensure their memory accesses fall within their respective local NUMA (non-uniform memory access) domain. This results in better performance for the workloads running inside these pods and higher overall CPU efficiency. The process scheduler in Linux, on the other hand, may not provide the same level of isolation across NUMA domains.
The VMware Performance team has published an updated paper detailing vCenter Server 6.7 performance in a remote office and branch office (ROBO) environment.
Many organizations today have a ROBO environment with local IT infrastructure. These remote locations usually have anywhere from a few servers running a few workloads to support local needs, to numerous servers spanning a large-scale datacenter. The distributed and remote nature of this infrastructure makes it hard to manage, difficult to protect, and costly to maintain. Further, the remote nature of servers makes it more challenging to perform important VM/host-related operations.
vSphere is designed to address these ROBO use cases, including IT infrastructure located in remote, distributed sites. VMware vCenter Server provides a centralized way to control and monitor the virtual infrastructure, including ESXi hosts, virtual machines, storage, and networking resources. It has been widely deployed in ROBO environments to manage ESXi hosts that are distributed over large geographical distances and connected by networks with widely varying characteristics, including bandwidth, latency, and packet error rates. In the paper, we test:
- LAN with high-bandwidth and low-latency links.
- WAN with low-bandwidth and high-latency links.
- Various networks in between; for example, DSL, T1, 4G, 5G, …
We demonstrate that vCenter Server performs well in the ROBO environment, both in network bandwidth use and in virtual machine and ESXi host task execution times. We observe that network latency, rather than bandwidth restrictions, has the bigger impact on overall performance: as the network latency between vCenter Server and ESXi hosts increases, the average operation latency also increases. The experimental results also show how efficiently vCenter Server executes VM operations over high-latency networks: even when network latency increases several-fold, the average VM operation execution time grows much more slowly.
VMmark was originally developed to fill the need for a server consolidation benchmark for a rapidly changing datacenter that was becoming increasingly dominated by virtualization. The design of VMmark, which is a collection of workloads, gives us the ability to quickly change workload parameters to modify the behavior of the entire benchmark. This allows us to use VMmark to exercise technologies that were not available at the time the benchmark was designed. The VMmark 3 run rules provide for academic or research results publication using a modified version of the benchmark.
VMmark 3 was designed in 2015, when the memory size of a typical high-end two-socket server was 768 GB. Each VMmark 3 tile was configured to use 156 GB of memory, allowing multiple tiles to be run on each server. A new technology, Intel Optane DC Persistent Memory, now allows up to 3 TB of memory in a two-socket server, with plans to increase that even further. Testing the performance of this technology with an unmodified version of VMmark 3 wouldn't be easy: we'd saturate CPU resources long before we could fully exercise this large amount of memory. Thankfully, the flexible nature of VMmark allows us to modify it to consume significantly more memory with minimal changes in CPU usage.
The two primary VMmark workloads are Weathervane and DVD Store, and each can be modified to consume more memory. Weathervane, as configured for VMmark 3, uses 14 VMs, so while it would be possible to modify this application, doing so would be time consuming. We therefore decided to look at DVD Store, which uses only four VMs. Most of the work is done in the DVD Store database VM, which was our target for modification.
Determining the best configuration for DVD Store to utilize a larger amount of memory required multiple iterations of testing. We modified one test parameter of the DVD Store workload, and then examined the results to determine the effect on the VMmark tile. We were looking for larger memory usage with a minimal increase in CPU usage so that we could exercise the larger memory configuration without requiring additional CPUs. The following table lists the default configuration and the variables we changed:
| Parameter | Default | Values Tested |
|---|---|---|
| VM Memory Size | 32 GB | 128, 250, and 385 GB |
| Think Time | 1 second | 0.5, 0.9, 1.25, and 1.5 seconds |
| Number of Threads | 24 | 36 and 48 |
| Number of Searches | 3 | 5, 7, and 9 |
| Batch Search Size | 3 | 5, 7, and 9 |
| Database Size | 100 GB | 300 and 500 GB |
The final configuration, which we determined gave the largest increase in memory usage while keeping CPU usage moderate, was a 250 GB DS3DB VM memory size, a 1.5-second think time, and a 300 GB database size. All other parameters were kept at their defaults.
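As an illustration, these parameters map onto DVD Store driver options. A hypothetical invocation with our final values might look like the line below (option names are based on our recollection of the DS3 web driver and may differ in your copy of the benchmark; the target host name is made up):

ds3webdriver.exe --target=ds3db-vm --db_size=300GB --think_time=1.5 --n_threads=24 --n_searches=3 --search_batch_size=3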
The following table lists the CPU and memory utilization of the default configuration and the “increased memory” configuration.
| Configuration | CPU Utilization | Memory Utilization |
|---|---|---|
| Increased Memory | 24.1% | 350 GB |
We were able to almost triple the memory consumption of a single VMmark tile without increasing the CPU usage. Using this “increased memory” configuration for VMmark we can now see the effect of the additional memory provided by Intel Optane DC Persistent Memory in Memory Mode.
More detailed information about this configuration and the methodology used to refine it can be found in the Intel Optane DC Persistent Memory whitepaper. Detailed instructions to configure VMmark 3 to increase the memory footprint can be obtained by emailing the VMmark team at firstname.lastname@example.org. We encourage you to experiment with VMmark under academic rules for your own studies and to let us know if you have any questions.
Two leadership VMmark benchmark results have been published with AMD EPYC™ Generation 2 processors running VMware vSphere 6.7 Update 3 on a two-node two-socket cluster and a four-node cluster. VMware worked closely with AMD to enable support for AMD EPYC™ Generation 2 in the VMware vSphere 6.7 U3 release.
The VMmark benchmark is a free tool used by hardware vendors and others to measure the performance, scalability, and power consumption of virtualization platforms and has become the standard by which the performance of virtualization platforms is evaluated.
I’m excited to announce that the “Extreme Performance Series” is back for its 7th year with 14 sessions created and presented by VMware’s best and most distinguished performance engineers, principals, architects, and gurus. You do not want to miss this year’s program, as it’s chock full of advanced content, practical advice, and exciting technical details!
Spread across 4 different VMworld tracks, these sessions are full of performance details that you won’t get anywhere else at VMworld. They’ll also be recorded, so if the sessions you want to see aren’t being hosted in your region, you’ll still have access to them.
vSphere 5.1 introduced an inventory tagging feature that has been available in all later versions of vSphere, including vSphere 6.7. Tags let datacenter administrators organize different vSphere objects like datastores, virtual machines, hosts, and so on. This makes it easier to sort and search for objects that share a tag, among other things. For example, you might use tags to track a group of VMs that all have the same operating system.
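For instance, here is a minimal PowerCLI sketch of that operating system example (the category, tag, and VM name pattern are hypothetical):

# Create a category for VMs that allows one tag per object, then a tag in it
New-TagCategory -Name "OS" -Cardinality Single -EntityType VirtualMachine
$tag = New-Tag -Name "Windows10" -Category "OS"
# Assign the tag to every matching VM
Get-VM -Name "win10-*" | New-TagAssignment -Tag $tag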
Writing code to use tags can be challenging in large-scale environments: a straightforward use of VMware PowerCLI cmdlets may result in poor performance, and while direct Tagging Service APIs are faster, the documentation can be difficult to understand. In this blog, we show some practical examples of using PowerCLI and Tagging Service APIs to perform tag-related operations. We include some simple measurements to show the performance improvements when using the Tagging Service vs. cmdlets. The sample performance numbers are for illustrative purposes only. We describe the test setup in the Appendix.
Update: There is also a white paper on this topic, which references this blog. See VMware vSphere 6.7 Tagging Best Practices.
1. Connecting to PowerCLI and the Tagging Service
In this document, when we write “PowerCLI cmdlets,” we mean calls like Get-Tag or Get-TagCategory. To access this API, simply open a PowerShell terminal and log in:
Connect-VIServer <vCenter server IP or FQDN> -User <username> -Pass <password>
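To call the Tagging Service APIs directly, you additionally connect to the vSphere Automation endpoint and retrieve the tagging service object. A minimal sketch (the service name is as we recall it; check the PowerCLI documentation for your version):

Connect-CisServer <vCenter server IP or FQDN> -User <username> -Password <password>
$tagSvc = Get-CisService -Name "com.vmware.cis.tagging.tag"
# list() returns the IDs of all tags; at scale this is much faster than Get-Tag
$allTagIds = $tagSvc.list()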
Along with the recent release of VMware vSphere 6.7 U2, we published a new whitepaper that shows the performance of a new scheduler option that was included in the 6.7 U2 update. We referred to this new scheduler option internally as the “sibling” scheduler, but the official name is the side-channel aware scheduler version 2, or SCAv2. The whitepaper includes full details about SCAv1 and SCAv2, the L1TF security vulnerability that made them necessary, and the performance implications with several different workload types. This blog is a brief overview of the key points, but we recommend that you check out the full document.
In August 2018, a security vulnerability known as L1TF, affecting systems that use Intel processors, was revealed, and patches and remediations were made available. Intel provided microcode updates for its processors, operating system patches were made available, and VMware provided an update for vSphere. The full details of the vCenter and ESXi patches are in a VMware security advisory that links to individual KB articles.
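If you decide to enable SCAv2 after reading the paper, the scheduler is selected through ESXi kernel boot options. Here is a hedged PowerCLI sketch (setting names as given in VMware KB 55806; verify them for your build, and note that the host must be rebooted for the change to take effect):

$esx = Get-VMHost -Name "esxi-host-01"  # hypothetical host name
# TRUE turns on the side-channel aware scheduler
Get-AdvancedSetting -Entity $esx -Name "VMkernel.Boot.hyperthreadingMitigation" | Set-AdvancedSetting -Value $true -Confirm:$false
# FALSE selects SCAv2 (the "sibling" scheduler); TRUE would select SCAv1
Get-AdvancedSetting -Entity $esx -Name "VMkernel.Boot.hyperthreadingMitigationIntraVM" | Set-AdvancedSetting -Value $false -Confirm:$false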