
First VMmark 3.1 Publications, Featuring New Cascade Lake Processors

VMmark is a free tool used by hardware vendors and others to measure the performance, scalability, and power consumption of virtualization platforms.  If you’re unfamiliar with VMmark 3.x, each tile is a grouping of 19 virtual machines (VMs) simultaneously running diverse workloads commonly found in today’s data centers, including a scalable Web simulation, an E-commerce simulation (with backend database VMs), and standby/idle VMs.

As Joshua mentioned in a recent blog post, we released VMmark 3.1 in February, adding support for persistent memory, improving workload scalability, and better reflecting secure customer environments by increasing side-channel vulnerability mitigation requirements.

I’m happy to announce that today we published the first VMmark 3.1 results.  These results were obtained on systems meeting our industry-leading side-channel-aware mitigation requirements, thus continuing the benchmark’s ability to provide an indication of real-world performance.

Some mitigations for recently-discovered side-channel vulnerabilities (i.e., Spectre, Meltdown, and L1TF) incur significant performance impacts, while others have little or no impact.  Today’s VMmark results demonstrate that even when additional mitigations are in place, ESXi hosts using the new 2nd-Generation Intel® Xeon® Scalable processors obtain higher VMmark scores than comparable 1st-Generation Intel Xeon Scalable processors.  This is due to processor design improvements that reduce (or even negate) the performance impact of security mitigations, by mitigating some of the security vulnerabilities in hardware rather than in software.

These results, from Fujitsu, span all three VMmark publication categories:

  1. Performance Only (9.02 @ 9 tiles)
  2. Performance with Server Power (6.3290 @ 9 tiles)
  3. Performance with Server and Storage Power (3.5013 @ 9 tiles)

So, how does this new performance result with Cascade Lake processors compare to the previous generation with Skylake processors?  Hopefully a graph is worth a thousand words 😊…

Fujitsu Skylake to Cascade Lake Graph

As you can see, Fujitsu was able to achieve a higher score while running an additional tile (19 more VMs) and still meeting the strict Quality-of-Service (QoS) compliance requirements imposed by the VMmark benchmark harness.

Industry-Leading Side-Channel Mitigation Requirements
Given the numerous security vulnerabilities recently identified, we set a high bar in VMmark 3.1 that requires all applicable security mitigations in benchmarked environments to best represent secure, real-world customer environments.

These are the current security mitigation requirements for VMmark 3.1:

VMmark 3.1 Security Mitigations Table

Note: If “N/A” is listed, that vulnerability does not apply to that portion of the stack.

For more information about VMmark, please visit the VMmark product page.

If you have any questions or feedback, please leave us a comment below.  Thanks!

IoT Analytics Benchmark adds neural network–based deep learning with Keras and BigDL

The IoT Analytics Benchmark released last year dealt with an important Internet of Things use case—monitoring factory sensor data for impending failure conditions. This year, we are tackling an equally important use case—image classification. Whether used in facial recognition, license plate readers, inspection systems, or autonomous vehicles, neural network–based deep learning is making image detection and classification a viable technology.

As in the classic machine learning used in the original IoT Analytics Benchmark code (which used the Spark Machine Learning Library), the new deep learning code first trains a model using pre-labeled images and then deploys that model to infer the classification of new images. For IoT this inference step is the most important. Thus, the new programs, designated as IoT Analytics Benchmark DL, use previously trained models (included in the kit) to demonstrate inferencing that can be performed at the edge (on small gateway systems) or in scaled-out Spark clusters.

The programs run Keras and Intel’s BigDL image classifiers with the CIFAR10 image set. For each type of classifier, there is a program that sends the images as a series of encoded strings and a second program that reads those strings, converts them back to images, and infers which of the 10 CIFAR10 classes each image belongs to. The Keras classifier is a Python-based, single-node program for running on an IoT edge gateway; the BigDL classifier is a Spark-based distributed program that uses Intel’s BigDL library. (For background on CIFAR10, also see Learning Multiple Layers of Features from Tiny Images, by Alex Krizhevsky.)
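The kit’s programs themselves aren’t reproduced here, but the following minimal Python sketch illustrates the inference pattern just described: load a pre-trained Keras model, read encoded images line by line from standard input (as piped in by nc), and classify each one. The line format, base64 encoding, and preprocessing are illustrative assumptions, not the benchmark’s actual implementation.

# Illustrative sketch of the inference side (not the benchmark's actual code).
# Assumed line format: "<label>,<base64 of a 32x32x3 uint8 image>", one image per line.
import base64
import sys

import numpy as np
from tensorflow.keras.models import load_model

model = load_model("cifar10_ResNet20v1_model_91470.h5")  # pre-trained model from the kit
correct = total = 0

for line in sys.stdin:                     # fed by: nc -lk 10000 | python3 infer_sketch.py
    line = line.strip()
    if not line:
        continue
    label, encoded = line.split(",", 1)
    pixels = np.frombuffer(base64.b64decode(encoded), dtype=np.uint8)
    image = pixels.reshape(1, 32, 32, 3).astype("float32") / 255.0   # preprocessing assumed
    predicted = int(np.argmax(model.predict(image)))
    total += 1
    correct += int(predicted == int(label))

print(f"Inferred {total} images, {100.0 * correct / total:.1f}% correctly classified")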

The CIFAR10 image set consists of 50,000 pre-labeled training images and 10,000 pre-labeled test images. Each image is a 32 x 32 color image from one of ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, or truck. For example, here’s a ship, frog, and truck:

[Image: example CIFAR10 ship, frog, and truck]

Here’s what the Python-based Keras program looks like running a complex ResNet model on a small, virtualized edge gateway system:

First, the inference program is started on the VM on the edge gateway using a pre-trained ResNet model included in the kit:

[root@iotdemo ~]# nc -lk 10000 | python3 infer_cifar.py --modelPath cifar10_ResNet20v1_model_91470.h5
Using TensorFlow backend.
Loaded trained model cifar10_ResNet20v1_model_91470.h5
Start send program
2019-01-31T04:09:37Z: 100 images classified
...
2019-01-31T04:11:06Z: 1000 images classified
Inferenced 1000 images in 99.3 seconds or 10.1 images/second, with 916 or 91.6% correctly classified

Then, when the inference program prints out “Start send program”, the send program is started from a driver system, in this case the author’s Mac:

[djaffe@djaffe-a01 ~/code/neuralnetworks/BigDL]$ python3 send_images_cifar.py -s -i 100 -t 1000 | \
  nc 192.168.2.3 10000
Using TensorFlow backend.
2019-01-31T04:09:12Z: Loading and normalizing the CIFAR10 data
2019-01-31T04:09:22Z: Sending 100 images per second for a total of 1000 images with pixel mean
subtracted
2019-01-31T04:09:31Z: 100 images sent
...
2019-01-31T04:11:00Z: 1000 images sent
2019-01-31T04:11:00Z: Image stream ended
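For completeness, here is a comparable sketch of the send side: it loads the CIFAR10 test set through Keras, base64-encodes each image, and writes one image per line to standard output at a rough target rate so it can be piped into nc as shown above. The -i and -t options mirror the command line above; the line format and pacing logic are assumptions chosen to match the inference sketch earlier, not the actual send_images_cifar.py.

# Illustrative sketch of the send side (not the actual send_images_cifar.py).
import argparse
import base64
import sys
import time

from tensorflow.keras.datasets import cifar10

parser = argparse.ArgumentParser()
parser.add_argument("-i", type=int, default=100, help="images per second")
parser.add_argument("-t", type=int, default=1000, help="total images to send")
args = parser.parse_args()

(_, _), (x_test, y_test) = cifar10.load_data()   # 10,000 pre-labeled 32x32 test images

for n in range(args.t):
    image, label = x_test[n], int(y_test[n][0])
    encoded = base64.b64encode(image.astype("uint8").tobytes()).decode("ascii")
    sys.stdout.write(f"{label},{encoded}\n")
    if (n + 1) % args.i == 0:
        sys.stdout.flush()
        time.sleep(1.0)                          # crude pacing: roughly -i images per second

print("Image stream ended", file=sys.stderr)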

We are planning to use the new workloads in several VMware projects. As always, please send us your feedback and contributions!

VMmark 3.1 Released

It is my great pleasure to announce that VMmark 3.1 is generally available as of February 7, 2019!

What’s New?

This release adds support for persistent memory, improves workload scalability, and better reflects secure customer environments by increasing side-channel vulnerability mitigation requirements.

Visit our main VMmark HTML page for more information.

Please note that VMmark 3.0 will reach end of life on March 15, 2019.

To learn more about VMmark 3, see the introductory blog article here.

vMotion across hybrid cloud: performance and best practices

VMware Cloud on AWS is a hybrid cloud service that runs the VMware software-defined data center (SDDC) stack in the Amazon Web Services (AWS) public cloud. The service automatically provisions and deploys a vSphere environment on a bare-metal AWS infrastructure, and lets you run your applications in a hybrid IT environment across your on-premises data centers and AWS global infrastructure. A key benefit of VMware Cloud on AWS is the ability to vMotion workloads back and forth from your on-premises data center to the AWS public cloud as capacity and data privacy require.

In this blog post, we share the results of our vMotion performance tests across our hybrid cloud environment that consisted of a vSphere on-premises data center located in Wenatchee, Washington and an SDDC hosted in an AWS cloud, in various scenarios including hybrid migration of a database server. We also describe the best practices to follow when migrating virtual machines by vMotion across hybrid cloud.

Test configuration

We set up the hybrid cloud environment with the following specifications:

VMware Cloud on AWS

  • 1-host SDDC instance with Amazon EC2 i3.metal (Intel Xeon E5-2686 @ 2.3 GHz, 36 cores, 512 GB)
  • SDDC version: vmc1.6 (M6 – Cycle 17)
  • Auto-provisioned with NSX networking and VSAN storage

On-premises host

  • Dell PowerEdge R730 (Intel Xeon E5-2699 v4 @ 2.2GHz, 22 cores, 1 TB memory)
  • ESXi and vCenter version: 6.7
  • Storage: Dell NVMe, VMFS 5 volume
  • Networking: Intel 1GbE NIC (shared 2*10GbE DX links between on-prem and AWS)

Figure 1: Logical layout of the hybrid cloud setup

Figure 1 illustrates the logical layout of our hybrid cloud environment. We deployed a single-host SDDC instance on AWS cloud. The SDDC was the latest M6 version and auto-configured with vSAN storage and NSX networking. Our on-premises data center, located in Washington state, featured hosts running ESXi 6.7.

AWS Direct Connect

We used high-speed AWS Direct Connect links for connectivity between the VMware on-premises data center and the AWS Oregon region. AWS Direct Connect provides a leased line from the AWS environment to the on-premises data center. VMware recommends you use this type of link because it guarantees sustained bandwidth during vMotion, which isn’t possible with VPN internet connections. In our environment, there was about 40 milliseconds of round-trip latency on the network.

L2 VPN tunnel

We set up a secure L2 VPN tunnel for the compute traffic that spanned the two vCenters. This connected the VMs on cloud and on-premises to the same address space (IP subnet). So, the VMs remained on the same subnet and kept their IP addresses even as we migrated them from on-premises to cloud and vice versa.

Figure 2: Extending VXLAN across on-premises and cloud using L2 VPN

As shown in figure 2, two NSX Edge VMs provided VPN capabilities and the bridge between the overlay world (VXLAN logical networks) and the physical infrastructure (IP networks). Each NSX Edge VM was equipped with two virtual interfaces (vNICs): one vNIC was used as an uplink to the physical network, and the second vNIC was used as the VXLAN trunk interface.

Hybrid linked mode

Figure 3: A single console to manage resources across on-premises and cloud environments

We created a hybrid linked mode between the cloud vCenter and the on-premises vCenter. This allowed us to use a single console to manage all our inventory across the hybrid cloud. As shown in Figure 3, the cloud inventory included a single Client-VM provisioned in the compute workload resource pool, and the on-premises inventory included three VMs: an NSX Edge VM, a Client-VM, and a Server VM.

Measuring vMotion performance

The following metrics were used to understand the performance implications of vMotion:

  • Migration time: Total time taken for migration to complete
  • Switch-over time: Time during which the VM is quiesced to enable switchover from on-premises to cloud, and vice versa
  • Guest penalty: Performance impact on the applications running inside the VM during and after the migration

Benchmark methodology

We investigated the impact of hybrid vMotion on a Microsoft SQL Server database performance using the open-source DVD Store 3 (DS3) benchmark, which simulates many customers performing typical actions in an online DVD Store (logging in, browsing, buying, reviewing, and so on).

The test scenario used a Windows Server 2012 VM configured with 8 vCPUs, 8 GB memory, 40 GB disk, and a SQL Server database size of 5 GB. As shown in figures 2 and 3, we used two concurrent DS3 clients, one client running on-premises, and a second client running on the cloud. Each client used a load of five DS3 users with 0.02 seconds of think time. We started the migration during the steady-state period of the benchmark when the CPU utilization (esxtop %USED counter) of the SQL Server VM was close to 275%, and the average write IOPS was 80.

Test results

Figure 4: SQL Server throughput at given time: before, during, and after hybrid vMotions

Figure 4 plots the performance of a SQL Server VM in total orders processed per second during vMotion from on-premises to cloud, and vice versa. In our tests, both DS3 benchmark drivers were configured to report the performance data at a fine granularity of 1 second (the default is 10 seconds). As shown in figure 4, the impact on SQL Server throughput was minimal during vMotion in both directions. The total throughput remained steady at around 75 orders per second throughout the test period. The vMotion durations from on-premises to cloud and vice versa were 415 seconds and 382 seconds, respectively, with the network throughput ranging between 500 and 900 megabits per second (Mbps). The switch-over time was about 0.6 seconds in both vMotions. The few minor dips in throughput shown in the figure were due to the variance in available network bandwidth on the shared AWS Direct Connect link.

Figure 5: Breakdown of SQL Server throughput reported by the on-premises and cloud clients

Figure 5 illustrates the impact of network latency on the throughput. While the total SQL Server throughput remained steady during the entire test period, the throughput reported by the on-premises and cloud clients varied based on their proximity to the SQL Server VM. For instance, the throughput reported by the on-premises client dropped from 65 orders per second to 10 when the SQL Server VM was migrated to the cloud, and jumped back to 65 after the SQL Server VM was migrated back to the on-premises environment.

The throughput variation seen by the two DS3 clients is not unique to our hybrid cloud environment and can be explained by Little’s Law.

Little’s Law

In queueing theory, Little’s Law theorem states that the average number (L) of customers in a stable system is equal to the average arrival rate (λ) multiplied by the average time (W) that a customer spends in the system. Expressed algebraically, the law is: L = λ × W
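Applied to our setup, the same relation can be rearranged as λ = L / W: each DS3 client keeps a fixed number of users in the system (L = 5 per client), and W is the time one order spends in the loop, roughly the sum of the client think time (0.02 seconds), the SQL Server processing time, and the network round trips between that client and the SQL Server VM. The exact decomposition is shown in figure 6; the point to note is that with L fixed, any growth in the network component of W (the roughly 40 milliseconds of round-trip latency to the remote site) directly lowers that client’s throughput λ, which is why the client farther from the SQL Server VM reports the smaller share of the total.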

Figure 6: Little’s Law applicability in hybrid cloud performance testing

Figure 6 shows how Little’s Law can be applied to our hybrid cloud environment to relate the DS3 users, SQL Server throughput, SQL Server processing time, and the network latency. The formula derived in figure 6 explains the impact of the network latency on the throughput (orders per second) when the benchmark load (DS3 users) is fixed. It should be noted, however, that although the throughput reported by both clients varied due to the network latency, the aggregate throughput remained constant. This is because the throughput decrease seen by one client is offset by the throughput increase seen by the other client.

This illustrates how important it is for you to monitor your application dependencies when you migrate workloads to and from the cloud. For example, if your database VM depends on a Java application server VM, you should consider migrating both VMs together; otherwise, the overall application throughput will suffer due to slow responses and timeouts.

One way to monitor your application dependencies is to use VMware vRealize Network Insight, which can mitigate business risk by mapping application dependencies in both private and hybrid cloud environments.

vMotion Stun During Page Send (SDPS)

We also tested vMotion performance by doubling the intensity of the DS3 workload on both the on-premises and cloud clients. Although vMotion succeeded, vmkernel logs indicated that vMotion SDPS kicked in during test scenarios that had a higher benchmark load. SDPS is an advanced feature of vMotion that ensures migration will not fail due to memory copy convergence issues. Whenever vMotion detects that the guest memory dirty rate is higher than the available network bandwidth, it injects microsecond latencies into guest execution to throttle the page dirty rate, so the network transfer can catch up with the dirty rate. So, we recommend delaying the vMotion of heavily loaded VMs in hybrid cloud environments with shared-bandwidth links, which will prevent a slowdown in guest execution.

To learn more about SDPS, see “VMware vSphere vMotion Architecture, Performance, and Best Practices.”

vMotion across multiple availability zones in the SDDC

Every AWS region has multiple availability zones (AZ). Amazon does not provide service level agreements beyond an availability zone. For reasons such as failover support, VMware Cloud on AWS customers can choose an SDDC deployment that spans multiple availability zones in a single AWS region.

There are certain vMotion performance implications with respect to the SDDC deployment configuration.

Figure 7.  vMotion peak network throughput in a single availability zone vs. multiple availability zones

As shown in figure 7, vMotion peak network throughput depends on the host placement in the SDDC.

This is because vMotion uses a single TCP stream in the VMware Cloud environment. If the vMotion source and destination hosts are within the same availability zone, vMotion peak throughput can reach as high as 10 gigabits per second (Gbps), limited only by the CPU core speed. However, if the source and destination hosts are across availability zones, vMotion peak throughput is governed by the AWS rate limiter. The throughput of any single TCP or UDP stream across availability zones is limited to 5 Gbps by the AWS rate limiter.

Conclusion

In summary, our performance test results show the following:

  • vMotion lets you migrate workloads seamlessly across traditional, on-premises data centers and software-defined data centers on AWS Cloud.
  • vMotion offers the same standard performance guarantees across the hybrid cloud environment, including less than 1 second of vMotion execution switch-over time and minimal impact on guest performance.


vSAN Performance Diagnostics Now Shows “Specific Issues and Recommendations” for HCIBench

By Amitabha Banerjee and Abhishek Srivastava

The vSAN Performance Diagnostics feature, which helps customers to optimize their benchmarks or their vSAN configurations to achieve the best possible performance, was first introduced in vSphere 6.5 U1. vSAN Performance Diagnostics is a “cloud connected” feature and requires participation in the VMware Customer Experience Improvement Program (CEIP). Performance metrics and data are collected from the vSAN cluster and are sent to the VMware Cloud. The data is analyzed and the results are sent back for display in the vCenter Client. These results are shown as performance issues, where each issue includes a problem with its description and a link to a KB article.

In this blog, we describe how vSAN Performance Diagnostics can be used with HCIBench and show the new feature in vSphere 6.7 U1 that provides HCIBench specific issues and recommendations.

What is HCIBench?

HCIBench (Hyper-converged Infrastructure Benchmark) is a standard benchmark that vSAN customers can use to evaluate the performance of their vSAN systems. HCIBench is an automation wrapper around the popular and proven VDbench open source benchmark tool that makes it easier to automate testing across an HCI cluster. HCIBench, available as a fling, simplifies and accelerates customer performance testing in a consistent and controlled way.

Example: Achieving Maximum IOPS

As an example, consider the following HCIBench workload that is run on a vSAN system:

  • Number of VMs: 1
  • Number of Disks (vmdks) to Test: 10
  • Number of Threads/Disk: 1
  • Working Set Percentage: 100
  • Block Size: 4 KB
  • Read/Write Percentage: 0/100
  • Randomness Percentage: 100

If the goal is to achieve maximum IOs per second (IOPS), vSAN Performance Diagnostics for this workload yields the result shown in figure 1.

Figure 1

In this example, vSAN Performance Diagnostics reports, “The Outstanding IOs for the benchmark is too low to achieve the desired goal”. Here we can see that the feedback from vSAN Performance Diagnostics tells us about the problem and recommends a possible solution: increase the number of outstanding IOs to a value of 2 per host. The linked “Ask VMware” article explains this issue and what we must do with the benchmark in more detail.

The new HCIBench Exceptions and Recommendations feature removes the need to read through a KB article by precisely mapping recommendations to one (or more) configurable parameters of HCIBench.

Now let us check how vSAN Performance Diagnostics works with vSphere 6.7 U1, which has access to this new feature (figure 2).

Figure 2

Now we can clearly see the exact issue in our HCIBench workload configuration, along with the precise recommendation to resolve it: “Increase number of threads per disk from 1 to 2”.

Let us monitor the current write IOPS generated by the benchmark for our reference. We go to Data Center > Cluster > Monitor > vSAN > Performance (figure 3).

Figure 3

Now, as part of the evaluation, we apply the recommendation generated by vSAN Performance Diagnostics and check its impact. We now run the new HCIBench workload configuration with 2 threads per disk and the following parameters:

  • Number of VMs: 1
  • Number of Disks (vmdks) to Test: 10
  • Number of Threads/Disk: 2
  • Working Set Percentage: 100
  • Block Size: 4 KB
  • Read/Write Percentage: 0/100
  • Randomness Percentage: 100

After the benchmark completes, we use vSAN Performance Diagnostics again to see if we can now achieve the required goal of maximum IOPS. The result from Performance Diagnostics now shows “No Issues were found”, which means that we are achieving good IOPS from the vSAN system (figure 4).

Figure 4

Now, as part of actual verification, we can see the change in IOPS after applying the recommendation. From figure 5 below (screenshot from Data Center > Cluster > Monitor > vSAN > Performance), we can clearly see that there is a 25-30% increase in IOPS after applying the recommendation, which verifies that it helped us achieve our goal.

Figure 5

We believe that this feature will be very useful for customers who want to tune their HCIBench workload for a desired goal.

Prerequisites

  • Please note that this feature is currently integrated with HCIBench 1.6.7 or later.
  • This feature is available for vSphere 6.7 U1 and newer releases. It will not be available to patch releases of vSphere 6.7.

Storage DRS Performance Improvements in vSphere 6.7

Virtual machine (VM) provisioning operations such as create, clone, and relocate involve the placement of storage resources. Storage DRS (sometimes seen as “SDRS”) is the resource management component in vSphere responsible for optimal storage placement and load balancing recommendations in the datastore cluster.

A key contributor to VM provisioning times in Storage DRS-enabled environments is the time it takes (latency) to receive placement recommendations for the VM disks (VMDKs). This latency particularly comes into play when multiple VM provisioning requests are issued concurrently.

Several changes were made in vSphere 6.7 to improve the time to generate placement recommendations for provisioning operations. Specifically, the level of parallelism was improved for the case where there are no storage reservations for VMDKs. This resulted in significant improvements in recommendation times when there are concurrent provisioning requests.
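For context, the recommendation latency discussed here is the time to answer a placement call through the vSphere API’s StorageResourceManager. The pyVmomi sketch below shows the rough shape of such a call for a clone into a datastore cluster; it is a simplified illustration (connection handling, inventory lookups, and the clone specification are reduced to placeholder names such as si, source_vm, dest_folder, and storage_pod), not the harness used for these measurements.

# Hedged sketch: asking Storage DRS for placement recommendations for a clone.
# Assumes si (a connected ServiceInstance), source_vm, dest_folder, and storage_pod
# (the datastore cluster) have already been looked up; the names are illustrative.
from pyVmomi import vim

content = si.RetrieveContent()
srm = content.storageResourceManager

spec = vim.storageDrs.StoragePlacementSpec(
    type="clone",
    vm=source_vm,
    folder=dest_folder,
    cloneName="clone-01",
    cloneSpec=vim.vm.CloneSpec(location=vim.vm.RelocateSpec(),
                               powerOn=False, template=False),
    podSelectionSpec=vim.storageDrs.PodSelectionSpec(storagePod=storage_pod),
)

# The latency of this call is the "recommendation time" measured in this post.
result = srm.RecommendDatastores(storageSpec=spec)
for rec in result.recommendations:
    destinations = [action.destination.name for action in rec.action]
    print(rec.key, rec.rating, destinations)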

vRealize Automation suite users who use blueprints to deploy large numbers of VMs quickly will notice the improvement in provisioning times for the case when no reservations are used.

Further performance optimizations were made inside key steps of processing Storage DRS recommendations. These improved the time to generate recommendations, even for standalone provisioning requests with or without reservations.

Test Setup and Results

We ran several performance tests to measure the improvement in recommendation times between vSphere 6.5 and vSphere 6.7. We ran these tests in our internal lab setup consisting of hundreds of VMs and a few thousand VMDKs. The VM operations are:

  1. CreateVM – A single VM per thread is created.
  2. CloneVM – A single clone per thread is created.
  3. ReconfigureVM – A single VM per thread is reconfigured to add an additional VMDK.
  4. RelocateVM – A single VM per thread is relocated to a different datastore.
  5. DatastoreEnterMaintenance – Put a single datastore into maintenance mode. This is a non-concurrent operation.

Shown below are the relative improvements in recommendation times for VM operations at varying concurrencies. The y-axis has a numerical limit of 10, to allow better visualization of the relative values of the average recommendation time. 

The concurrent VM operations show an improvement of between 20x and 30x in vSphere 6.7 compared to vSphere 6.5 

Below we see the relative average time taken among all runs for serial operations.

The Datastore Enter Maintenance operation shows an improvement of nearly 14x in vSphere 6.7 compared to vSphere 6.5

With much faster storage DRS recommendation times, we expect customers to be able to provision multiple VMs much faster to service their in-house demands. Specifically, we expect VMware vRealize Automation suite users to hugely benefit from these improvements.

SPBM compliance check just got faster in vSphere 6.7 U1!

vSphere 6.7 U1 includes several enhancements in Storage Policy-Based Management (SPBM) to significantly reduce CPU use and generate a much faster response time for compliance checking operations.

SPBM is a framework that allows vSphere users to translate their workload’s storage requirements into rules called storage policies. Users can apply storage policies to virtual machines (VMs) and virtual machine disks (VMDKs) using the vSphere Client or through the VMware Storage Policy API’s rich set of managed objects and methods. One such managed object is PbmComplianceManager. One of its methods, PbmCheckCompliance, helps users determine whether or not the storage policy attached to their VM is being honored.

PbmCheckCompliance is automatically invoked soon after provisioning operations such as creating, cloning, and relocating a VM. It is also automatically triggered in the background once every 8 hours to help keep the compliance records up-to-date.

In addition, users can invoke the method when checking compliance for a VM storage policy in the vSphere Client, or through the VMware Storage Policy API method PbmCheckCompliance.
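As a rough illustration of the call being measured, the pyVmomi sketch below checks compliance for a list of VMs through the PBM bindings. It assumes a PBM connection has already been established (pbm_content stands for the retrieved PBM service content; endpoint and session setup are omitted) and that vms is a placeholder list of VM objects; it is not the test driver used for this study.

# Hedged sketch: invoking PbmCheckCompliance for a list of VMs.
# Assumes pbm_content is an already-connected PBM service content object and
# vms is a list of vim.VirtualMachine objects obtained from vCenter.
from pyVmomi import pbm

compliance_mgr = pbm_content.complianceManager

# PBM identifies vSphere objects by reference rather than by managed object.
entities = [
    pbm.ServerObjectRef(key=vm._moId, objectType="virtualMachine")
    for vm in vms
]

# This is the call whose latency is compared between vSphere 6.5 U2 and 6.7 U1.
results = compliance_mgr.PbmCheckCompliance(entities=entities)
for result in results:
    print(result.entity.key, result.complianceStatus)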

We did a study in our lab to compare the performance of PbmCheckCompliance between vSphere 6.5 U2 and vSphere 6.7 U1. We present this comparison in the form of charts showing the latency (normalized on a 100-point scale) of PbmCheckCompliance for varying numbers of VMs.

The following chart compares the performance of PbmCheckCompliance on VMFS and vSAN environments.

As we see from the above chart, PbmCheckCompliance returns results much faster in vSphere 6.7 U1 compared to 6.5 U2. The improvement is seen across all inventory sizes and all datastore types, and becomes more prominent for larger inventories and higher numbers of VMs.

The enhancements also positively impact a similar method, PbmCheckRollupCompliance. This method also returns the compliance status of VMs and adds compliance results for all disks associated with these VMs. The following chart represents the performance comparison of PbmCheckRollupCompliance on VMFS and vSAN environments.

Our experiments show that compliance check operations are significantly faster and more lightweight in vSphere 6.7 U1.

Sharing GPU for Machine Learning/Deep Learning on VMware vSphere with NVIDIA GRID: Why is it needed? And How to share GPU?

By Lan Vu, Uday Kurkure, and Hari Sivaraman 

Data scientists may use GPUs on vSphere that are dedicated to one virtual machine for their modeling work, if they need to. Certain heavier machine learning workloads may well require that dedicated approach. However, there are also many ML workloads and user types that do not use a dedicated GPU continuously to its maximum capacity. This presents an opportunity for shared use of a physical GPU by more than one virtual machine/user. This article explores the performance of such a shared-GPU setup, supported by the NVIDIA GRID product on vSphere, and presents performance test results showing that sharing is a feasible approach. The other technical reasons for sharing a GPU among multiple VMs are also described here. The article also gives best practices for determining how the sharing of a GPU may be done.

VMware vSphere supports NVIDIA GRID technology for multiple types of workloads. This technology virtualizes GPUs via a mediated passthrough mechanism. Initially, NVIDIA GRID supported GPU virtualization for graphics workloads only. But since the introduction of the Pascal GPU, NVIDIA GRID has supported GPU virtualization for both graphics and CUDA/machine learning workloads. With this support, multiple VMs running GPU-accelerated workloads like machine learning/deep learning (ML/DL) based on TensorFlow, Keras, Caffe, Theano, Torch, and others can share a single GPU by using a vGPU provided by GRID. This brings benefits in multiple use cases that we discuss in this post.

Each vGPU is allocated a dedicated amount of GPU memory; a vGPU profile specifies how much device memory each vGPU has and the maximum number of vGPUs per physical GPU. For example, if you choose the P40-1q vGPU profile for the Pascal P40 GPU, you can have up to 24 VMs with vGPUs because the P40 has a total of 24 GB of device memory. More information about virtualized GPUs on vSphere can be found in our previous blog here.

Figure 1: NVIDIA GRID vGPU 

Why do we need to share GPUs?

Sharing GPUs can help increase system consolidation and resource utilization, and reduce the deployment costs of ML/DL workloads. GPU-accelerated ML/DL workloads include training and inference tasks, and their GPU usage patterns are different. Training workloads are mostly run by data scientists and machine learning engineers during the research and development phase of an application. Because model training is just one of many tasks in ML application development, each user’s need for GPUs is usually irregular. For example, a data scientist does not spend the whole workday just training models because he/she has other things to do, like checking and answering emails, attending meetings, researching and developing new ML algorithms, collecting and cleaning data, and so on. Hence, sharing GPUs among a group of users helps increase GPU utilization without giving up much of the performance benefit of the GPU.

To illustrate this scenario of using a GPU for training, we conducted an experiment in which 3 VMs (or 3 users) used vGPUs to share a single NVIDIA P40 GPU, and each VM ran the same ML/DL training workload at different times. The ML workloads inside VM1 and VM2 were run at times t1 and t2, so that about 25% of the GPU execution time of VM1 and VM2 overlapped. VM3 ran its workload at t3, and it was the only GPU-based workload running in that timeframe. Figure 2 depicts this use case, in which the black dashed arrows indicate VMs accessing the GPU concurrently. If you run your applications inside containers, please also check out our previous blog post on running container-based applications inside a VM.

Figure 2: A use case of running multiple ML jobs on VMs with vGPUs

In our experiments, we used CentOS VMs with P40-1q vGPU profiles, 12 vCPUs, 60 GB memory, and a 96 GB disk, and ran TensorFlow-based training loads on those VMs: complex language modeling using a recurrent neural network (RNN) with 1500 long short-term memory (LSTM) units per layer on the Penn Treebank dataset (PTB) [1, 2], and handwriting recognition using a convolutional neural network (CNN) with the MNIST dataset [3]. We ran the experiment on a Dell PowerEdge R740 with dual 18-core Intel Xeon Gold 6140 sockets and an NVIDIA Pascal P40 GPU.
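The benchmark scripts themselves aren’t shown here, but the handwriting-recognition load is representative of a small Keras CNN trained on MNIST. The sketch below is a generic stand-in (the layer sizes and epoch count are arbitrary choices, not our benchmark configuration) that would place a similar training load on the vGPU when TensorFlow is built with GPU support.

# Illustrative MNIST CNN training load (not the exact benchmark configuration).
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# On a VM with a vGPU, TensorFlow schedules this training onto the shared GPU.
model.fit(x_train, y_train, epochs=5, batch_size=128,
          validation_data=(x_test, y_test))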

Figure 3 and Figure 4 show the normalized training time of VM1, VM2, and VM3, in which VM1 and VM2 see a performance impact of 16%–23%, while VM3 sees no performance impact. In this experiment, we used the Best Effort scheduler of GRID, which means VM3 fully utilized the GPU time during its application execution.

Figure 3: Training time of Language Modeling

Figure 4: Training time of Handwriting Recognition 

For inference workloads, the performance characteristics can vary based on the usage frequency of the GPU-based applications in the production environment. Less intensive GPU workloads allow more apps running inside VMs to share a single GPU. For example, a GPU-accelerated database app and other ML/DL apps can share the same GPUs on the same vSphere host if their performance requirements are still met.

How many vGPUs per physical GPU is good?

The decision to share a GPU among ML/DL workloads running on multiple VMs, and how many VMs to allow per physical GPU, depends on the GPU usage of the ML applications. When users or applications do not use the GPU very frequently, as shown in the previous example, sharing the GPU can bring huge benefits because it significantly reduces the hardware, operation, and management costs. In this case, you can assign more vGPUs per physical GPU. If your workloads use the GPU intensively and require continuous access to it, sharing can still bring some benefits because GPU-based application execution includes CPU time, GPU time, I/O time, and so on, and sharing a GPU helps fill the gap when applications spend time on CPU or I/O. However, in this case, you need to assign fewer vGPUs per physical GPU.

To determine how many VMs with vGPU per physical GPU are needed, you can base this on your evaluation of usage frequency or the GPU utilization history of the applications. In the case of GRID GPU on vSphere, you can monitor GPU utilization information by using the command nvidia-smi on the vSphere hypervisor. 
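One rough way to gather that usage history is to sample nvidia-smi periodically and count how often the GPU is busy; the sketch below shows such a sampler. The query options are standard nvidia-smi flags, but the sampling interval, the “in use” threshold, and where you run and aggregate this are assumptions to adapt to your environment.

# Hedged sketch: sample GPU utilization once a minute to estimate how busy the GPU is.
import subprocess
import time

SAMPLES = 60  # one hour at one sample per minute

busy = 0
for _ in range(SAMPLES):
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        text=True,
    )
    utilization = int(out.splitlines()[0])  # first GPU only, as an example
    busy += int(utilization > 10)           # arbitrary "in use" threshold
    time.sleep(60)

print(f"GPU was in use in {100.0 * busy / SAMPLES:.0f}% of samples")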

We evaluated the performance of ML/DL workloads in the worst case, when all VMs use the GPU at the same time. To do this, we ran the same MNIST handwriting recognition training on multiple VMs, each with a vGPU, concurrently sharing a single Pascal P40 GPU. Each VM had a P40-1q vGPU.

The experiment in this scenario is depicted in Figure 5 with the number of concurrent VMs in our test ranging from 1 to 24 VMs.  

Figure 5: Running multiple ML jobs on VMs with vGPUs concurrently 

Figure 6 presents the normalized training time of this experiment. As the number of concurrent ML jobs increases, the training time of each job also increases because they share a single GPU. However, the increase in training time is not as fast as the increase in the number of VMs. For example, when we have 24 VMs running concurrently, the execution time increases, at most, 17 times instead of 24 times or higher. This means that even in the worst case, where all VMs use the GPU at the same time, we still see the benefits of GPU sharing. Please note that in the typical use case of training, as mentioned earlier, not all users or applications use the GPU 24/7. If they do, you can just reduce the number of vGPUs per GPU until the expected performance and consolidation are reached.

Figure 6: Training time with different numbers of VMs

vGPU scheduling

When all VMs with GPU loads run concurrently, NVIDIA GRID manager schedules the jobs into the GPU based on time slicing. NVIDIA GRID supports three vGPU scheduling options: Best Effort, Equal Share, and Fixed Share. The selection of a vGPU scheduling option depends on use cases. The Best Effort scheduler allocates GPU time to VMs in a round-robin fashion. In the above experiments, we used the Best Effort scheduler. For some circumstances, a VM running a GPU-intensive application may affect the performance of a GPU-lightweight application running in other VMs. To avoid such performance impact and ensure quality of service (QoS), you can choose to switch to the Equal Share or Fixed Share scheduler. The Equal Share scheduler ensures equal share of GPU time for each powered-on VM. The Fixed Share scheduler gives a fixed share of GPU time to a VM based on the vGPU profile that is associated with each VM on the physical GPU.  

For performance comparison, we ran the MNIST handwriting recognition training load using different schedulers, Best Effort and Equal Share, for different numbers of VMs.

Figure 7 presents the normalized training time and Figure 8 presents GPU utilization. As the number of VMs increases, Best Effort shows better performance because, when a VM does not use its time slice, that time slice is assigned to another VM that needs the GPU. Meanwhile, with Equal Share, that time slice is always reserved for the VM even if it does not utilize the GPU at that moment. Therefore, the Best Effort scheduler has better GPU utilization, as shown in Figure 8.

Figure 7: Training time of Best Effort vs. Equal Share

Figure 8: GPU utilization of Best Effort vs. Equal Share 

Takeaways

  • Sharing a GPU among VMs using NVIDIA GRID can help increase the consolidation of VMs with vGPUs and reduce hardware, operation, and management costs.
  • The performance impact of sharing a GPU is small in typical use cases, when the GPU is used infrequently by users.
  • Choosing how many vGPUs to assign per GPU should be based on the real ML/DL load. For infrequent and lightweight GPU workloads, you can assign multiple vGPUs per GPU. For workloads that use the GPU frequently, you should lower the number of vGPUs per GPU until the performance requirement is met.

Acknowledgments

We would like to thank Aravind Bappanadu, Juan Garcia-Rovetta, Bruce Herndon, Don Sullivan, Charu Chaubal, Mohan Potheri, Gina Rosenthal, Justin Murray, Ziv Kalmanovich for their support of this work and thank Julie Brodeur for her help in reviewing and recommendations for this blog post.

References

[1] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals, “Recurrent Neural Network Regularization,” arXiv:1409.2329, 2014.

[2] Ann Taylor, Mitchell Marcus, and Beatrice Santorini, “The Penn Treebank: An Overview,” in Treebanks: The State of the Art in Syntactically Annotated Corpora, ed. Anne Abeille, Kluwer, 2003.

[3] Yann LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based Learning Applied to Document Recognition,” Proceedings of the IEEE, 86(11):2278-2324, November 1998.

 

 

DRS Enhancements in vSphere 6.7

A new paper describes the DRS enhancements in vSphere 6.7, which include new initial placement, host maintenance mode enhancements, DRS support for non-volatile memory (NVM), and enhanced resource pool reservations.

Resource pool and VM entitlements—old and new models

A summary of the improvements follows:

  • All DRS configurations can now take advantage of much faster placement and more accurate recommendations in vSphere 6.7. In vSphere 6.5, some configurations, such as VMs with fault tolerance (FT) enabled, were not supported.
  • Starting with vSphere 6.7, DRS uses the new initial placement algorithm to come up with the recommended list of hosts to be placed in maintenance mode. Further, when evacuating the hosts, DRS uses the new initial placement algorithm to find new destination hosts for outgoing VMs.
  • DRS in vSphere 6.7 can handle VMs running on next generation persistent memory devices, also known as Non-Volatile Memory (NVM) devices.
  • There is a new two-pass algorithm that allocates a resource pool’s resource reservation to its children (also known as divvying).

For more information about all of these updates, see DRS Enhancements in vSphere 6.7.

VMware’s AI-based Performance Tool Can Improve Itself Automatically

PerfPsychic, our AI-based performance analysis tool, improves its accuracy rate from 21% to 91% with more data and training when debugging vSAN performance issues. Better still, PerfPsychic can continuously improve itself, and the tuning procedure is automated. Let’s examine how we achieve this in the following sections.

How to Improve AI Model Accuracy

Three elements have a huge impact on the training results for deep learning models: the amount of high-quality training data, reasonably configured hyperparameters that are used to control the training process, and sufficient, but not excessive, training time. In the following examples, we use the same training and testing dataset as we presented in our previous blog.

Amount of Training Data

The key to PerfPsychic is its deep learning pipeline, so we start by demonstrating its effectiveness, gradually adding more labeled data to the training dataset. This shows how our models learn from more labeled data and improve their accuracy over time. Figure 1 shows the results: we start with only 20% of the training dataset and gradually label 20% more each time. There is a clear trend that, as more properly labeled data is added, our model learns and improves its accuracy, without any further human intervention. The accuracy improves from around 50% when we have only about 1,000 data points to 91% when we have the full set of 5,275 data points. Such accuracy is as good as a programmatic analytic rule that took us three months to tune manually.

Figure 1. Accuracy improvement over larger training datasets

Training Hyperparameters

We next vary several other CNN hyperparameters to demonstrate how they were selected for our models. We change only one hyperparameter at a time and train 1,000 CNNs using each configuration. We first vary the number of training iterations, that is, how many times we go through the training dataset. If the number of iterations is too small, the CNNs cannot be trained adequately; if it is too large, training takes much longer and might end up overfitting to the training data. As shown in Figure 2, between 50 and 75 iterations is the best range, where 75 iterations achieves the best accuracy of 91%.

Figure 2. Number of training iterations vs. accuracy

We next vary the step size, which is the granularity of our search for the best model. In practice, with a small step size, the optimization is so slow that it cannot reach the optimal point in a limited time. With a large step size, we risk skipping past optimal points. Figure 3 shows that between 5e-3 and 7.5e-3 the model produces good accuracy, where 5e-3 predicts 91% of the labels correctly.

Figure 3. Step size vs. accuracy

Last, we evaluate the impact of the issue rate of the training data on accuracy. Issue rate is the percentage of the training data that represents performance issues. In an ideal set of training data, all the labels should be equally represented to avoid overfitting. A biased dataset generally results in overfitting models that can barely achieve high accuracy. Figure 4 below shows that when the training data has an issue rate under 20% (that is, under 20% of the components are faulty), the model basically overfits to “no-issue” data points and predicts that all components are issue-free. Because 21.9% of the components in our testing data have no issues, it stays at 21.9% accuracy. In contrast, when we have an issue rate over 80%, the model simply treats all components as faulty and thus achieves 78.1% accuracy. This explains why it is important to ensure every label is equally represented, and why we mix our issue/no-issue data in a ratio between 40% and 60%.

Figure 4. Impact of issue rate

Training Duration

Training time is also an important factor in a practical deep learning pipeline design. As we train thousands of CNN models, spending one second longer to train a model means a whole training phase will take 1,000 seconds longer. Figure 5 below shows the training time vs. data size and the number of iterations. As we can see, both factors form a linear trend; that is, with more data and more iterations, training takes linearly longer. Fortunately, we know from the study above that more than 75 iterations will not help accuracy. By limiting the number of iterations, we can complete a whole phase of training in less than 9 hours. Again, once the off-line training is done, the model can perform real-time prediction in just a few milliseconds. The training time simply affects how often and how fast the models can pick up new feedback from product experts.

Figure 5. Effect of data size and iteration on training time

Automation

The model selection procedure is fully automated. Thousands of models with different hyperparameter settings are trained in parallel on our GPU-enabled servers. The trained models compete with each other by analyzing our prepared testing data, and we pick the model with the highest accuracy, put it into PerfPsychic, and use it for online analysis. Moreover, we keep a record of the hyperparameters of the winning models and use them as initial setups in future trainings. Therefore, our models can keep evolving.
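Conceptually, the selection loop looks like the sketch below: train one candidate per hyperparameter setting (in practice, in parallel on the GPU-enabled servers), score each candidate on the held-out test data, and keep the winner, whose hyperparameters seed the next round of training. The helpers train_cnn and evaluate are placeholders for the pipeline’s internals, not actual PerfPsychic code.

# Conceptual sketch of the automated model selection loop (helper names are placeholders).
import itertools

iteration_counts = [25, 50, 75, 100]
step_sizes = [2.5e-3, 5e-3, 7.5e-3, 1e-2]

def train_cnn(train_data, iterations, step_size):
    ...  # placeholder: train one CNN with this hyperparameter setting

def evaluate(model, test_data):
    ...  # placeholder: return accuracy on the labeled test set

def select_best_model(train_data, test_data):
    candidates = []
    # In production these trainings run in parallel on GPU-enabled servers.
    for iterations, step_size in itertools.product(iteration_counts, step_sizes):
        model = train_cnn(train_data, iterations, step_size)
        candidates.append((evaluate(model, test_data), (iterations, step_size), model))
    best_accuracy, best_params, best_model = max(candidates, key=lambda c: c[0])
    # The winning hyperparameters become the initial setup for future training rounds.
    return best_model, best_params, best_accuracy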

PerfPsychic in Application

PerfPsychic is not only a research effort, but also a widely used internal performance analysis tool. It is now used to automatically analyze vSAN performance bugs filed in Bugzilla.

PerfPsychic automatically detects new vSAN performance bugs submitted in Bugzilla and extracts the usable data logs from the bug attachments. It then analyzes the logs with the trained models. Finally, the analysis results, including performance enhancement suggestions, are emailed to the bug submitter and the vSAN developer group.

Below is part of an email received yesterday that gives performance tuning advice on a vSAN bug. Internal information is hidden.

Figure 6. Part of email generated by PerfPsychic to offer performance improvement suggestions