VMware and NVIDIA solutions deliver high performance in machine learning workloads

By Uday Kurkure, Lan Vu, and Hari Sivaraman

VMware, with Dell, submitted its MLPerf Inference v1.1 benchmark results to MLCommons. The accepted and published results show that high performance with machine learning workloads can be achieved on a VMware virtualized platform featuring NVIDIA GPU and AI technology.

The testbed consisted of a VMware + NVIDIA AI-Ready Enterprise Platform, which included:

VMware vSphere 7.0 U2 data center virtualization software
NVIDIA AI Enterprise software
Three NVIDIA A100 Tensor Core GPUs
Dell EMC PowerEdge R7525 rack server
Two AMD EPYC 7502 processors with 128 logical cores

To measure any virtualization overhead, VMware benchmarked this solution against an identical system running on bare metal (no virtualization). The MLPerf benchmark results showed that the virtualized system achieved 94.4% to 100% of the equivalent bare metal performance while using only 24 of the 128 logical CPU cores and three NVIDIA A100-40c vGPUs. On the virtualized platform, the remaining 104 logical CPU cores stay available for other demanding tasks in your data center. The solution achieves near bare metal performance while providing all the virtualization benefits of VMware vSphere: server consolidation, power savings, virtual machine over-commitment, vMotion, high availability, DRS, central management with vCenter, suspend/resume VMs, cloning, and more.
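The core accounting above can be checked with a few lines of Python. This is a minimal sketch: the 128 and 24 figures come from this article, while the per-socket core count and the two hardware threads per core are assumptions about the AMD EPYC 7502 that the article does not state.

```python
# Logical-core accounting for the testbed described above.
SOCKETS = 2                # from the article: two AMD EPYC 7502 processors
CORES_PER_SOCKET = 32      # assumption, not stated in the article
THREADS_PER_CORE = 2       # assumption (SMT enabled), not stated in the article
VM_LOGICAL_CORES = 24      # from the article: cores allocated to the benchmark VM

total_logical = SOCKETS * CORES_PER_SOCKET * THREADS_PER_CORE
remaining = total_logical - VM_LOGICAL_CORES
vm_share = VM_LOGICAL_CORES / total_logical

print(f"total logical cores: {total_logical}")
print(f"left for other work: {remaining}")
print(f"share used by the benchmark VM: {vm_share:.1%}")
```

Under these assumptions, the benchmark VM uses less than a fifth of the host's logical cores.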

VMware vSphere and NVIDIA AI Enterprise

VMware and NVIDIA have partnered to unlock the power of AI for every business by delivering an end-to-end enterprise platform optimized for AI workloads. This integrated platform delivers best-in-class AI software: the NVIDIA AI Enterprise suite. It’s optimized and exclusively certified by NVIDIA for the industry’s leading virtualization platform: VMware vSphere. The platform:

Accelerates the speed at which developers can build AI and high-performance data analytics
Enables organizations to scale modern workloads on the same VMware vSphere infrastructure in which they have already invested
Delivers enterprise-class manageability, security, and availability

Furthermore, with VMware vSphere with Tanzu, enterprises can run containers alongside their existing VMs.

Figure 1. NVIDIA and VMware products working alongside each other

VMware has pioneered compute, storage, and network virtualization, reshaping yesterday’s bare metal data centers into modern software-defined data centers (SDDC). Despite this platform availability, many machine learning workloads are still run on bare metal systems. Deep learning workloads are so compute-intensive that they require compute accelerators like NVIDIA GPUs and software optimized for AI; however, many accelerators are not yet fully virtualized. Deploying unvirtualized accelerators makes such systems difficult to manage when deployed at scale in data centers. VMware’s collaboration with NVIDIA brings virtualized GPUs to data centers, allowing data center operators to leverage the many benefits of virtualization.

Architectural Features of NVIDIA GPU Ampere A100 and TensorRT

VMware used virtualized NVIDIA A100 Tensor Core GPUs with NVIDIA AI Enterprise software in vSphere for MLPerf Inference v1.1 benchmarks. VMware also used NVIDIA TensorRT, which is included with the NVIDIA AI Enterprise software suite. NVIDIA TensorRT is an SDK for high performance, deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.

Figure 2. Five features of NVIDIA A100 GPUs

The NVIDIA Ampere architecture is designed to accelerate diverse cloud workloads, including high performance computing, deep learning training and inference, machine learning, data analytics, and graphics. We will focus on NVIDIA’s virtualization-related technologies: NVIDIA Multi-Instance GPU (MIG) and time-sliced vGPU.

NVIDIA A100 GPUs are shared among multiple VMs in two ways:

Temporal Sharing/Time Slice: All the CUDA cores are time-multiplexed among the VMs, while the GPU HBM2 memory is statically divided into equal partitions, one per VM. This mode is also known as time-sliced sharing.
Spatial Sharing/MIG: Each NVIDIA A100 GPU can be partitioned into up to seven physical slices, and slices can be combined into larger instances. Each instance supports a single VM, and there is no time slicing. This mode is known as Multi-Instance GPU (MIG).

Table 1 below describes NVIDIA Ampere for data center deployment.

Table 1. NVIDIA GPUs for Data Center Deployment in VMware vSphere

Virtualized NVIDIA A100 GPUs in VMware vSphere

The NVIDIA AI Enterprise software suite includes the software needed to virtualize an NVIDIA GPU, including the NVIDIA vGPU manager that is installed in the VMware ESXi hypervisor. VMs running different guest operating systems can share the GPU while running diverse ML/AI workloads. See Figure 3.

Figure 3. Virtualized NVIDIA A100 GPU with MIGs in VMware vSphere

Time-Sliced Virtual GPU Types for NVIDIA A100 PCIe 40GB

These vGPU types support a single display with a fixed maximum resolution.

For details of GPU instance profiles, see NVIDIA Multi-Instance GPU User Guide.

Table 2 shows different types of vGPU profiles in time-sliced mode to be attached to VMs.

Virtual NVIDIA GPU Type	Intended Use Case	Frame Buffer (MB)	Maximum vGPUs per GPU	Maximum vGPUs per Board	Maximum Display Resolution	Virtual Displays per vGPU
A100-40C	Training Workloads	40960	1	1	4096×2160	1
A100-20C	Training Workloads	20480	2	2	4096×2160	1
A100-10C	Training Workloads	10240	4	4	4096×2160	1
A100-8C	Training Workloads	8192	5	5	4096×2160	1
A100-5C	Inference Workloads	5120	8	8	4096×2160	1
A100-4C	Inference Workloads	4096	10	10	4096×2160	1

Table 2. A100 vGPU profiles
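The "Maximum vGPUs per GPU" column in Table 2 follows from the static, equal division of frame buffer described earlier. The sketch below reproduces that column from the frame buffer sizes; the function name is hypothetical, and the integer-division rule is an inference from the table values rather than a documented NVIDIA formula.

```python
# Frame buffer sizes in MB, copied from Table 2.
A100_FRAME_BUFFER_MB = 40960
TIME_SLICED_PROFILES_MB = {
    "A100-40C": 40960,
    "A100-20C": 20480,
    "A100-10C": 10240,
    "A100-8C": 8192,
    "A100-5C": 5120,
    "A100-4C": 4096,
}

def max_vgpus_per_gpu(profile_mb, gpu_mb=A100_FRAME_BUFFER_MB):
    """Number of equal frame-buffer partitions of this size that fit on one GPU."""
    return gpu_mb // profile_mb

for name, mb in TIME_SLICED_PROFILES_MB.items():
    print(f"{name}: up to {max_vgpus_per_gpu(mb)} vGPU(s) per GPU")
```

Because the memory split is static, choosing a smaller profile raises the number of VMs that can share one GPU but lowers the memory available to each.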

MIG-Backed Virtual GPU Types for NVIDIA A100 PCIe 40GB

For details of GPU instance profiles, see NVIDIA Multi-Instance GPU User Guide.

Table 3 shows MIG-backed vGPU profiles for the A100 GPU.

Virtual NVIDIA GPU Type	Intended Use Case	Frame Buffer (MB)	Maximum vGPUs per GPU	Slices per vGPU	Compute Instances per vGPU	Corresponding GPU Instance Profile
A100-7-40C	Training Workloads	40960	1	7	7	MIG 7g.40gb
A100-4-20C	Training Workloads	20480	1	4	4	MIG 4g.20gb
A100-3-20C	Training Workloads	20480	2	3	3	MIG 3g.20gb
A100-2-10C	Training Workloads	10240	3	2	2	MIG 2g.10gb
A100-1-5C	Inference Workloads	5120	7	1	1	MIG 1g.5gb
A100-1-5CME	Inference Workloads	5120	1	1	1	MIG 1g.5gb+me

Table 3. A100 MIG-backed vGPU Profiles
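For MIG-backed profiles, the per-GPU limit in Table 3 is governed by slices rather than by frame buffer alone. The following sketch, with a hypothetical helper name, checks the table values against a simple rule (seven slices divided by slices per vGPU). The rule is inferred from the table; A100-1-5CME is left out because Table 3 lists a maximum of 1 for it, which this arithmetic does not explain.

```python
A100_MIG_SLICES = 7  # an A100 can be partitioned into up to seven slices

# name: (slices per vGPU, maximum vGPUs per GPU), both copied from Table 3.
MIG_PROFILES = {
    "A100-7-40C": (7, 1),
    "A100-4-20C": (4, 1),
    "A100-3-20C": (3, 2),
    "A100-2-10C": (2, 3),
    "A100-1-5C": (1, 7),
}

def max_mig_vgpus(slices_per_vgpu, total_slices=A100_MIG_SLICES):
    """How many instances of this size fit into the GPU's slices."""
    return total_slices // slices_per_vgpu

for name, (slices, listed_max) in MIG_PROFILES.items():
    computed = max_mig_vgpus(slices)
    status = "matches" if computed == listed_max else "differs from"
    print(f"{name}: computed {computed}, {status} Table 3 ({listed_max})")
```

Note that a 4-slice instance leaves three slices unused by that profile, which is why only one A100-4-20C vGPU fits on a GPU.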

MLPerf Inference Performance in VMware vSphere with NVIDIA vGPU

VMware benchmarked the following datacenter applications from the MLPerf Inference v1.1 suite (see Table 4). The official results for these two benchmarks are published by MLCommons.

ML/AI workloads are becoming pervasive in today’s data centers and cover many domains. To show the flexibility of VMware vSphere virtualization in disparate environments, we chose to publish two of the most popular types of workloads: natural language processing, represented by BERT; and object detection, represented by SSD-ResNet34. To quantify any virtualization overhead, we ran each workload in both virtualized and bare metal environments.

Area	Task	Model	Dataset	QSL Size	Quality	Server latency constraint
Vision	Object detection (large)	SSD-ResNet34	COCO (1200×1200)	64	99% of FP32 (0.20 mAP)	100 ms
Language	Language processing	BERT	SQuAD v1.1 (max_seq_len=384)	10833	99% of FP32 and 99.9% of FP32 (f1_score=90.874%)	130 ms

Table 4. MLPerf Inference Benchmarks presented

We focused on the Offline and Server scenarios. The Offline scenario processes queries in a batch where all the input data is immediately available, so latency is not a critical metric. In the Server scenario, queries arrive at random times that follow a Poisson distribution with a configured mean arrival rate. Each query contains only one sample, and the latency for serving a query is a critical metric.
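The difference between the two scenarios can be illustrated with a small simulation of query arrival times. This is a sketch only, not the MLPerf load generator: the function names and the target rate of 1000 queries per second are hypothetical, and Poisson arrivals are modeled in the standard way as exponentially distributed gaps between queries.

```python
import random

TARGET_QPS = 1000.0  # hypothetical mean arrival rate, not a value from the article

def poisson_arrival_times(rate_qps, num_queries, seed=0):
    """Arrival timestamps (seconds) for a Poisson process with the given mean rate."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_queries):
        t += rng.expovariate(rate_qps)  # exponential gap between consecutive queries
        times.append(t)
    return times

def offline_arrival_times(num_queries):
    """Offline scenario: every sample is available at time zero."""
    return [0.0] * num_queries

arrivals = poisson_arrival_times(TARGET_QPS, num_queries=10_000)
observed_rate = len(arrivals) / arrivals[-1]
print(f"observed mean arrival rate: {observed_rate:.0f} queries/second")
```

In the Server case the system cannot batch freely, because each query must be answered within the latency constraint listed in Table 4 regardless of when the next query happens to arrive.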

Hardware/Software Configurations

Table 5 describes the hardware and software configurations used for the bare metal and virtual runs. The most salient difference is that the virtual configuration used virtualized A100 GPUs, denoted NVIDIA GRID A100-40c vGPU. Both systems had the same three A100-PCIE-40GB physical GPUs. The benchmarks were optimized with NVIDIA TensorRT and used NVIDIA Triton Inference Server.

	Bare Metal	Virtual Configuration
System	Dell EMC PowerEdge R7525	Dell EMC PowerEdge R7525
Processors	2x AMD EPYC 7502	2x AMD EPYC 7502
Logical Processors	128	24 allocated to the VM
GPU	3x NVIDIA A100-PCIE-40GB	3x NVIDIA GRID A100-40c vGPU
Memory	512 GB	448 GB for the VM
Storage	3.8 TB SSD	3.8 TB SSD
OS	CentOS 8.2	Ubuntu 20.04 VM in VMware vSphere 7.0.2
NVIDIA AIE VIB for ESXi	–	vGPU GRID Driver 470.60
NVIDIA Driver	470.57	470.57
CUDA	11.3	11.3
TensorRT	8.0.2	8.0.2
Container	Docker 20.10.2	Docker 20.10.2
MLPerf Inference	V1.1	V1.1

Table 5. Bare Metal vs. Virtual Server Configurations

MLPerf Inference Performance Results for Bare Metal and Virtual Configurations

Figure 4 compares throughput (queries processed per second) for the MLPerf Inference workloads using VMware vSphere 7.0 U2 with NVIDIA vGPUs against the bare metal configuration. The bare metal baseline is set to 1.00, and each virtualized result is presented relative to that baseline.

Figure 4 clearly shows that VMware vSphere with NVIDIA vGPUs delivers near bare metal performance ranging from 94.4% to 100% for offline and server scenarios for MLPerf Inference benchmarks.

Figure 4. Normalized throughput (qps): NVIDIA vGPU vs bare metal

Table 6 shows throughput numbers for MLPerf Inference workloads.

Benchmark	Bare Metal 3x NVIDIA A100	NVIDIA vGPU 3x A100-40c	vGPU/BM
Bert Offline	8465	8475	1.001
Bert Server	7871	7801	0.991
Bert-99.9* Offline	4228	4183	0.989
Bert-99.9* Server	3814	3602	0.944
ssd-resnet34 Offline	2450	2452	1.001
ssd-resnet34 Server	2390	2310	0.967

Table 6. vGPU vs. bare metal throughput (queries/second)

* Bert-99.9 is the BERT deployment that meets the 99.9% of FP32 accuracy target listed in Table 4.

The above results are published by MLCommons.
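The vGPU/BM column in Table 6 is the virtualized throughput divided by the bare metal throughput. A short sketch that recomputes it from the published numbers (the variable names are illustrative):

```python
# (bare metal qps, vGPU qps) copied from Table 6.
THROUGHPUT_QPS = {
    "Bert Offline": (8465, 8475),
    "Bert Server": (7871, 7801),
    "Bert-99.9 Offline": (4228, 4183),
    "Bert-99.9 Server": (3814, 3602),
    "ssd-resnet34 Offline": (2450, 2452),
    "ssd-resnet34 Server": (2390, 2310),
}

# Normalize each virtualized result to its bare metal baseline.
ratios = {
    name: round(vgpu / bare_metal, 3)
    for name, (bare_metal, vgpu) in THROUGHPUT_QPS.items()
}

lowest = min(ratios, key=ratios.get)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
print(f"lowest ratio: {lowest} at {ratios[lowest]:.1%} of bare metal")
```

The lowest ratio, for Bert-99.9 Server, is the source of the 94.4% lower bound quoted throughout this article.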

Takeaways

The VMware/NVIDIA solution delivers from 94.4% to 100% of the bare metal performance for MLPerf Inference v1.1 benchmarks.
VMware achieved this performance with only 24 logical CPU cores out of 128 available CPU cores, thus leaving 104 logical CPU cores for other jobs in the data center. This is the extraordinary power of virtualization!
VMware vSphere combines the power of NVIDIA AI Enterprise software, which includes NVIDIA’s vGPU technology, with the many data center management benefits of virtualization.

Acknowledgements

VMware thanks Liz Raymond and Frank Han of Dell and Vinay Bagade, Charlie Huang, Anne Hecht, Manvendar Rawat, and Raj Rao of NVIDIA for providing the hardware and software for VMware’s MLPerf v1.1 inference submission. The authors would like to acknowledge Juan Garcia-Rovetta and Tony Lin of VMware for the management support.

References

MLCommons
https://mlcommons.org/en/
MLCommons v1.1 Results
https://mlcommons.org/en/inference-datacenter-11
NVIDIA Ampere Architecture
https://www.nvidia.com/en-us/data-center/ampere-architecture
NVIDIA Ampere Architecture In-Depth
https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth
NVIDIA Enterprise Documentation: Virtual GPU Types for Supported GPUs
https://docs.nvidia.com/ai-enterprise/latest/user-guide/index.html#supported-gpus-grid-vgpu
MIG or vGPU Mode for NVIDIA Ampere GPU: Which One Should I Use? (Part 1 of 3)
https://blogs.vmware.com/performance/2021/09/mig-or-vgpu-part1.html
Introduction to MLPerf Inference v1.1 with Dell EMC Servers
https://infohub.delltechnologies.com/p/introduction-to-mlperf-tm-inference-v1-1-with-dell-emc-servers
MLPerf Inference Virtualization in VMware vSphere Using NVIDIA vGPUs
https://blogs.vmware.com/performance/2020/12/mlperf-inference-virtualization-in-vmware-vsphere-using-nvidia-vgpus.html
NVIDIA TensorRT
https://developer.nvidia.com/tensorrt
NVIDIA Triton Inference Server
https://developer.nvidia.com/nvidia-triton-inference-server
J. Reddi et al., “MLPerf Inference Benchmark,” 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 2020, pp. 446-459, doi: 10.1109/ISCA45697.2020.00045.