By Uday Kurkure, Lan Vu, and Hari Sivaraman
Twenty-five years ago, VMware virtualized x86-based CPUs, and it has been a leader in virtualization technologies ever since. VMware is now repeating that feat in collaboration with NVIDIA and Dell by virtualizing accelerators for machine learning. We are announcing near bare metal, and in some cases better than bare metal, performance for the MLPerf™ Inference v3.0 benchmarks.
Now you can run ML workloads in VMware vSphere with virtualized NVIDIA GPUs and combine the power of both to manage your datacenter. VMware vSphere is the first and only virtualization platform used in MLPerf Inference submissions for the v0.7, v1.1, and v3.0 publications.
Demonstrating the power of virtualization, VMware, Dell, and NVIDIA achieved from 94% to 105% of the equivalent bare metal performance with the following configurations:
- Dell PowerEdge XE8545 server with 4x virtualized NVIDIA SXM A100-80GB GPU cards
- Dell PowerEdge R750xa with 2x virtualized NVIDIA H100-PCIE-80GB GPU cards
Both setups used only 16 of the 128 logical CPU cores, leaving the remaining 112 logical CPU cores available for other demanding tasks in the customer datacenter.
This demonstrates the extraordinary power of VMware virtualization solutions, which deliver near bare metal performance and the datacenter-management benefits of virtualization without any “virtualization tax.”
Note: A paper related to this topic is also available; refer to VMware vSphere 8 Performance Is in the “Goldilocks Zone” for AI/ML Training and Inference Workloads.
VMware and NVIDIA AI Enterprise
The partnership between VMware and NVIDIA brings virtualized GPUs to vSphere with NVIDIA AI Enterprise. This lets datacenter operators use the many benefits of VMware vSphere virtualization, such as cloning, vMotion, distributed resource scheduling, and suspending and resuming VMs, along with NVIDIA vGPU technology.
In this blog, we show the MLPerf Inference v3.0 test results for the vSphere virtualization platform with NVIDIA H100 and A100-based vGPUs. Our tests show that when NVIDIA vGPUs are used in vSphere, the workload performance is the same as or better than it is when run on a bare metal system.
MLPerf Inference Performance in vSphere with NVIDIA vGPU
VMware used the MLPerf Inference v3.0 suite to test the datacenter benchmarks shown in Table 1 below. MLCommons published the official results for these benchmarks.
ML/AI workloads are becoming pervasive in today’s datacenters and cover many domains. To show the flexibility of vSphere virtualization in disparate environments, we chose different types of workloads: natural language processing, represented by BERT; object detection, represented by RetinaNet; medical imaging, represented by 3D U-Net; and speech, represented by RNNT.
Area | Task | Model | Dataset | QSL Size | Quality | Scenarios | Server Latency Constraint |
---|---|---|---|---|---|---|---|
Vision | Object detection | RetinaNet | OpenImages (800×800) | 64 | 99% of FP32 (0.20 mAP) | Server, Offline | 100 ms |
Vision | Medical image segmentation | 3D U-Net | KiTS 2019 (602x512x512) | 16 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) | Offline | N/A |
Speech | Speech-to-text | RNNT | LibriSpeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 – WER, where WER=7.452253714852645%) | Server, Offline | 1000 ms |
Language | Language processing | BERT-large | SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | Server, Offline | 130 ms |
Table 1. The MLPerf Inference benchmarks used in our performance study
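To make the Quality column concrete: a valid result must reach a fixed fraction (99% or 99.9%) of the accuracy of the FP32 reference model. The following is a minimal sketch of that arithmetic using the reference values from Table 1; the variable names are ours, purely for illustration.

```python
# Accuracy targets implied by Table 1: each benchmark must reach a fixed
# fraction (99% or 99.9%) of the FP32 reference model's accuracy.

# BERT-large: 99% / 99.9% of the FP32 SQuAD v1.1 F1 score (90.874).
bert_fp32_f1 = 90.874
print(f"BERT 99% target F1:   {0.99 * bert_fp32_f1:.3f}")    # ~89.965
print(f"BERT 99.9% target F1: {0.999 * bert_fp32_f1:.3f}")   # ~90.783

# RNNT: 99% of the FP32 word accuracy (1 - WER), with WER = 7.4522...%.
rnnt_fp32_wer = 0.07452253714852645
print(f"RNNT 99% target accuracy: {0.99 * (1.0 - rnnt_fp32_wer):.4f}")  # ~0.9162

# 3D U-Net: 99% / 99.9% of the FP32 mean DICE score (0.86330).
unet_fp32_dice = 0.86330
print(f"3D U-Net 99% target DICE:   {0.99 * unet_fp32_dice:.5f}")   # ~0.85467
print(f"3D U-Net 99.9% target DICE: {0.999 * unet_fp32_dice:.5f}")  # ~0.86244
```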
We focused on Offline and Server scenarios. The Offline scenario processes queries in a batch where all the input data is immediately available. The latency is not a critical metric in this scenario. In the Server scenario, the query arrival is random. Each query has an arrival rate determined by the Poisson distribution parameter. Each query has only one sample and, in this case, the latency for serving a query is a critical metric.
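To illustrate the difference between the two scenarios, here is a small, self-contained simulation of a Server-scenario query stream. This is our own sketch, not the MLPerf LoadGen: the target query rate and per-query service time are arbitrary illustrative values, and only the 130 ms bound is taken from the BERT row of Table 1.

```python
import random
import statistics

def simulate_server_scenario(target_qps: float, service_time_s: float,
                             num_queries: int = 10_000,
                             latency_bound_s: float = 0.130) -> None:
    """Toy Server-scenario model: single-sample queries arrive as a Poisson
    process (exponential inter-arrival times) and are served one at a time."""
    random.seed(0)
    now = 0.0              # simulated wall clock
    server_free_at = 0.0   # when the (single) accelerator is next idle
    latencies = []
    for _ in range(num_queries):
        now += random.expovariate(target_qps)   # Poisson arrival process
        start = max(now, server_free_at)        # queue if the device is busy
        finish = start + service_time_s
        server_free_at = finish
        latencies.append(finish - now)          # per-query latency
    p99 = statistics.quantiles(latencies, n=100)[98]
    verdict = "meets" if p99 <= latency_bound_s else "violates"
    print(f"p99 latency = {p99 * 1000:.1f} ms ({verdict} the "
          f"{latency_bound_s * 1000:.0f} ms constraint)")

# The Offline scenario, by contrast, issues all samples in one batch and
# reports only throughput; latency is not a pass/fail criterion there.
simulate_server_scenario(target_qps=100.0, service_time_s=0.005)
```

In the actual benchmark, submitters tune the offered query rate so that the latency constraint is still met, and the highest rate that passes is reported as the Server-scenario throughput.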
Hardware/Software Configurations for Virtualized NVIDIA H100 and NVIDIA A100 GPUs
Table 2 shows the hardware configurations used to run the workloads on the bare metal and virtualized systems featuring the virtualized H100 GPU card. The most salient difference between the configurations is that the virtual configuration used a virtualized H100 GPU, denoted by GRID H100-80c vGPU. Note that the H100-80c vGPU profile is in time-sliced mode. Both systems had the same 2x H100-PCIE-80GB physical GPUs. The benchmarks were optimized with NVIDIA TensorRT.
Component | Bare Metal | Virtual Configuration |
---|---|---|
System | Dell PowerEdge R750xa | Dell PowerEdge R750xa |
Processors | 2x Intel Xeon Platinum 8358 | 2x Intel Xeon Platinum 8358 |
Logical Processors | 128 | 16 allocated to the VM (112 available for other VMs/workloads) |
GPU | 2x NVIDIA H100-PCIE-80GB | 2x NVIDIA GRID H100-80c vGPU |
Memory | 256GB | 128GB |
Storage | 3.0TB NVMe SSD | 3.0TB NVMe SSD |
OS | Ubuntu 20.04 | Ubuntu 20.04 VM in vSphere 8.0.1 |
NVIDIA AIE VIB for ESXi | – | vGPU GRID Driver 525.85.07 |
CUDA | 12 | 12 |
TensorRT | 8.6.0 | 8.6.0 |
MLPerf Inference | v3.0 | v3.0 |
Table 2. Bare metal vs. virtual server configurations for virtualized H100
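As a quick guest-side sanity check that the VM sees the expected vGPU devices, the following is a minimal sketch using NVIDIA's pynvml bindings. This is our own illustration, not part of the MLPerf harness, and it assumes the vGPU guest driver and the nvidia-ml-py package are installed in the VM.

```python
# Confirm the guest VM sees the expected vGPU devices and frame buffer sizes.
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    print(f"GPUs visible in the VM: {count}")   # expect 2 for the H100-80c config
    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):              # older bindings return bytes
            name = name.decode()
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"  GPU {i}: {name}, {mem.total / 2**30:.0f} GiB frame buffer")
finally:
    pynvml.nvmlShutdown()
```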
Table 3 describes the hardware configurations used for the bare metal and virtual runs with the virtualized A100. The most salient difference between the configurations is that the virtual configuration used a virtualized A100 GPU, denoted by GRID A100-80c vGPU. Note that the A100-80c vGPU profile is in time-sliced mode. Both systems had the same 4x A100-SXM-80GB physical GPUs. The benchmarks were optimized with NVIDIA TensorRT.
Component | Bare Metal | Virtual Configuration |
---|---|---|
System | Dell PowerEdge XE8545 | Dell PowerEdge XE8545 |
Processors | 2x AMD EPYC 7543 | 2x AMD EPYC 7543 |
Logical Processors | 128 | 16 allocated to the VM (112 available for other VMs/workloads) |
GPU | 4x NVIDIA A100-SXM-80GB | 4x NVIDIA GRID A100-80c vGPU |
Memory | 1 TB | 128GB |
Storage | 3.0TB NVMe SSD | 3.0TB NVMe SSD |
OS | Ubuntu 20.04 | Ubuntu 20.04 VM in vSphere 8.0.1 |
NVIDIA AIE VIB for ESXi | – | vGPU GRID Driver 525.85.07 |
CUDA | 12 | 12 |
TensorRT | 8.6.0 | 8.6.0 |
MLPerf Inference | v3.0 | v3.0 |
Table 3. Bare metal vs. virtual server configurations for virtualized A100
MLPerf Inference Performance Results for Bare Metal and Virtual Configurations
Figures 1 and 2 compare the throughput (queries processed per second) of MLPerf Inference benchmark workloads using vSphere 8.0.1 with NVIDIA vGPU H100-80c against the bare metal H100 GPU configuration. The bare metal baseline is set to 1.000, and the virtualized result is presented as a relative percentage of the baseline. vSphere with NVIDIA vGPUs delivers near bare metal performance ranging from 94.4% to 105% for Offline and Server scenarios when using the MLPerf Inference benchmarks.
Figure 1. Normalized throughput for Offline scenario (qps): vGPU 2x H100-80c vs. bare metal 2x H100
Figure 2. Normalized throughput for Server scenario (qps): vGPU 2x H100-80c vs. bare metal 2x H100
Table 4 shows the throughput in queries per second for the MLPerf Inference benchmarks on the H100 configurations.
Benchmark | Bare Metal 2x H100 | vGPU 2x H100-80c | vGPU/BM |
---|---|---|---|
RetinaNet Server | 1852.24 | 1772.81 | 0.96 |
RetinaNet Offline | 1892.09 | 1800.60 | 0.95 |
3d-UNET-99 Offline | 9.05 | 8.76 | 0.97 |
3d-UNET-99.9 Offline | 9.05 | 8.76 | 0.97 |
RNNT Server | 32004.00 | 31131.20 | 0.97 |
RNNT Offline | 33741.00 | 32771.40 | 0.97 |
Table 4. vGPU 2x H100-80c vs. bare metal 2x H100 throughput (queries per second)
The above results were published by MLCommons in the closed division under submitter ID 3.0-0017.
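The vGPU/BM column above, and the normalization behind Figures 1 and 2, is simply the virtualized throughput divided by the bare metal throughput (baseline = 1.000). The following is a minimal sketch that reproduces the ratios from Table 4:

```python
# Recompute the vGPU/BM ratios in Table 4: virtual throughput / bare metal throughput.
results = {
    # benchmark:            (bare metal 2x H100, vGPU 2x H100-80c) in queries/second
    "RetinaNet Server":      (1852.24, 1772.81),
    "RetinaNet Offline":     (1892.09, 1800.60),
    "3d-UNET-99 Offline":    (9.05, 8.76),
    "3d-UNET-99.9 Offline":  (9.05, 8.76),
    "RNNT Server":           (32004.00, 31131.20),
    "RNNT Offline":          (33741.00, 32771.40),
}
for benchmark, (bare_metal, vgpu) in results.items():
    print(f"{benchmark:22s} vGPU/BM = {vgpu / bare_metal:.2f}")
```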
Figures 3 and 4 compare the throughput (queries processed per second) of the MLPerf Inference benchmarks using vSphere 8.0.1 with NVIDIA vGPU A100-80c against the bare metal A100 GPU configuration. The bare metal baseline is set to 1.000, and the virtualized result is presented as a relative percentage of the baseline. vSphere with NVIDIA vGPUs delivers near bare metal performance ranging from 94.4% to 105% for the Offline and Server scenarios of the MLPerf Inference benchmarks.
Figure 3. Normalized throughput for Offline scenario (qps): vGPU 4x A100-80c vs. bare metal 4x A100-80GB
Figure 4. Normalized throughput for Server scenario (qps): vGPU 4x A100-80c vs. bare metal 4x A100
Table 5 shows the throughput in queries per second for the MLPerf Inference benchmarks on the A100 configurations.
Benchmark | Bare Metal 4x A100 | vGPU 4x A100-80c | vGPU/BM |
---|---|---|---|
BERT-99 Server | 13597.00 | 13497.90 | 0.99 |
BERT-99 Offline | 15090.00 | 14923.10 | 0.99 |
BERT-99.9 Server | 7004.00 | 7004.02 | 1.00 |
BERT-99.9 Offline | 7880.00 | 7767.84 | 0.99 |
RetinaNet Server | 2848.84 | 2798.93 | 0.98 |
RetinaNet Offline | 2910.78 | 2876.56 | 0.99 |
RNNT Server | 54000.40 | 51001.80 | 0.94 |
RNNT Offline | 57084.00 | 56174.00 | 0.98 |
3d-UNET-99 Offline | 14.44 | 15.10 | 1.05 |
3d-UNET-99.9 Offline | 14.44 | 15.10 | 1.05 |
Table 5. vGPU 4x A100-80c vs. bare metal 4x A100-80GB throughput (queries per second)
The above results were published by MLCommons in the closed division under submitter ID 3.0-0018.
Takeaways
- VMware vSphere with NVIDIA AI Enterprise, using NVIDIA vGPUs and NVIDIA AI software, delivers from 94% to 105% of bare metal performance for the MLPerf Inference v3.0 benchmarks.
- VMware achieved this performance with only 16 logical CPU cores out of 128 available CPU cores, leaving 112 logical CPU cores for other jobs in the datacenter. This is the extraordinary power of virtualization!
- VMware vSphere combines the power of NVIDIA vGPUs and NVIDIA AI software with the datacenter management benefits of virtualization.
Acknowledgements
VMware thanks Liz Raymond and Yunfan Han of Dell; and Charlie Huang, Manvendar Rawat, and Jason Kyungho Lee of NVIDIA for providing the hardware and software for VMware’s MLPerf Inference submission. The authors would like to acknowledge Juan Garcia-Rovetta and Tony Lin of VMware for their management support.
References
- MLCommons: https://mlcommons.org/
- MLCommons April 05, 2023 – Inference: Datacenter v3.0 Results: https://mlcommons.org/en/inference-datacenter-30/
- MLCommons September 22, 2021 – Inference: Datacenter v1.1 Results: https://mlcommons.org/en/inference-datacenter-11/
- NVIDIA Ampere Architecture: https://www.nvidia.com/en-us/data-center/ampere-architecture/
- NVIDIA Hopper Architecture In-Depth: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth
- NVIDIA Ampere Architecture In-Depth: https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth
- NVIDIA Docs Hub: https://docs.nvidia.com/ai-enterprise/latest/user-guide/index.html#supported-gpus-grid-vgpu
- MIG or vGPU Mode for NVIDIA Ampere GPU: Which One Should I Use? (Part 1 of 3): https://blogs.vmware.com/performance/2021/09/mig-or-vgpu-part1.html
- Introduction to MLPerf™ Inference v1.1 with Dell EMC Servers: https://infohub.delltechnologies.com/p/introduction-to-mlperf-tm-inference-v1-1-with-dell-emc-servers
- MLPerf Inference Virtualization in VMware vSphere Using NVIDIA vGPUs: https://blogs.vmware.com/performance/2020/12/mlperf-inference-virtualization-in-vmware-vsphere-using-nvidia-vgpus.html
- NVIDIA T4: https://www.nvidia.com/en-us/data-center/tesla-t4/
- NVIDIA Triton: https://developer.nvidia.com/nvidia-triton-inference-server
- NVIDIA TensorRT: https://developer.nvidia.com/tensorrt
- NVIDIA Turing GPU Architecture: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
- V. J. Reddi et al., “MLPerf Inference Benchmark,” 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 2020, pp. 446-459, doi: 10.1109/ISCA45697.2020.00045.