
No Virtualization Tax for MLPerf Inference 3.0 Using NVIDIA Hopper and Ampere vGPUs and NVIDIA AI Software with vSphere 8.0 U1

By Uday Kurkure, Lan Vu, and Hari Sivaraman

Twenty-five years ago, VMware virtualized x86-based CPUs, and it has been a leader in virtualization technologies ever since. VMware is repeating that feat in collaboration with NVIDIA and Dell, this time virtualizing accelerators for machine learning. We are announcing near bare metal, and in some cases better than bare metal, performance for the MLPerf™ Inference v3.0 benchmarks.

Now you can run ML workloads in VMware vSphere with virtualized NVIDIA GPUs, combining the power of both for managing your data center. VMware vSphere is the first and only virtualization platform used in MLPerf Inference submissions, with publications for v0.7, v1.1, and v3.0.

Demonstrating the power of virtualization, VMware, Dell, and NVIDIA achieved from 94% to 105% of the equivalent bare metal performance with the following configurations:

  • Dell PowerEdge XE8545 server with 4x virtualized NVIDIA SXM A100-80GB GPU cards 
  • Dell PowerEdge R750xa with 2x virtualized NVIDIA H100-PCIE-80GB GPU cards 

Both setups used only 16 logical CPU cores out of 128, which means the remaining 112 logical CPU cores are available for additional demanding tasks in the customer datacenter.  

This demonstrates the extraordinary power of VMware virtualization solutions, which deliver near bare metal performance plus the datacenter management benefits of virtualization, without any “virtualization tax.”

Note: A paper related to this topic is also available; refer to VMware vSphere 8 Performance Is in the “Goldilocks Zone” for AI/ML Training and Inference Workloads.

VMware and NVIDIA AI Enterprise

The partnership between VMware and NVIDIA brings virtualized GPUs to vSphere with NVIDIA AI Enterprise. This lets datacenter operators use the many benefits of VMware vSphere virtualization, such as cloning, vMotion, distributed resource scheduling, and suspending and resuming VMs, along with NVIDIA vGPU technology. 

In this blog, we show the MLPerf Inference v3.0 test results for the vSphere virtualization platform with NVIDIA H100 and A100-based vGPUs. Our tests show that when NVIDIA vGPUs are used in vSphere, the workload performance is the same as or better than it is when run on a bare metal system. 

MLPerf Inference Performance in vSphere with NVIDIA vGPU

VMware used the MLPerf Inference v3.0 suite to test the datacenter applications shown in Table 1 below. MLCommons published the official results for these benchmarks.

ML/AI workloads are becoming pervasive in today’s datacenters and cover many domains. To show the flexibility of vSphere virtualization in disparate environments, we chose different types of workloads: natural language processing, represented by BERT; object detection, represented by RetinaNet; medical imaging, represented by 3D U-Net; and speech, represented by RNNT. 

| Area | Task | Model | Dataset | QSL Size | Quality | Scenarios | Server Latency Constraint |
|------|------|-------|---------|----------|---------|-----------|---------------------------|
| Vision | Object detection | RetinaNet | OpenImages (800×800) | 64 | 99% of FP32 (0.20 mAP) | Server, Offline | 100 ms |
| Vision | Medical image segmentation | 3D U-Net | KiTS 2019 (602×512×512) | 16 | 99% and 99.9% of FP32 (0.86330 mean DICE score) | Offline | N/A |
| Speech | Speech-to-text | RNNT | LibriSpeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 – WER, where WER = 7.452253714852645%) | Server, Offline | 1000 ms |
| Language | Language processing | BERT-large | SQuAD v1.1 (max_seq_len=384) | 10833 | 99% and 99.9% of FP32 (f1_score = 90.874%) | Server, Offline | 130 ms |

Table 1. The MLPerf Inference benchmarks used in our performance study 
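
The quality targets in Table 1 are fractions of the FP32 reference accuracy. As a quick sketch of the arithmetic, the RNNT threshold follows from the WER value in the table:

```python
# Quality thresholds in Table 1 are fractions of the FP32 reference accuracy.
fp32_wer = 0.07452253714852645     # RNNT reference word error rate from Table 1
fp32_accuracy = 1.0 - fp32_wer     # the "1 - WER" metric
threshold = 0.99 * fp32_accuracy   # a submission must reach 99% of FP32 accuracy
print(f"RNNT must achieve accuracy >= {threshold:.5f}")  # ~0.91622
```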

We focused on the Offline and Server scenarios. The Offline scenario processes queries in a single batch where all the input data is immediately available, so latency is not a critical metric. In the Server scenario, queries arrive at random, with arrival times drawn from a Poisson distribution at a target rate. Each query contains a single sample, and the latency of serving each query is a critical metric.
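
To make the two scenarios concrete, here is a minimal sketch of how a harness drives them through MLPerf’s LoadGen Python bindings (built from the MLPerf inference repository). This is not VMware’s actual submission harness: the instant-reply SUT and the sample counts below are placeholders.

```python
# Minimal sketch of driving MLPerf LoadGen in Offline vs. Server mode.
# Assumes the mlperf_loadgen Python bindings from the MLPerf inference repo.
import mlperf_loadgen as lg

def issue_queries(query_samples):
    # A real SUT would run inference here; this placeholder replies
    # immediately with empty response data.
    lg.QuerySamplesComplete(
        [lg.QuerySampleResponse(s.id, 0, 0) for s in query_samples])

def flush_queries():
    pass  # Called when LoadGen has no more queries outstanding.

def load_samples(sample_indices):
    pass  # A real QSL would stage input data in memory here.

def unload_samples(sample_indices):
    pass

settings = lg.TestSettings()
settings.mode = lg.TestMode.PerformanceOnly
# Offline: all samples are issued at once, and throughput is the metric.
settings.scenario = lg.TestScenario.Offline
# Server instead draws query arrivals from a Poisson process at a target
# rate, and per-query latency must stay under the benchmark's constraint:
#   settings.scenario = lg.TestScenario.Server
#   settings.server_target_qps = 1000  # placeholder rate

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(1024, 1024, load_samples, unload_samples)  # placeholder counts
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```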

Hardware/Software Configurations for Virtualized NVIDIA H100 and NVIDIA A100 GPUs

Table 2 shows the hardware configurations used to run the workloads on the bare metal and virtualized systems featuring the H100 GPU card. The most salient difference between the configurations is that the virtual configuration used a virtualized H100 GPU, denoted by GRID H100-80c vGPU. Note that the H100-80c vGPU profile is for time-sliced mode. Both systems had the same 2x H100-PCIE-80GB physical GPUs. The benchmarks were optimized with NVIDIA TensorRT.
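
As a hedged illustration of that optimization step, the sketch below compiles an ONNX model into an FP16 TensorRT engine with the TensorRT 8.6 Python API. The model path is a placeholder, and the actual MLPerf submission uses NVIDIA’s tuned networks and plugins rather than this plain ONNX parse.

```python
# Minimal sketch: compile an ONNX model into an FP16 TensorRT engine.
# Assumes the TensorRT 8.6 Python bindings; "model.onnx" is a placeholder path.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # FP16 precision for higher throughput
    # Returns a serialized engine; save it and deserialize at inference time.
    return builder.build_serialized_network(network, config)

with open("model.plan", "wb") as f:
    f.write(build_engine("model.onnx"))
```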

| Component | Bare Metal | Virtual Configuration |
|-----------|------------|-----------------------|
| System | Dell PowerEdge R750xa | Dell PowerEdge R750xa |
| Processors | 2x Intel Xeon Platinum 8358 | 2x Intel Xeon Platinum 8358 |
| Logical Processors | 128 | 16 allocated to the VM (112 available for other VMs/workloads) |
| GPU | 2x NVIDIA H100-PCIE-80GB | 2x NVIDIA GRID H100-80c vGPU |
| Memory | 256GB | 128GB |
| Storage | 3.0TB NVMe SSD | 3.0TB NVMe SSD |
| OS | Ubuntu 20.04 | Ubuntu 20.04 VM in vSphere 8.0.1 |
| NVIDIA AIE VIB for ESXi | N/A | vGPU GRID Driver 525.85.07 |
| CUDA | 12 | 12 |
| TensorRT | 8.6.0 | 8.6.0 |
| MLPerf Inference | v3.0 | v3.0 |

Table 2.  Bare metal vs. virtual server configurations for virtualized H100 

Table 3 describes the hardware configurations used for the bare metal and virtual runs featuring the A100 GPU card. The most salient difference between the configurations is that the virtual configuration used a virtualized A100 GPU, denoted by GRID A100-80c vGPU. Note that the A100-80c vGPU profile is for time-sliced mode. Both systems had the same 4x A100-SXM-80GB physical GPUs. The benchmarks were optimized with NVIDIA TensorRT.

| Component | Bare Metal | Virtual Configuration |
|-----------|------------|-----------------------|
| System | Dell PowerEdge XE8545 | Dell PowerEdge XE8545 |
| Processors | 2x AMD EPYC 7543 | 2x AMD EPYC 7543 |
| Logical Processors | 128 | 16 allocated to the VM (112 available for other VMs/workloads) |
| GPU | 4x NVIDIA A100-SXM-80GB | 4x NVIDIA GRID A100-80c vGPU |
| Memory | 1TB | 128GB |
| Storage | 3.0TB NVMe SSD | 3.0TB NVMe SSD |
| OS | Ubuntu 20.04 | Ubuntu 20.04 VM in vSphere 8.0.1 |
| NVIDIA AIE VIB for ESXi | N/A | vGPU GRID Driver 525.85.07 |
| CUDA | 12 | 12 |
| TensorRT | 8.6.0 | 8.6.0 |
| MLPerf Inference | v3.0 | v3.0 |

Table 3.  Bare metal vs. virtual server configurations for virtualized A100 

MLPerf Inference Performance Results for Bare Metal and Virtual Configurations

Figures 1 and 2 compare the throughput (queries processed per second) of the MLPerf Inference benchmark workloads using vSphere 8.0.1 with NVIDIA vGPU H100-80c against the bare metal H100 GPU configuration. The bare metal baseline is set to 1.000, and each virtualized result is presented as a fraction of that baseline. vSphere with NVIDIA vGPUs delivers near bare metal performance, ranging from 95% to 97% across the Offline and Server scenarios of the MLPerf Inference benchmarks.
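
The normalization in these figures is straightforward arithmetic: each virtual result is divided by its bare metal counterpart. A quick sketch, using the RetinaNet rows from Table 4 below as input:

```python
# Normalize virtual throughput against the bare metal baseline (= 1.000),
# as in Figures 1-4. The numbers are the RetinaNet rows from Table 4.
results = {
    "RetinaNet Server":  (1852.24, 1772.81),  # (bare metal qps, vGPU qps)
    "RetinaNet Offline": (1892.09, 1800.60),
}
for name, (bare_metal, vgpu) in results.items():
    ratio = vgpu / bare_metal
    print(f"{name}: {ratio:.2f} ({ratio:.1%} of bare metal)")
```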


Figure 1. Normalized throughput for Offline scenario (qps): vGPU 2x H100-80c vs. bare metal 2x H100 

Figure 2. Normalized throughput for Server scenario (qps): vGPU 2x H100-80c vs. bare metal 2x H100 

Table 4 shows throughput numbers in queries per second for MLPerf Inference benchmarks. 

| Benchmark | Bare Metal 2x H100 | vGPU 2x H100-80c | vGPU/BM |
|-----------|--------------------|------------------|---------|
| RetinaNet Server | 1852.24 | 1772.81 | 0.96 |
| RetinaNet Offline | 1892.09 | 1800.60 | 0.95 |
| 3D-UNet-99 Offline | 9.05 | 8.76 | 0.97 |
| 3D-UNet-99.9 Offline | 9.05 | 8.76 | 0.97 |
| RNNT Server | 32004.00 | 31131.20 | 0.97 |
| RNNT Offline | 33741.00 | 32771.40 | 0.97 |

Table 4. vGPU 2x H100-80c vs. bare metal 2x H100 throughput (queries per second)

The above results are published by MLCommons in the closed division under submitter ID 3.0-0017.

Figures 3 and 4 compare the throughput (queries processed per second) of the MLPerf Inference benchmarks using vSphere 8.0.1 with NVIDIA vGPU A100-80c against the bare metal A100 GPU configuration. The bare metal baseline is set to 1.000, and each virtualized result is presented as a fraction of that baseline. vSphere with NVIDIA vGPUs delivers near bare metal performance, ranging from 94% to 105% across the Offline and Server scenarios of the MLPerf Inference benchmarks.

Figure 3. Normalized throughput for Offline scenario (qps): vGPU 4x A100-80c vs. bare metal 4x A100

Figure 4. Normalized throughput for Server scenario (qps): vGPU 4x A100-80c vs. bare metal 4x A100

Table 5 shows throughput numbers for MLPerf Inference benchmarks.   

| Benchmark | Bare Metal 4x A100 | vGPU 4x A100-80c | vGPU/BM |
|-----------|--------------------|------------------|---------|
| BERT Server | 13597.00 | 13497.90 | 0.99 |
| BERT Offline | 15090.00 | 14923.10 | 0.99 |
| BERT High Accuracy Server | 7004.00 | 7004.02 | 1.00 |
| BERT High Accuracy Offline | 7880.00 | 7767.84 | 0.99 |
| RetinaNet Server | 2848.84 | 2798.93 | 0.98 |
| RetinaNet Offline | 2910.78 | 2876.56 | 0.99 |
| RNNT Server | 54000.40 | 51001.80 | 0.94 |
| RNNT Offline | 57084.00 | 56174.00 | 0.98 |
| 3D-UNet-99 Offline | 14.44 | 15.10 | 1.05 |
| 3D-UNet-99.9 Offline | 14.44 | 15.10 | 1.05 |

Table 5. vGPU 4x A100-80c vs. bare metal 4x A100-80GB throughput (queries per second)

The above results are published by MLCommons in the closed division under submitter ID 3.0-0018.

Takeaways

  • VMware vSphere with NVIDIA AI Enterprise, using NVIDIA vGPUs and NVIDIA AI software, delivers from 94% to 105% of bare metal performance for the MLPerf Inference v3.0 benchmarks. 
  • VMware achieved this performance using only 16 of the 128 available logical CPU cores, leaving 112 logical CPU cores free for other jobs in the datacenter. This is the extraordinary power of virtualization! 
  • VMware vSphere combines the power of NVIDIA vGPUs and NVIDIA AI software with the datacenter management benefits of virtualization. 

Acknowledgements

VMware thanks Liz Raymond and Yunfan Han of Dell, and Charlie Huang, Manvendar Rawat, and Jason Kyungho Lee of NVIDIA for providing the hardware and software for VMware’s MLPerf Inference submission. The authors would like to acknowledge Juan Garcia-Rovetta and Tony Lin of VMware for their management support.

References