By Uday Kurkure, Lan Vu, and Hari Sivaraman
Twenty-five years ago, VMware virtualized x86-based CPUs, and it has been a leader in virtualization technologies ever since. VMware is now repeating that feat in collaboration with NVIDIA and Dell by virtualizing accelerators for machine learning. We are announcing near bare metal, and in some cases better than bare metal, performance for the MLPerf™ Inference v3.0 benchmarks.
Now you can run ML workloads in VMware vSphere with virtualized NVIDIA GPUs and combine the power of both to manage your datacenter. VMware vSphere is the first and only virtualization platform used in MLPerf Inference submissions for the v0.7, v1.1, and v3.0 publications.
Demonstrating the power of virtualization, VMware, Dell, and NVIDIA achieved from 94% to 105% of the equivalent bare metal performance with the following configurations:
- Dell PowerEdge XE8545 server with 4x virtualized NVIDIA SXM A100-80GB GPU cards
- Dell PowerEdge R750xa with 2x virtualized NVIDIA H100-PCIE-80GB GPU cards
Both setups used only 16 of the 128 logical CPU cores, leaving the remaining 112 logical CPU cores available for other demanding tasks in the customer datacenter.
This demonstrates the extraordinary power of VMware virtualization solutions, which deliver near bare metal performance and the datacenter-management benefits of virtualization without any “virtualization tax.”
Note: A paper related to this topic is also available; refer to VMware vSphere 8 Performance Is in the “Goldilocks Zone” for AI/ML Training and Inference Workloads.
VMware and NVIDIA AI Enterprise
The partnership between VMware and NVIDIA brings virtualized GPUs to vSphere with NVIDIA AI Enterprise. This lets datacenter operators use the many benefits of VMware vSphere virtualization, such as cloning, vMotion, distributed resource scheduling, and suspending and resuming VMs, along with NVIDIA vGPU technology.
In this blog, we show the MLPerf Inference v3.0 test results for the vSphere virtualization platform with NVIDIA H100 and A100-based vGPUs. Our tests show that when NVIDIA vGPUs are used in vSphere, the workload performance is the same as or better than it is when run on a bare metal system.
MLPerf Inference Performance in vSphere with NVIDIA vGPU
VMware used the MLPerf Inference v3.0 suite to test the datacenter benchmarks shown in Table 1 below. MLCommons published the official results for these benchmarks.
ML/AI workloads are becoming pervasive in today’s datacenters and cover many domains. To show the flexibility of vSphere virtualization in disparate environments, we chose different types of workloads: natural language processing, represented by BERT; object detection, represented by RetinaNet; medical imaging, represented by 3D U-Net; and speech, represented by RNNT.
Area | Task | Model | Dataset | QSL Size | Quality | Scenarios | Server Latency Constraint |
---|---|---|---|---|---|---|---|
Vision | Object detection | RetinaNet | OpenImages (800×800) | 64 | 99% of FP32 (0.20 mAP) | Server, Offline | 100 ms |
Vision | Medical image segmentation | 3D U-Net | KiTS 2019 (602x512x512) | 16 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) | Offline | N/A |
Speech | Speech-to-text | RNNT | LibriSpeech dev-clean (samples < 15 seconds) | 2513 | 99% of FP32 (1 – WER, where WER=7.452253714852645%) | Server, Offline | 1000 ms |
Language | Language processing | BERT-large | SQuAD v1.1 (max_seq_len=384) | 10833 | 99% of FP32 and 99.9% of FP32 (f1_score=90.874%) | Server, Offline | 130 ms |
Table 1. The MLPerf Inference benchmarks used in our performance study
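To make the Quality column concrete: a valid result must reach a fixed fraction (99% or 99.9%) of the accuracy of the FP32 reference model. The following is a minimal sketch of that arithmetic using the reference values from Table 1; the variable names are ours, purely for illustration.

```python
# Accuracy targets implied by Table 1: each benchmark must reach a fixed
# fraction (99% or 99.9%) of the FP32 reference model's accuracy.

# BERT-large: 99% / 99.9% of the FP32 SQuAD v1.1 F1 score (90.874).
bert_fp32_f1 = 90.874
print(f"BERT 99% target F1:   {0.99 * bert_fp32_f1:.3f}")    # ~89.965
print(f"BERT 99.9% target F1: {0.999 * bert_fp32_f1:.3f}")   # ~90.783

# RNNT: 99% of the FP32 word accuracy (1 - WER), with WER = 7.4522...%.
rnnt_fp32_wer = 0.07452253714852645
print(f"RNNT 99% target accuracy: {0.99 * (1.0 - rnnt_fp32_wer):.4f}")  # ~0.9162

# 3D U-Net: 99% / 99.9% of the FP32 mean DICE score (0.86330).
unet_fp32_dice = 0.86330
print(f"3D U-Net 99% target DICE:   {0.99 * unet_fp32_dice:.5f}")   # ~0.85467
print(f"3D U-Net 99.9% target DICE: {0.999 * unet_fp32_dice:.5f}")  # ~0.86244
```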
We focused on Offline and Server scenarios. The Offline scenario processes queries in a batch where all the input data is immediately available. The latency is not a critical metric in this scenario. In the Server scenario, the query arrival is random. Each query has an arrival rate determined by the Poisson distribution parameter. Each query has only one sample and, in this case, the latency for serving a query is a critical metric.
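To illustrate the difference between the two scenarios, here is a small, self-contained simulation of a Server-scenario query stream. This is our own sketch, not the MLPerf LoadGen: the target query rate and per-query service time are arbitrary illustrative values, and only the 130 ms bound is taken from the BERT row of Table 1.

```python
import random
import statistics

def simulate_server_scenario(target_qps: float, service_time_s: float,
                             num_queries: int = 10_000,
                             latency_bound_s: float = 0.130) -> None:
    """Toy Server-scenario model: single-sample queries arrive as a Poisson
    process (exponential inter-arrival times) and are served one at a time."""
    random.seed(0)
    now = 0.0              # simulated wall clock
    server_free_at = 0.0   # when the (single) accelerator is next idle
    latencies = []
    for _ in range(num_queries):
        now += random.expovariate(target_qps)   # Poisson arrival process
        start = max(now, server_free_at)        # queue if the device is busy
        finish = start + service_time_s
        server_free_at = finish
        latencies.append(finish - now)          # per-query latency
    p99 = statistics.quantiles(latencies, n=100)[98]
    verdict = "meets" if p99 <= latency_bound_s else "violates"
    print(f"p99 latency = {p99 * 1000:.1f} ms ({verdict} the "
          f"{latency_bound_s * 1000:.0f} ms constraint)")

# The Offline scenario, by contrast, issues all samples in one batch and
# reports only throughput; latency is not a pass/fail criterion there.
simulate_server_scenario(target_qps=100.0, service_time_s=0.005)
```

In the actual benchmark, submitters tune the offered query rate so that the latency constraint is still met, and the highest rate that passes is reported as the Server-scenario throughput.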
Hardware/Software Configurations for Virtualized NVIDIA H100 and NVIDIA A100 GPUs
Table 2 shows the hardware configurations used to run the workloads on the bare metal and virtualized systems featuring the virtualized H100 GPU card. The most salient difference between the configurations is that the virtual configuration used a virtualized H100 GPU, denoted by GRID H100-80c vGPU. Note that the H100-80c vGPU profile is in time-sliced mode. Both systems had the same 2x H100-PCIE-80GB physical GPUs. The benchmarks were optimized with NVIDIA TensorRT.
Component | Bare Metal | Virtual Configuration |
---|---|---|
System | Dell PowerEdge R750xa | Dell PowerEdge R750xa |
Processors | 2x Intel Xeon Platinum 8358 | 2x Intel Xeon Platinum 8358 |
Logical Processors | 128 | 16 allocated to the VM (112 available for other VMs/workloads) |
GPU | 2x NVIDIA H100-PCIE-80GB | 2x NVIDIA GRID H100-80c vGPU |
Memory | 256GB | 128GB |
Storage | 3.0TB NVMe SSD | 3.0TB NVMe SSD |
OS | Ubuntu 20.04 | Ubuntu 20.04 VM in vSphere 8.0.1 |
NVIDIA AIE VIB for ESXi | – | vGPU GRID Driver 525.85.07 |
CUDA | 12 | 12 |
TensorRT | 8.6.0 | 8.6.0 |
MLPerf Inference | v3.0 | v3.0 |
Table 2. Bare metal vs. virtual server configurations for virtualized H100
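As a quick guest-side sanity check that the VM sees the expected vGPU devices, the following is a minimal sketch using NVIDIA's pynvml bindings. This is our own illustration, not part of the MLPerf harness, and it assumes the vGPU guest driver and the nvidia-ml-py package are installed in the VM.

```python
# Confirm the guest VM sees the expected vGPU devices and frame buffer sizes.
import pynvml

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    print(f"GPUs visible in the VM: {count}")   # expect 2 for the H100-80c config
    for i in range(count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):              # older bindings return bytes
            name = name.decode()
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"  GPU {i}: {name}, {mem.total / 2**30:.0f} GiB frame buffer")
finally:
    pynvml.nvmlShutdown()
```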
Table 3 describes the hardware configurations used for the bare metal and virtual runs with the virtualized A100. The most salient difference between the configurations is that the virtual configuration used a virtualized A100 GPU, denoted by GRID A100-80c vGPU. Note that the A100-80c vGPU profile is in time-sliced mode. Both systems had the same 4x A100-SXM-80GB physical GPUs. The benchmarks were optimized with NVIDIA TensorRT.
Component | Bare Metal | Virtual Configuration |
---|---|---|
System | Dell PowerEdge XE8545 | Dell PowerEdge XE8545 |
Processors | 2x AMD EPYC 7543 | 2x AMD EPYC 7543 |
Logical Processors | 128 | 16 allocated to the VM (112 available for other VMs/workloads) |
GPU | 4x NVIDIA A100-SXM-80GB | 4x NVIDIA GRID A100-80c vGPU |
Memory | 1 TB | 128GB |
Storage | 3.0TB NVMe SSD | 3.0TB NVMe SSD |
OS | Ubuntu 20.04 | Ubuntu 20.04 VM in vSphere 8.0.1 |
NVIDIA AIE VIB for ESXi | – | vGPU GRID Driver 525.85.07 |
CUDA | 12 | 12 |
TensorRT | 8.6.0 | 8.6.0 |
MLPerf Inference | v3.0 | v3.0 |
Table 3. Bare metal vs. virtual server configurations for virtualized A100
MLPerf Inference Performance Results for Bare Metal and Virtual Configurations
Figures 1 and 2 compare the throughput (queries processed per second) of MLPerf Inference benchmark workloads using vSphere 8.0.1 with NVIDIA vGPU H100-80c against the bare metal H100 GPU configuration. The bare metal baseline is set to 1.000, and the virtualized result is presented as a relative percentage of the baseline. vSphere with NVIDIA vGPUs delivers near bare metal performance ranging from 94.4% to 105% for Offline and Server scenarios when using the MLPerf Inference benchmarks.
Figure 1. Normalized throughput for Offline scenario (qps): vGPU 2x H100-80c vs. bare metal 2x H100
Figure 2. Normalized throughput for Server scenario (qps): vGPU 2x H100-80c vs. bare metal 2x H100
Table 4 shows the throughput in queries per second for the MLPerf Inference benchmarks on the H100 configurations.
Benchmark | Bare Metal 2x H100 | vGPU 2x H100-80c | vGPU/BM |
---|---|---|---|
RetinaNet Server | 1852.24 | 1772.81 | 0.96 |
RetinaNet Offline | 1892.09 | 1800.60 | 0.95 |
3d-UNET-99 Offline | 9.05 | 8.76 | 0.97 |
3d-UNET-99.9 Offline | 9.05 | 8.76 | 0.97 |
RNNT Server | 32004.00 | 31131.20 | 0.97 |
RNNT Offline | 33741.00 | 32771.40 | 0.97 |
Table 4. vGPU 2x H100-80c vs. bare metal 2x H100 throughput (queries per second)
The above results were published by MLCommons in the closed division under submitter ID 3.0-0017.
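The vGPU/BM column above, and the normalization behind Figures 1 and 2, is simply the virtualized throughput divided by the bare metal throughput (baseline = 1.000). The following is a minimal sketch that reproduces the ratios from Table 4:

```python
# Recompute the vGPU/BM ratios in Table 4: virtual throughput / bare metal throughput.
results = {
    # benchmark:            (bare metal 2x H100, vGPU 2x H100-80c) in queries/second
    "RetinaNet Server":      (1852.24, 1772.81),
    "RetinaNet Offline":     (1892.09, 1800.60),
    "3d-UNET-99 Offline":    (9.05, 8.76),
    "3d-UNET-99.9 Offline":  (9.05, 8.76),
    "RNNT Server":           (32004.00, 31131.20),
    "RNNT Offline":          (33741.00, 32771.40),
}
for benchmark, (bare_metal, vgpu) in results.items():
    print(f"{benchmark:22s} vGPU/BM = {vgpu / bare_metal:.2f}")
```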
Figures 3 and 4 compare the throughput (queries processed per second) of the MLPerf Inference benchmarks using vSphere 8.0.1 with NVIDIA vGPU A100-80c against the bare metal A100 GPU configuration. The bare metal baseline is set to 1.000, and the virtualized result is presented as a relative percentage of the baseline. vSphere with NVIDIA vGPUs delivers near bare metal performance ranging from 94.4% to 105% for the Offline and Server scenarios of the MLPerf Inference benchmarks.
Figure 3. Normalized throughput for Offline scenario (qps): vGPU 4x A100-80c vs. bare metal 4x A100-80GB
Figure 4. Normalized throughput for Server scenario (qps): vGPU 4x A100-80c vs. bare metal 4x A100
Table 5 shows the throughput in queries per second for the MLPerf Inference benchmarks on the A100 configurations.
Benchmark | Bare Metal 4x A100 | vGPU 4x A100-80c | vGPU/BM |
---|---|---|---|
BERT-99 Server | 13597.00 | 13497.90 | 0.99 |
BERT-99 Offline | 15090.00 | 14923.10 | 0.99 |
BERT-99.9 Server | 7004.00 | 7004.02 | 1.00 |
BERT-99.9 Offline | 7880.00 | 7767.84 | 0.99 |
RetinaNet Server | 2848.84 | 2798.93 | 0.98 |
RetinaNet Offline | 2910.78 | 2876.56 | 0.99 |
RNNT Server | 54000.40 | 51001.80 | 0.94 |
RNNT Offline | 57084.00 | 56174.00 | 0.98 |
3d-UNET-99 Offline | 14.44 | 15.10 | 1.05 |
3d-UNET-99.9 Offline | 14.44 | 15.10 | 1.05 |
Table 5. vGPU 4x A100-80c vs. bare metal 4x A100-80GB throughput (queries per second)
The above results were published by MLCommons in the closed division under submitter ID 3.0-0018.
Takeaways
- VMware vSphere with NVIDIA AI Enterprise, using NVIDIA vGPUs and NVIDIA AI software, delivers from 94% to 105% of bare metal performance for the MLPerf Inference v3.0 benchmarks.
- VMware achieved this performance with only 16 logical CPU cores out of 128 available CPU cores, leaving 112 logical CPU cores for other jobs in the datacenter. This is the extraordinary power of virtualization!
- VMware vSphere combines the power of NVIDIA vGPUs and NVIDIA AI software with the datacenter management benefits of virtualization.
Acknowledgements
VMware thanks Liz Raymond and Yunfan Han of Dell; and Charlie Huang, Manvendar Rawat, and Jason Kyungho Lee of NVIDIA for providing the hardware and software for VMware’s MLPerf Inference submission. The authors would like to acknowledge Juan Garcia-Rovetta and Tony Lin of VMware for their management support.
References
- MLCommons: https://mlcommons.org/
- MLCommons April 05, 2023 – Inference: Datacenter v3.0 Results: https://mlcommons.org/en/inference-datacenter-30/
- MLCommons September 22, 2021 – Inference: Datacenter v1.1 Results: https://mlcommons.org/en/inference-datacenter-11/
- NVIDIA Ampere Architecture: https://www.nvidia.com/en-us/data-center/ampere-architecture/
- NVIDIA Hopper Architecture In-Depth: https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth
- NVIDIA Ampere Architecture In-Depth: https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth
- NVIDIA Docs Hub: https://docs.nvidia.com/ai-enterprise/latest/user-guide/index.html#supported-gpus-grid-vgpu
- MIG or vGPU Mode for NVIDIA Ampere GPU: Which One Should I Use? (Part 1 of 3): https://blogs.vmware.com/performance/2021/09/mig-or-vgpu-part1.html
- Introduction to MLPerf™ Inference v1.1 with Dell EMC Servers: https://infohub.delltechnologies.com/p/introduction-to-mlperf-tm-inference-v1-1-with-dell-emc-servers
- MLPerf Inference Virtualization in VMware vSphere Using NVIDIA vGPUs: https://blogs.vmware.com/performance/2020/12/mlperf-inference-virtualization-in-vmware-vsphere-using-nvidia-vgpus.html
- NVIDIA T4: https://www.nvidia.com/en-us/data-center/tesla-t4/
- NVIDIA Triton: https://developer.nvidia.com/nvidia-triton-inference-server
- NVIDIA TensorRT: https://developer.nvidia.com/tensorrt
- NVIDIA Turing GPU Architecture: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
- V. J. Reddi et al., “MLPerf Inference Benchmark,” 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 2020, pp. 446-459, doi: 10.1109/ISCA45697.2020.00045.