
The wizards behind the curtain: Broadcom, NVIDIA, and Dell power the magic of virtualized ML/AI and deliver performance near or better than bare metal, as shown with MLPerf Inference 4.0

By Uday Kurkure, Lan Vu, and Hari Sivaraman

As a leader in virtualization technologies, VMware by Broadcom has empowered global enterprises by providing innovative infrastructure solutions for data center management that help customers build, run, and manage applications efficiently, securely, and flexibly. For machine learning (ML) and artificial intelligence (AI) workloads, our software solutions work with most hardware vendors to enable these workloads at scale. 

Broadcom, Dell, and NVIDIA have collaborated to bring the magic of virtualization to accelerated data centers. We prove our top wizardry by presenting the latest benchmark results using the MLPerf Inference v4.0 data center benchmark suite. In addition to the existing benchmarks, we submitted outstanding results for the new Stable Diffusion (text-to-image) benchmark. Our results deliver near bare metal or better performance, with the added data center management benefits of virtualization.

Hardware and software

We ran MLPerf Inference workloads on a Dell XE9680 with 8x virtualized NVIDIA H100 SXM 80GB GPUs (test scenario 1) and a Dell R760 with 2x virtualized NVIDIA L40S 48GB GPUs (test scenario 2), using vSphere 8.0.2 with NVIDIA vGPUs. The virtual machines used in our tests were allocated only 32 of the 228–240 available logical CPUs and 128GB of the 1TB–1.5TB available memory. This means we used just a fraction of each system's capacity.

Tables 1 and 2 show the hardware configurations used to run the LLM workloads on the bare metal and virtualized systems. The benchmarks were optimized with NVIDIA TensorRT-LLM. TensorRT-LLM consists of the TensorRT deep learning compiler and includes optimized kernels, pre- and post-processing steps, and multi-GPU/multi-node communication primitives for groundbreaking performance on NVIDIA GPUs.
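
As an illustration only (not the harness we used for our submission, which drives the models through MLCommons LoadGen), the sketch below shows how a model can be served through TensorRT-LLM's high-level Python API available in recent releases; the model ID, prompt, and sampling values are placeholders, not our benchmark settings.

```python
# Minimal sketch, assuming a recent TensorRT-LLM release with the high-level LLM API.
# The model ID, prompt, and sampling values are placeholders, not our benchmark settings.
from tensorrt_llm import LLM, SamplingParams

# Loading the model triggers an engine build: TensorRT-LLM compiles the network with its
# optimized kernels and pre-/post-processing steps before serving requests on the GPU.
llm = LLM(model="EleutherAI/gpt-j-6b")

sampling = SamplingParams(temperature=0.8, top_p=0.95)  # placeholder decode settings

outputs = llm.generate(["Summarize the following article: ..."], sampling)
for output in outputs:
    print(output.outputs[0].text)
```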

Table 1. Hardware and software for test scenario 1 

|  | Bare metal | Virtual |
|---|---|---|
| System | Dell PowerEdge XE9680 | Dell PowerEdge XE9680 |
| Processors | 2x Intel Xeon Platinum 8480 | 2x Intel Xeon Platinum 8480 |
| Logical processors | 228 | 32 (14%) allocated to the inferencing VM (196 available for other VMs/workloads) |
| GPU | 8x H100-SXM-80GB | 8x NVIDIA GRID H100-SXM-80c vGPU (full profile) |
| Memory | 1TB | 128GB allocated to the inferencing VM out of 1TB (12.8%) |
| Storage | 7.68TB NVMe SSD | 5.82TB NVMe SSD |
| OS | Ubuntu 22.04 | Ubuntu 22.04 VM in vSphere 8.0.2 |
| NVIDIA AI Enterprise VIB for ESXi | NVIDIA Driver 535.54.03 | NVIDIA vGPU GRID Driver 550.53 |
| CUDA | 12.2 | CUDA 12.2 and Linux vGPU Driver 550.53 |
| TensorRT | TensorRT 9.3.0 | TensorRT 9.3.0 |
| Special VM settings | N/A | pciPassthru0.cfg.enable_uvm = "1" |

 

Table 2. Hardware and software for test scenario 2 

|  | Bare metal | Virtual |
|---|---|---|
| System | Dell PowerEdge R760 | Dell PowerEdge R760 |
| Processors | 2x Intel Xeon Platinum 8580 | 2x Intel Xeon Platinum 8580 |
| Logical processors | 240 | 32 (13%) allocated to the inferencing VM (208 available for other VMs/workloads) |
| GPU | 2x L40S 48GB | 2x NVIDIA GRID L40S-48c vGPU (full profile) |
| Memory | 1.5TB | 128GB allocated to the inferencing VM out of 1.5TB (8.5%) |
| Storage | 6TB NVMe SSD | 5.82TB NVMe SSD |
| OS | Ubuntu 22.04 | Ubuntu 22.04 VM in vSphere 8.0.2 |
| NVIDIA AI Enterprise VIB for ESXi | NVIDIA Driver 545.23.08 | NVIDIA vGPU GRID Driver 535.129.03 |
| CUDA | 12.3 | CUDA 12.2 and Linux vGPU Driver 535.129.03 |
| TensorRT | TensorRT 9.3.0 | TensorRT 9.3.0 |
| Special VM settings | N/A | pciPassthru0.cfg.enable_uvm = "1" |
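
The special VM setting shown in both tables (pciPassthru0.cfg.enable_uvm) is an advanced VM configuration parameter. As an illustration only, the sketch below sets it programmatically with pyVmomi; the vCenter address, credentials, and VM name are placeholders, and the same setting can also be added in the vSphere Client under the VM's advanced configuration parameters.

```python
# Minimal sketch, assuming pyVmomi; the vCenter host, credentials, and VM name are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def set_uvm_option(si, vm_name):
    """Add pciPassthru0.cfg.enable_uvm = "1" to the named VM's advanced settings."""
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == vm_name)  # assumes the VM exists

    spec = vim.vm.ConfigSpec(
        extraConfig=[vim.option.OptionValue(key="pciPassthru0.cfg.enable_uvm", value="1")]
    )
    return vm.ReconfigVM_Task(spec)  # reconfigure while the VM is powered off

if __name__ == "__main__":
    ctx = ssl._create_unverified_context()  # lab-only; verify certificates in production
    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="changeme", sslContext=ctx)
    try:
        set_uvm_option(si, "mlperf-inference-vm")
    finally:
        Disconnect(si)
```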

 

Test scenario 1: Performance comparison of virtualized vs native ML/AI workloads, featuring a Dell PowerEdge XE9680 vSphere host/bare metal server with 8 NVIDIA H100 GPUs

Figures 1 and 2 show the performance results of test scenario 1, which compares a bare metal configuration with vSphere on a Dell PowerEdge XE9680 with 8 H100 GPUs. The bare metal baseline is set to 1.0, and the virtualized result is presented relative to that baseline. Compared to the bare metal results, vSphere with NVIDIA vGPUs delivers near bare metal or better performance, ranging from 95% to 104% of bare metal for the Offline and Server scenarios of the MLPerf Inference 4.0 benchmark.
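
For clarity, the sketch below shows how the relative numbers in figures 1 and 2 are derived; the throughput values used here are placeholders, not measured results.

```python
# Minimal sketch of the normalization used in figures 1 and 2.
# The throughput values below are placeholders, not measured MLPerf results.
bare_metal_qps = {"BERT-large 99%": 1000.0, "GPT-J 6B 99%": 100.0}   # hypothetical
virtual_qps    = {"BERT-large 99%":  990.0, "GPT-J 6B 99%": 104.0}   # hypothetical

for benchmark, baseline in bare_metal_qps.items():
    relative = virtual_qps[benchmark] / baseline   # bare metal baseline = 1.0
    print(f"{benchmark}: {relative:.0%} of bare metal")
```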

In an Offline scenario, the workload generator (LoadGen) sends all queries to the system under test at the start of the run. In a Server scenario, LoadGen sends new queries to the system under test according to a Poisson distribution. This is shown in table 3.

Table 3. Server vs Offline scenarios in the MLPerf Inference 4.0 benchmark

| Scenario | Query generation | Duration | Samples per query | Latency constraint | Tail latency | Performance metric |
|---|---|---|---|---|---|---|
| Server | LoadGen sends new queries to the SUT according to a Poisson distribution | 270,336 queries and 60 seconds | 1 | Benchmark-specific | 99% | Maximum Poisson throughput parameter supported |
| Offline | LoadGen sends all queries to the SUT at start | 1 query and 60 seconds | At least 24,576 | None | N/A | Measured throughput |

Source: MLPerf Inference: Datacenter Benchmark Suite Results, “Scenarios and Metrics”
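
To make the difference between the two scenarios concrete, the sketch below simulates the two query-generation patterns from table 3; it is not LoadGen itself, and the target QPS value is a placeholder.

```python
# Minimal sketch of the two query-generation patterns; not LoadGen itself.
# The target QPS value below is a placeholder.
import numpy as np

rng = np.random.default_rng(0)
target_qps = 100.0   # hypothetical Poisson throughput parameter
duration_s = 60.0

# Server scenario: exponential inter-arrival gaps give a Poisson arrival process,
# so queries trickle in over the run at roughly target_qps.
gaps = rng.exponential(scale=1.0 / target_qps, size=int(2 * target_qps * duration_s))
arrivals = np.cumsum(gaps)
server_queries = arrivals[arrivals < duration_s]

# Offline scenario: every sample is issued at t=0 and the metric is measured throughput.
offline_samples = 24576   # minimum sample count from table 3

print(f"Server: {server_queries.size} queries arrive over {duration_s:.0f}s "
      f"(~{server_queries.size / duration_s:.1f} QPS)")
print(f"Offline: {offline_samples} samples issued at t=0")
```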

We observed the following virtualized results for the MLPerf Inference 4.0 benchmarks:

  • Language and image: 4% better than bare metal
    • GPT-J 6 billion LLM summarization model (99% and 99.9%) with the CNN-DailyMail news text summarization dataset
    • SDXL 1.0 (Stable Diffusion) image generation model with the COCO-2014 dataset
  • Other benchmarks, as shown in the following figures and described at MLPerf Inference: Datacenter Benchmark Suite Results, “Benchmarks”: 0%–1% overhead compared to bare metal

Note: MLCommons verified the results of Retinanet, BERT-large (99% and 99.9%), RNNT, 3D UNET (99% and 99.9%), and SDXL 1.0; these are shown at MLPerf Inference: Datacenter Benchmark Suite Results, "Results." We ran the GPT-J 6 billion (99% and 99.9%) benchmark with the same hardware and software systems mentioned above, but MLCommons did not verify those results, and we did not include them in our MLPerf Inference v4.0 submission.

Figure 1. vSphere vs bare metal for Offline inference results on Dell XE9680 for Retinanet, 3D UNET (99% and 99.9%), BERT-large (99%), GPT-J 6B (99% and 99.9%), and SDXL 1.0 (Stable Diffusion)

Figure 2. vSphere vs bare metal for Server inference results on Dell XE9680 for Retinanet, BERT-large (99%), GPT-J 6B (99% and 99.9%), and SDXL 1.0 (Stable Diffusion)

Test scenario 2: Performance comparison of virtualized vs native ML/AI workloads, featuring a Dell PowerEdge R760 vSphere host/bare metal server with 2 NVIDIA L40S GPUs

For test scenario 2, we observed the virtual results to have an overhead of 2%–10% compared to the bare metal results for the Offline and Server scenarios (as shown in figures 3 and 4).

Figure 3. vSphere vs bare metal for Offline inference results on Dell R760 for Retinanet, 3D UNET (99% and 99.9%), RNNT, and BERT-large (99%)

Figure 4. vSphere vs bare metal for Server inference results on Dell R760 for Retinanet, RNNT, and BERT-large (99%)

Conclusion

We used only 32 logical CPU cores and 128GB of memory for this inference benchmarking—that’s a key benefit of virtualization. This lets you use the remaining CPU and memory capacity on the same systems to run other workloads, save on the cost of ML/AI infrastructure, and leverage the virtualization benefits of vSphere for managing data centers. 

The results of our benchmark testing show that vSphere 8.0.2 with NVIDIA virtualized GPUs is in the Goldilocks Zone for ML/AI workloads. vSphere makes these workloads easy to manage and fast to run, combining NVIDIA vGPUs, NVLink connectivity between devices, and vSphere virtualization technologies so the same AI/ML infrastructure can serve graphics, training, and inference. Virtualization also lowers the total cost of ownership (TCO) of an ML/AI infrastructure by allowing you to share expensive hardware resources among multiple tenants.

Acknowledgments: We would especially like to thank Jia Dai, Manvendar Rawat (NVIDIA), Frank Han, Vinay HN, Jay Engh, Nirmala Sundararajan (Dell), Juan Garcia-Rovetta, and Julie Brodeur (Broadcom) for their help and support in completing this work.