By Uday Kurkure, Lan Vu, and Hari Sivaraman
As a leader in virtualization technologies, VMware by Broadcom has empowered global enterprises with innovative infrastructure solutions for data center management that help customers build, run, and manage applications efficiently, securely, and flexibly. For machine learning (ML) and artificial intelligence (AI), our software works with most hardware vendors to enable these workloads at scale.
Broadcom, Dell, and NVIDIA have collaborated to bring the magic of virtualization to GPU-accelerated data centers. We prove our wizardry by presenting the latest benchmark results using the MLPerf Inference v4.0 data center benchmark suite. In addition to the established benchmarks, we submitted outstanding results for the new Stable Diffusion (text-to-image) benchmark. Our results deliver near bare metal or better performance, with the added virtualization benefits of data center management.
Hardware and software
We ran MLPerf Inference workloads on a Dell PowerEdge XE9680 with 8x virtualized NVIDIA SXM H100 80GB GPUs (test scenario 1) and a Dell PowerEdge R760 with 2x virtualized NVIDIA L40S 48GB GPUs (test scenario 2), using vSphere 8.0.2 and NVIDIA vGPUs. The virtual machines used in our tests were allocated only 32 of the 228–240 available logical CPUs and 128GB of the 1TB–1.5TB available memory. This means we used just a fraction of each system's capacity.
Tables 1 and 2 show the hardware configurations used to run the LLM workloads on the bare metal and virtualized systems. The benchmarks were optimized with NVIDIA TensorRT-LLM. TensorRT-LLM consists of the TensorRT deep learning compiler and includes optimized kernels, pre- and post-processing steps, and multi-GPU/multi-node communication primitives for groundbreaking performance on NVIDIA GPUs.
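To give a sense of what TensorRT-LLM looks like from the user's side, here is a minimal, illustrative sketch using its high-level Python LLM API. The model identifier, prompt, and sampling values are assumptions for illustration, the API surface varies across TensorRT-LLM releases, and this is not the harness used for our MLPerf runs.

```python
# Illustrative sketch only: text generation with TensorRT-LLM's high-level LLM API.
# Not the MLPerf 4.0 harness; model id, prompt, and sampling values are examples.
from tensorrt_llm import LLM, SamplingParams

# Builds/loads a TensorRT engine for GPT-J 6B (Hugging Face model id assumed).
llm = LLM(model="EleutherAI/gpt-j-6b")

# Greedy decoding with a bounded output length, in the spirit of a summarization task.
sampling = SamplingParams(max_tokens=128, temperature=0.0)

prompts = ["Summarize the following article: ..."]
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```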
Table 1. Hardware and software for test scenario 1
| Bare metal | Virtual |
System | Dell PowerEdge XE9680 | Dell PowerEdge XE9680 |
Processors | 2x Intel Xeon Platinum 8480 | 2x Intel Xeon Platinum 8480 |
Logical processors | 228 | 32 (14%) allocated to the VM for inferencing (196 available for other VMs/workloads) |
GPU | 8 x H100-SXM-80GB | 8 x NVIDIA GRID H100-SXM-80c vGPU (full profile) |
Memory | 1TB | 128GB allocated for inferencing VM out of 1TB (12.8%) |
Storage | 7.68TB NVMe SSD | 5.82TB NVMe SSD |
OS | Ubuntu 22.04 | Ubuntu 22.04 VM in vSphere 8.0.2 |
NVIDIA AI Enterprise VIB for ESXi | NVIDIA Driver 535.54.03 | NVIDIA vGPU GRID Driver 550.53 |
CUDA | 12.2 | 12.2 CUDA and Linux vGPU Driver 550.53 |
TensorRT | TensorRT 9.3.0 | TensorRT 9.3.0 |
Special VM settings | N/A | pciPassthru0.cfg.enable_uvm = "1" |
Table 2. Hardware and software for test scenario 2
| Bare metal | Virtual |
System | Dell PowerEdge R760 | Dell PowerEdge R760 |
Processors | 2x Intel Xeon Platinum 8580 | 2x Intel Xeon Platinum 8580 |
Logical processors | 240 | 32 (13%) allocated to the VM for inferencing (208 available for other VMs/workloads) |
GPU | 2x L40S 48GB | 2x NVIDIA GRID L40S 48c vGPU (full profile) |
Memory | 1.5TB | 128GB allocated for inferencing VM out of 1.5TB (8.5%) |
Storage | 6TB NVMe SSD | 5.82TB NVMe SSD |
OS | Ubuntu 22.04 | Ubuntu 22.04 VM in vSphere 8.0.2 |
NVIDIA AI Enterprise VIB for ESXi | NVIDIA Driver 545.23.08 | NVIDIA vGPU GRID Driver 535.129.03 |
CUDA | 12.3 | 12.2 CUDA and Linux vGPU Driver 535.129.03 |
TensorRT | TensorRT 9.3.0 | TensorRT 9.3.0 |
Special VM settings | N/A | pciPassthru0.cfg.enable_uvm = "1" |
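The special VM setting in both tables is an advanced (extraConfig) parameter on the VM. As a rough illustration of how such a setting can be applied programmatically, the sketch below uses the pyVmomi SDK; the vCenter address, credentials, and VM name are placeholders, and the same change can also be made in the vSphere Client or in the VM's .vmx file while the VM is powered off.

```python
# Sketch (assumptions: pyVmomi installed; vCenter host, credentials, and VM name
# are placeholders): add the advanced setting pciPassthru0.cfg.enable_uvm = "1"
# to a powered-off VM.
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="password")  # handle SSL verification per your environment
content = si.RetrieveContent()

# Locate the inferencing VM by name (hypothetical name).
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "mlperf-inference-vm")

# Apply the advanced parameter via extraConfig and wait for the reconfigure task.
spec = vim.vm.ConfigSpec(extraConfig=[
    vim.option.OptionValue(key="pciPassthru0.cfg.enable_uvm", value="1")
])
WaitForTask(vm.ReconfigVM_Task(spec))
Disconnect(si)
```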
Test scenario 1: Performance comparison of virtualized vs native ML/AI workloads, featuring a Dell PowerEdge XE9680 vSphere host/bare metal server with 8 NVIDIA H100 GPUs
Figures 1 and 2 show the performance results of test scenario 1, which compares a bare metal configuration with vSphere on a Dell PowerEdge XE9680 with 8 H100 GPUs. The bare metal baseline is set to 1.0, and the virtualized result is presented as a relative percentage of the baseline. Compared to the bare metal results, vSphere with NVIDIA vGPUs delivers near bare metal performance, ranging from 95% to 104% for the Offline and Server scenarios of the MLPerf Inference 4.0 benchmark.
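As a concrete (made-up) example of how these normalized numbers are computed, the throughput measured in the VM is simply divided by the bare metal throughput for the same benchmark and scenario:

```python
# Hypothetical throughputs, only to illustrate the normalization used in figures 1-4.
bare_metal_qps = 1000.0   # bare metal throughput (baseline = 1.0)
virtual_qps = 1040.0      # vSphere + vGPU throughput for the same benchmark/scenario

relative = virtual_qps / bare_metal_qps
print(f"virtual vs bare metal: {relative:.2f}x ({relative * 100:.0f}% of baseline)")
```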
In an Offline scenario, the workload generator (LoadGen) sends all queries to the system under test (SUT) at the start of the run. In a Server scenario, LoadGen sends new queries to the SUT according to a Poisson distribution. Table 3 summarizes the two scenarios.
Table 3. Server vs Offline scenarios in the MLPerf Inference 4.0 benchmark
Scenario | Query generation | Duration | Samples per query | Latency constraint | Tail latency | Performance metric |
Server | LoadGen sends new queries to the SUT according to a Poisson distribution | 270,336 queries and 60 seconds | 1 | Benchmark-specific | 99% | Maximum Poisson throughput parameter supported |
Offline | LoadGen sends all queries to the SUT at start | 1 query and 60 seconds | At least 24,576 | None | N/A | Measured throughput |
Source: MLPerf Inference: Datacenter Benchmark Suite Results, “Scenarios and Metrics”
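For readers who want to see how a harness drives these two scenarios, the sketch below uses the public mlperf_loadgen Python bindings. The callbacks are stubs rather than a real inference backend, the QPS targets and sample counts are placeholder values, and attribute names can differ slightly between LoadGen releases.

```python
# Schematic LoadGen setup for the Server and Offline scenarios (stub SUT, no real model).
import mlperf_loadgen as lg

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Server      # or lg.TestScenario.Offline
settings.mode = lg.TestMode.PerformanceOnly
settings.server_target_qps = 100                # Poisson arrival rate for Server (placeholder)
settings.offline_expected_qps = 1000            # sizes the single large Offline query (placeholder)

def issue_queries(query_samples):
    # Run inference for each sample, then report completions back to LoadGen.
    # A real harness would pass pointers to the output data instead of (0, 0).
    lg.QuerySamplesComplete(
        [lg.QuerySampleResponse(s.id, 0, 0) for s in query_samples])

def flush_queries():
    pass

def load_samples(indices):
    pass  # load the listed dataset samples into memory

def unload_samples(indices):
    pass  # release the listed dataset samples

sut = lg.ConstructSUT(issue_queries, flush_queries)
qsl = lg.ConstructQSL(24576, 24576, load_samples, unload_samples)
lg.StartTest(sut, qsl, settings)
lg.DestroyQSL(qsl)
lg.DestroySUT(sut)
```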
We observed the following virtualized results for the MLPerf Inference 4.0 benchmarks:
- Language and image benchmarks: up to 4% better than bare metal
  - GPT-J 6 billion LLM summarization model (99% and 99.9%) with the CNN-DailyMail news text summarization dataset
  - SDXL 1.0 (Stable Diffusion) image generation model with the COCO-2014 dataset
- Other benchmarks, as shown in the following figures and described at MLPerf Inference: Datacenter Benchmark Suite Results, “Benchmarks”: 0%–1% overhead compared to bare metal
Note: MLCommons verified the results for Retinanet, BERT-large (99% and 99.9%), RNNT, 3D UNET (99% and 99.9%), and SDXL 1.0; these are shown at MLPerf Inference: Datacenter Benchmark Suite Results, “Results.” We ran the GPT-J 6 billion (99% and 99.9%) benchmark on the same hardware and software described above, but MLCommons did not verify those results, and we did not include them in our MLPerf Inference v4.0 submission.
Figure 1. vSphere vs bare metal for offline inference results on Dell XE9680 for Retinanet, 3D UNET (99% and 99.9%), BERT-large (99%), GPT-J 6B (99% and 99.9%), and SDXL 1.0 (Stable Diffusion)
Figure 2. vSphere vs bare metal for server inference results on Dell XE9680 for Retinanet, BERT-large (99%), GPT-J 6B (99% and 99.9%), and SDXL 1.0 (Stable Diffusion)
Test scenario 2: Performance comparison of virtualized vs native ML/AI workloads, featuring a Dell PowerEdge R760 vSphere host/bare metal server with 2 NVIDIA L40S GPUs
For test scenario 2, the virtual results showed an overhead of 2%–10% compared to the bare metal results for the Offline and Server scenarios (as shown in figures 3 and 4).
Figure 3. vSphere vs bare metal for Offline inference results on Dell R760 for Retinanet, 3D UNET (99% and 99.9%), RNNT, and BERT-large (99%)
Figure 4. vSphere vs bare metal for Server inference results on Dell R760 for Retinanet, RNNT, and BERT-large (99%)
Conclusion
We used only 32 logical CPUs and 128GB of memory for this inference benchmarking, which highlights a key benefit of virtualization: the remaining CPU and memory capacity on the same systems can run other workloads, reducing the cost of ML/AI infrastructure while retaining the data center management benefits of vSphere.
The results of our benchmark testing show that vSphere 8.0.2 with NVIDIA virtualized GPUs is in the Goldilocks Zone for ML/AI workloads. vSphere makes it easy to manage and run these workloads quickly, using NVIDIA vGPUs, NVLink-connected devices, and vSphere virtualization technologies to serve the same AI/ML infrastructure for graphics, training, and inference. Virtualization also lowers the total cost of ownership (TCO) of an ML/AI infrastructure by allowing you to share expensive hardware resources among multiple tenants.
Acknowledgments: We would especially like to thank Jia Dai, Manvendar Rawat (NVIDIA), Frank Han, Vinay HN, Jay Engh, Nirmala Sundararajan (Dell), Juan Garcia-Rovetta, and Julie Brodeur (Broadcom) for their help and support in completing this work.