
Broadcom Delivers Near Bare-Metal Performance for Virtualized AI/ML

Virtualizing AI/ML workloads requires only a fraction of the bare-metal resources. This leaves the remaining resources for other workloads while maintaining uncompromising performance and lowering the total cost of ownership at the same time.

Introduction

Generative AI (Gen AI) is rapidly transforming how we create, communicate, and solve problems across industries. Gen AI tools are pushing the boundaries of what’s possible with machine intelligence. As organizations deploy Gen AI models for tasks like text generation, image synthesis, and data analysis, the performance, scalability, and resource utilization of the underlying infrastructure become critical factors. Choosing the right infrastructure—whether virtualized or bare metal—can significantly impact how effectively these AI workloads run at scale. This blog dives into a performance comparison between virtualized and bare-metal environments for Gen AI workloads.

Broadcom brings the power of virtualized NVIDIA GPUs to the VCF private cloud platform to simplify management of AI-accelerated data centers and enable efficient application development and execution for demanding AI and ML workloads. Broadcom’s VMware software supports various hardware vendors, allowing flexibility and choice while facilitating scalable deployments.

Broadcom and NVIDIA have collaborated to develop a joint Gen AI platform: VMware Private AI Foundation with NVIDIA. This Gen AI platform allows data scientists and others to fine-tune LLM models, deploy RAG workflows, and run inference workloads in their data centers while addressing privacy, choice, cost, performance, and compliance concerns. Built and run on the industry-leading private cloud platform, VMware Cloud Foundation, VMware Private AI Foundation with NVIDIA includes the NVIDIA AI Enterprise, NVIDIA NIM (included with NVIDIA AI Enterprise), NVIDIA LLMs, and other components that provide access to community models (such as Hugging Face models). VMware Cloud Foundation (VCF), VMware’s full-stack private cloud infrastructure solution, offers a secure, comprehensive, and scalable platform for building and operating Gen AI workloads, providing organizations with agility, flexibility, and scalability to meet their evolving business needs.

Summary

Broadcom partnered with NVIDIA, Supermicro, and Dell to showcase virtualization’s benefits (for example, intelligent pooling and sharing of AI infrastructure), achieving impressive MLPerf Inference v5.0 results. We demonstrated near bare-metal performance across diverse AI domains: computer vision, medical imaging, and natural language processing (using the 6-billion-parameter GPT-J language model). We also achieved outstanding results with the Mixtral-8x7B 56-billion-parameter large language model (LLM). Figure 1 shows normalized virtualized performance to be near that of bare metal, ranging from 95%–100% on VMware vSphere 8.0 U3 with NVIDIA virtualized GPUs. Virtualization lowers the total cost of ownership (TCO) of an AI/ML infrastructure by allowing you to share expensive hardware resources among multiple tenants with practically no impact on performance. Refer to the official MLCommons Inference v5.0 results for the raw comparison of queries per second or tokens per second.


Hardware and Software

We ran MLPerf Inference v5.0 workloads virtualized with VMware vSphere 8.0 U3 on two systems:

  • SuperMicro SuperServer SYS-821GE-TNRT with 8x virtualized NVIDIA SXM H100 80GB GPUs
  • Dell PowerEdge XE9680 with 8x virtualized NVIDIA SXM H100 80GB GPUs

We needed to allocate only a fraction of the bare metal resources to the virtual machines used in our tests.

Tables 1 and 2 show the hardware configurations used to run the LLM workloads on the bare metal and virtualized systems. In all cases, the physical GPU, which is the primary driver of performance for these workloads, was the same in the virtual configuration as it was in the bare metal configuration to which it was being compared.

The benchmarks were optimized with NVIDIA TensorRT-LLM, which consists of the TensorRT deep learning compiler and includes optimized kernels, pre- and post-processing steps, and multi-GPU/multi-node communication primitives for maximum performance on virtualized NVIDIA GPUs.
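For context, a minimal sketch of running inference through TensorRT-LLM’s high-level Python LLM API is shown below. This is an illustration only, not the MLPerf submission harness: the model id, prompt, and sampling parameters are placeholders, and the exact API surface depends on the TensorRT-LLM version installed.

```python
# Minimal TensorRT-LLM sketch (illustration only, not the MLPerf harness).
# The model id, prompt, and sampling parameters are placeholders; the exact API
# surface depends on the installed TensorRT-LLM version.
from tensorrt_llm import LLM, SamplingParams

def main():
    # Builds (or loads) a TensorRT engine for the model, then runs one prompt.
    llm = LLM(model="EleutherAI/gpt-j-6b")  # placeholder Hugging Face model id
    params = SamplingParams(temperature=0.0, top_p=1.0)

    outputs = llm.generate(["Summarize the following article: ..."], params)
    for output in outputs:
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```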

COMPONENT | BARE METAL | VIRTUAL
System | SuperMicro GPU A+ SuperServer 8125GS-TNHR | SuperMicro GPU SuperServer SYS-821GE-TNRT
CPU* | 2x AMD EPYC 9634 84-core processor | 2x Intel® Xeon® Platinum 8568Y+
CPU cores | 168 | 48 vCPUs (25%) allocated to the VM for inferencing (144 available for other VMs/workloads)
GPU | 8x H100-SXM-80GB | 8x NVIDIA GRID H100-SXM-80c vGPU (full profile)
Accelerator interconnect | 18x 4th Gen NVLink, 900GB/s | 18x 4th Gen NVLink, 900GB/s
Memory* | 1.5TB | 2.5TB allocated to the inferencing VM out of 3TB (83.3%)
Storage* | 7TB NVMe SSD | 4x 3.37TB NVMe SSD
OS | Ubuntu 24.04 | Ubuntu 24.04 VM on vSphere 8.0.3 (build 24501832)
NVIDIA AI Enterprise VIB for VMware ESXi | N/A | vGPU_17.4_GA_AIE_ESXi_Host_Drivers for vGPUs
CUDA | 12.8 | CUDA 12.8 and Linux vGPU driver 550.90
TensorRT | 10.8 | 10.8
Special VM settings | N/A | pciPassthru[0-7].cfg.enable_uvm = "1"
Table 1. Hardware and software for SuperMicro GPU SuperServer SYS-821GE-TNRT
* CPU, memory, and storage do not have matching specifications due to hardware availability constraints.
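
The special VM setting shown in both tables, pciPassthru[0-7].cfg.enable_uvm = "1", is an advanced VM configuration parameter that enables unified memory (UVM) on each of the eight vGPU devices. As a hedged illustration of one way to apply such advanced settings programmatically (not necessarily how the test VMs were configured; the setting can also be added through the vSphere Client), the sketch below uses the open-source pyVmomi library. The vCenter address, credentials, and VM name are placeholders.

```python
# Sketch: apply pciPassthru[0-7].cfg.enable_uvm = "1" to a VM with pyVmomi.
# vCenter address, credentials, and the VM name are placeholders; the setting
# is typically applied while the VM is powered off.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Find the inferencing VM by name (placeholder name).
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "mlperf-inference-vm")
view.Destroy()

# One enable_uvm entry per vGPU device (indices 0-7 in this 8-GPU configuration).
extra = [vim.option.OptionValue(key=f"pciPassthru{i}.cfg.enable_uvm", value="1")
         for i in range(8)]
vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(extraConfig=extra))

Disconnect(si)
```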
COMPONENT | BARE METAL | VIRTUAL
System | Dell PowerEdge XE9680 | Dell PowerEdge XE9680
CPU | 2x Intel Xeon Platinum 8468 | 2x Intel Xeon Platinum 8468
CPU cores | 96 | 64 vCPUs (67%) allocated to the VM for inferencing (32 available for other VMs/workloads)
GPU | 8x H100-SXM-80GB | 8x NVIDIA GRID H100-SXM-80c vGPU (full profile)
Memory | 2TB | 1TB allocated to the inferencing VM out of 2TB (50%); the other 1TB available for other VMs/workloads
Storage | 4x 1.75TB NVMe SSD | 4x 1.75TB NVMe SSD
OS | Ubuntu 24.04 | Ubuntu 24.04 running in a vSphere 8.0 U3 VM
NVIDIA AI Enterprise VIB for VMware ESXi | N/A | VMware ESXi 8.0.3 2441450
CUDA | 12.8 | CUDA 12.8 and Linux vGPU driver 550.127.06
TensorRT | 10.8 | 10.8
Special VM settings | N/A | pciPassthru[0-7].cfg.enable_uvm = "1"
Table 2. Hardware and software for the Dell PowerEdge XE9680 test scenario

Benchmarks

Each benchmark is defined by a dataset and quality target. The following table summarizes the benchmarks in this version of the suite (the rules remain the official source of truth).

AREA | TASK | MODEL | DATASET | QSL SIZE | QUALITY | SERVER LATENCY CONSTRAINT | LATEST VERSION AVAILABLE
Vision | Object detection | RetinaNet | OpenImages (800×800) | 64 | 99% of FP32 (0.3755 mAP) | 100ms | v5.0
Vision | Medical image segmentation | 3D-UNet | KiTS 2019 (602×512×512) | 16 | 99% of FP32 and 99.9% of FP32 (0.86330 mean DICE score) | N/A | v5.0
Language | LLM summarization | GPT-J 6B | CNN-DailyMail News Text Summarization | 13368 | 99% or 99.9% of the original FP32 (ROUGE 1 = 42.9865, ROUGE 2 = 20.1235, ROUGE L = 29.9881) | 20s | v5.0
Language | LLM text generation (question answering, math, and code generation) | Mixtral-8x7B | OpenOrca, GSM8K, MBXP | 15000 | 99% or 99.9% of FP32 (ROUGE 1 = 45.4911, ROUGE 2 = 23.2829, ROUGE L = 30.3615, GSM8K accuracy = 73.78, MBXP accuracy = 60.12) | TTFT: 2s & TPOT: 200ms | v5.0
Table 3. Benchmarks used in the performance testing
Source: https://mlcommons.org/benchmarks/inference-datacenter/
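
The quality targets for the two LLM benchmarks are expressed as ROUGE scores relative to an FP32 reference run. As a hedged illustration of what those metrics measure (this is not the MLPerf accuracy harness), ROUGE-1/2/L can be computed for a single candidate/reference pair with the open-source rouge-score package; the example strings below are placeholders.

```python
# Illustrative only: computing ROUGE-1/2/L for one generated summary against a
# reference, using the open-source rouge-score package (not the MLPerf harness).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The court ruled that the merger could proceed under strict conditions."
candidate = "The court said the merger may go ahead under strict conditions."

scores = scorer.score(reference, candidate)
for name, result in scores.items():
    # Each result carries precision, recall, and F-measure; MLPerf's targets are
    # derived from aggregate scores over the full dataset, not a single pair.
    print(f"{name}: f1={result.fmeasure:.4f}")
```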

In the Offline scenario, the workload generator (LoadGen) sends all queries to the system under test (SUT) at the start of the run. In the Server scenario, LoadGen sends new queries to the SUT according to a Poisson distribution. Table 4 summarizes the two scenarios.

SCENARIO | QUERY GENERATION | DURATION | SAMPLES PER QUERY | LATENCY CONSTRAINT | TAIL LATENCY | PERFORMANCE METRIC
Server | LoadGen sends new queries to the SUT according to a Poisson distribution | 270,336 queries and 60s | 1 | Benchmark-specific | 99% | Maximum Poisson throughput parameter supported
Offline | LoadGen sends all queries to the SUT at start | 1 query and 60s | At least 24,576 | None | N/A | Measured throughput
Table 4. Server vs Offline scenarios in the MLPerf Inference 5.0 benchmark
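
To make the difference between the two scenarios concrete, the sketch below simulates their query-issue patterns. It illustrates only the scheduling behavior, not LoadGen itself; the query count and target rate are arbitrary placeholders.

```python
# Illustration of the two MLPerf query-generation patterns (not LoadGen itself).
# The query count and target rate below are arbitrary placeholders.
import random

def server_schedule(num_queries: int, target_qps: float, seed: int = 0) -> list[float]:
    """Server scenario: issue times follow a Poisson process, i.e. the gaps
    between consecutive queries are exponentially distributed with mean 1/QPS."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_queries):
        t += rng.expovariate(target_qps)  # exponential inter-arrival gap
        times.append(t)
    return times

def offline_schedule(num_queries: int) -> list[float]:
    """Offline scenario: every query is available at t=0; the SUT batches freely."""
    return [0.0] * num_queries

if __name__ == "__main__":
    print("Server issue times (s):",
          [round(t, 3) for t in server_schedule(num_queries=10, target_qps=5.0)])
    print("Offline issue times (s):", offline_schedule(10))
```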

Performance comparison of virtualized vs bare-metal ML/AI workloads

Featuring a SuperMicro SuperServer SYS-821GE-TNRT and a Dell PowerEdge XE9680 as vSphere hosts, each with 8 virtualized NVIDIA H100 GPUs, compared against bare metal servers

Figure 1 shows the performance results of the test scenarios, comparing the bare metal configurations with vSphere on a SuperMicro GPU SuperServer SYS-821GE-TNRT and a Dell PowerEdge XE9680, each with a virtualized 8x H100 GPU device group linked via NVLink. The bare metal baseline is set to 1.0, and each virtualized result is presented as a relative percentage of that baseline.
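
The normalization behind Figure 1 is straightforward: each virtualized throughput result is divided by its bare metal counterpart. A minimal sketch of this calculation is shown below; the benchmark names and throughput values are placeholders for illustration only, not measured MLPerf results.

```python
# Normalization used for Figure 1: virtualized throughput divided by the bare
# metal baseline. The values below are placeholders, not measured MLPerf results.
bare_metal_qps = {"retinanet-offline": 1000.0, "gptj-offline": 100.0}  # placeholders
virtual_qps = {"retinanet-offline": 970.0, "gptj-offline": 99.0}       # placeholders

for benchmark, baseline in bare_metal_qps.items():
    ratio = virtual_qps[benchmark] / baseline  # bare metal baseline = 1.0
    print(f"{benchmark}: {ratio:.1%} of bare metal")
```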

Compared to the bare metal results, vSphere with NVIDIA vGPUs delivers near bare metal performance, ranging from 95%–100% for the Offline and Server scenarios of the MLPerf Inference 5.0 benchmark.

Note that the Mixtral-8x7B performance numbers were obtained on the Dell PowerEdge XE9680, and all other benchmark numbers were obtained on the SuperMicro GPU SuperServer SYS-821GE-TNRT.

Figure 1. Virtualized MLPerf Inference 5.0 performance in VCF 8 vs bare metal performance

Conclusion

The virtualized configurations used only 28.5%–67% of the CPU cores and 50%–83% of the available physical memory to deliver near bare-metal performance—that’s a key benefit of virtualization. This lets you use the remaining CPU and memory capacity on the same systems to run other workloads, save on ML/AI infrastructure costs, and leverage the virtualization benefits of vSphere for managing data centers. Beyond the GPU, virtualization also lets you pool and share CPU, memory, network, and data I/O, which dramatically lowers the total cost of ownership (TCO) of an AI/ML infrastructure (as much as a 3x–5x improvement) by sharing expensive hardware resources among multiple tenants with practically no impact on performance. The results of our benchmark testing show that vSphere 8.0.3 with NVIDIA virtualized GPUs is in the Goldilocks Zone for AI/ML workloads: NVIDIA vGPUs, flexible NVLink device connectivity, and vSphere virtualization technologies make it easy to manage and process workloads quickly and to use the same AI/ML infrastructure for graphics, training, and inference.

Want to learn more about VMware Private AI Foundation with NVIDIA?

  1. Check out the VMware Private AI Foundation with NVIDIA webpage for more resources.
  2. Contact us using this Interest Request form.

Acknowledgments: We would especially like to thank Jia Dai (NVIDIA), Patrick Geary, Eddie Chao, Dexter Bermudez, Steven Phan (SuperMicro), Frank Han, Jack Tang, Jay Engh, Minesh Patel, Leela Uppuluri (Dell), Juan Garcia-Rovetta, and Julie Brodeur (Broadcom) for their help and support in completing this work. We also thank Chris Wolf, Shobhit Bhutani, and Mark Chuang for their review comments.