
VMware Private AI Foundation with NVIDIA on HGX Servers for Inference

By Yuankun Fu, Agustin Malanco, Ramesh Radhakrishnan

As artificial intelligence (AI) continues to transform industries, enterprises are increasingly seeking secure, cost-efficient, and high-performance solutions to deploy generative AI (GenAI) applications on-premises. VMware Private AI Foundation with NVIDIA provides a robust platform for deploying GenAI workloads in private cloud environments, addressing critical concerns around privacy, compliance, and resource efficiency. At the NVIDIA GTC 2025 AI conference, Broadcom and NVIDIA released a joint reference architecture that describes the core components, infrastructure choices, deployment considerations, and performance validation for VMware Private AI Foundation with NVIDIA on HGX servers. The technical paper offers organizations a comprehensive guide to deploying and optimizing AI inference workloads in a private cloud environment.

Why Choose VMware Private AI Foundation with NVIDIA?

Enterprises face growing demands to customize large language models (LLMs), run inference workloads securely, and optimize costs while maintaining compliance. To address this, Broadcom and NVIDIA have collaborated on a joint GenAI platform, VMware Private AI Foundation with NVIDIA. The platform enables enterprises to fine-tune LLMs, deploy retrieval-augmented generation (RAG) workflows, and run inference workloads in their own data centers, addressing privacy, choice, cost, performance, and compliance concerns:

  • Enable privacy, security, and compliance of AI models: VMware Private AI Foundation with NVIDIA’s architectural approach for AI services enables privacy and control of corporate data and integrated security and management.
  • Simplify GenAI deployment and optimize costs: VMware Private AI Foundation with NVIDIA empowers enterprises to simplify deployment and run their GenAI models with an optimal, cost-effective solution.
  • Accelerate performance for any LLM: Broadcom and NVIDIA have designed the software and hardware to extract maximum performance from your GenAI models. The integrated capabilities built into the VMware Cloud Foundation (VCF) platform include VMware Distributed Resource Scheduler (DRS), virtualization and pooling of GPUs, GPU monitoring, live migration, instant cloning (which can deploy multi-node clusters with preloaded models in seconds), and scaling of GPU input/output with NVIDIA NVLink and NVIDIA NVSwitch.

Core Components of the Reference Architecture

The solution integrates cutting-edge hardware and software technologies to create a seamless environment for GenAI deployment.

Hardware Infrastructure

  • NVIDIA-Certified HGX Systems: Equipped with 8x H100 or H200 GPUs, interconnected by NVIDIA NVSwitch and NVLink for high-speed inter-GPU communication within a single server.
  • NVIDIA Spectrum-X Networking: Ethernet-based networking ensures reliable, cost-effective, low-latency communication optimized for scalable and flexible deployments.

Software Stack

  • VMware Cloud Foundation
    • Provides virtualized compute, networking, and storage with VMware vSphere, NSX, and vSAN.
    • VMware Kubernetes Service (VKS) supports containerized AI workloads.
    • VMware Private AI Foundation with NVIDIA
      • Preconfigured deep learning virtual machines (DLVMs) simplify GPU-enabled VM image deployment.
      • VCF Automation enables automation and self-service.
      • VCF Operations allows for monitoring the infrastructure and provides advanced analytics, logging, and diagnostics to enhance efficiency and performance.
  • NVIDIA AI Enterprise
    • NVIDIA vGPU (C-Series) technology enables efficient GPU resource pooling and sharing.
    • Includes NIM for LLM inference (a sample request is sketched after this list) and NeMo microservices for LLM customization.
    • Leverages NVIDIA TensorRT optimizations for inference performance.
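
NIM serves models through an OpenAI-compatible HTTP API, so any standard HTTP client can query a deployed LLM. Below is a minimal sketch of a chat-completion request in Python; the endpoint URL and model name are placeholders for whatever your deployment actually serves.

```python
# Minimal sketch: chat-completion request against a NIM endpoint.
# NIM exposes an OpenAI-compatible API; the URL and model name below
# are placeholders for your own deployment.
import requests

NIM_URL = "http://nim.example.internal:8000/v1/chat/completions"  # placeholder

payload = {
    "model": "meta/llama-3.1-8b-instruct",  # placeholder model name
    "messages": [{"role": "user", "content": "Summarize our return policy."}],
    "max_tokens": 256,
}

resp = requests.post(NIM_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Because the API follows the OpenAI schema, existing client libraries and application frameworks can typically be pointed at a NIM endpoint without code changes.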

Key Features

  • GPU sharing through NVIDIA vGPU (C-Series) technology.
  • Model governance allows data scientists to test, evaluate, and store pre-trained LLMs or containers that are deemed safe and suitable for business use.
  • Vector databases, provided by the Data Services Manager (DSM), for RAG workflows (a sample similarity query is sketched after this list).
  • Self-service automation enables the provisioning of DLVMs for model development and Kubernetes clusters for production scaling.
  • Integration with the NVIDIA NGC catalog for pre-tested, GPU-optimized containers.
  • GPU monitoring via VCF Operations provides real-time visibility into hardware performance.
  • Efficient utilization of CPU cores and memory ensures that other data center tasks remain unaffected.
  • Software-defined storage (SDS) and cloud-native storage with vSAN provide scalable and resilient storage for AI workloads, ensuring efficient data management across AI/ML pipelines.
  • Software-defined networking (SDN) with NSX enables secure, high-performance networking for AI inference, providing micro-segmentation, load balancing, and seamless connectivity across on-premises and cloud environments.
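
To make the RAG building blocks above concrete, here is a minimal sketch of a similarity search against a pgvector-enabled PostgreSQL database such as one provisioned by DSM. The hostname, credentials, table schema, and stand-in embedding are illustrative assumptions, not values from the reference architecture.

```python
# Hedged sketch: top-k similarity search against a pgvector-enabled
# PostgreSQL instance, such as one provisioned by Data Services Manager.
# Hostname, credentials, and the 'documents' schema are assumptions.
import psycopg2

conn = psycopg2.connect(
    host="dsm-pg.example.internal",  # placeholder DSM-provisioned endpoint
    dbname="rag", user="rag_user", password="change-me",
)

# Stand-in query embedding; in a real pipeline this comes from the same
# embedding model that was used to index the documents.
query_embedding = [0.0] * 768
vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"

with conn.cursor() as cur:
    # '<=>' is pgvector's cosine-distance operator.
    cur.execute(
        "SELECT chunk_text FROM documents "
        "ORDER BY embedding <=> %s::vector LIMIT 5",
        (vec_literal,),
    )
    context_chunks = [row[0] for row in cur.fetchall()]

print(f"Retrieved {len(context_chunks)} chunks for the prompt context.")
```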

Deployment Considerations

Deploying VMware Private AI Foundation with NVIDIA involves several critical steps:

  1. Infrastructure readiness:
    • Ensure VMware ESX hosts are equipped with supported NVIDIA GPUs.
    • Configure VMware Cloud Foundation on vSAN ReadyNodes.
    • Obtain VMware Private AI Foundation add-on licenses.
    • Configure the virtual infrastructure workload domain (VI WLD) for VMware Private AI Foundation with NVIDIA.
  2. Virtualization setup:
    • Deploy deep learning VMs or GPU-enabled VKS clusters through VCF Automation’s self-service catalog, based on workload requirements.
    • Use the Harbor OCI-compliant registry to manage container images in disconnected environments.
  3. Performance optimization:
    • Leverage NVLink and Spectrum-X networking for efficient data throughput.
    • Monitor GPU metrics using VCF Operations (a minimal in-guest spot check is sketched after this list).
  4. Resource allocation:
    • Leave CPU cores and memory that the inference workload does not need available to other tasks to maximize overall efficiency.
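
VCF Operations provides fleet-level GPU dashboards; for a quick spot check inside an individual deep learning VM, NVIDIA's NVML bindings report the same utilization and memory counters. A minimal sketch, assuming the nvidia-ml-py package and NVIDIA drivers are present in the guest:

```python
# Minimal in-guest GPU spot check using NVML (pip install nvidia-ml-py).
# This complements, rather than replaces, VCF Operations monitoring.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percentages
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes
        print(f"GPU {i} ({name}): {util.gpu}% busy, "
              f"{mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB memory")
finally:
    pynvml.nvmlShutdown()
```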

Performance Validation: Virtualization vs. Bare Metal

To quantify the cost of virtualization, the NVIDIA GenAI-Perf benchmarking tool was used to compare the virtualized environment against an equivalent bare-metal setup. Key findings include:

  • Throughput: Virtual GPUs delivered 1%–2% higher throughput than bare metal in certain scenarios.
  • Latency: Virtual GPUs exhibited 1%–2% lower time to first token (TTFT) at moderate concurrency levels but up to 2% higher TTFT at others.
  • Resource Efficiency:
    • Only 24 out of 208 logical CPU cores were utilized for inference workloads.
    • 256 GB out of 2 TB of memory was consumed, leaving significant resources available for other applications.

Figure: Throughput and TTFT ratios, virtual vs. bare metal, across concurrency levels.
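
For intuition about what GenAI-Perf measures, the sketch below times a single streaming request against an OpenAI-compatible endpoint and reports TTFT plus a rough output rate. The URL and model name are placeholders; GenAI-Perf automates this kind of measurement at scale, sweeping concurrency levels and aggregating the statistics reported above.

```python
# Illustrative sketch of the two metrics compared above: time to first
# token (TTFT) and output throughput, measured over one streaming request.
# The endpoint URL and model name are placeholders.
import time
import requests

URL = "http://nim.example.internal:8000/v1/chat/completions"  # placeholder
payload = {
    "model": "meta/llama-3.1-8b-instruct",  # placeholder model name
    "messages": [{"role": "user", "content": "Explain NVLink in one paragraph."}],
    "max_tokens": 128,
    "stream": True,
}

start = time.perf_counter()
first_token_at = None
chunks = 0  # each server-sent event roughly corresponds to one token
with requests.post(URL, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1
total = time.perf_counter() - start
print(f"TTFT: {(first_token_at - start) * 1e3:.1f} ms, "
      f"~{chunks / total:.1f} tokens/s over {total:.2f} s")
```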

These results demonstrate that virtualization achieves near bare-metal performance for GenAI inference, letting enterprises retain the operational benefits of virtualization at a negligible performance cost.

Real-World Applications and TCO Analysis

The benchmark results can be extrapolated to estimate capacity for real-world applications, including chatbots and virtual assistants, across tasks such as summarization, generation, and translation. Additionally, enterprises can perform a total cost of ownership (TCO) analysis by weighing infrastructure costs (for example, server hardware, hosting fees, and software licensing) against throughput metrics such as cost per input/output token. More details on performance and capacity planning are available in the reference architecture document.
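
As a simple illustration of the cost-per-token arithmetic, the sketch below amortizes hypothetical hardware and hosting costs over a measured throughput figure. Every number is a made-up placeholder rather than a figure from the reference architecture; substitute your own costs and benchmark results.

```python
# Back-of-the-envelope TCO sketch. All figures are placeholders, not
# numbers from the reference architecture.
server_cost_usd = 300_000.0      # hypothetical HGX server + software licenses
amortization_years = 4
hosting_usd_per_year = 20_000.0  # hypothetical power, cooling, and hosting

tokens_per_second = 5_000.0      # aggregate output throughput from benchmarks
utilization = 0.60               # fraction of time the system serves traffic

yearly_cost = server_cost_usd / amortization_years + hosting_usd_per_year
tokens_per_year = tokens_per_second * utilization * 365 * 24 * 3600

cost_per_million_tokens = yearly_cost / tokens_per_year * 1e6
print(f"~${cost_per_million_tokens:.2f} per million output tokens")
```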

Conclusion

VMware Private AI Foundation with NVIDIA empowers enterprises to securely deploy GenAI applications on-premises while optimizing costs and maintaining compliance. By leveraging cutting-edge hardware like NVIDIA H100/H200 GPUs and a robust software stack including VMware Cloud Foundation and NVIDIA AI Enterprise, IT teams can deliver high-performance AI solutions tailored to their unique business needs. Whether you’re fine-tuning LLMs or running inference workloads at scale, this platform provides the tools necessary to future-proof your enterprise AI strategy.

To learn more, download the complete reference architecture.

Acknowledgments
The authors thank Justin Murray, Vrushal Dongre, Shobhit Bhutani, and Julie Brodeur from Broadcom’s VMware Cloud Foundation division and Joe Cullen from NVIDIA for reviewing and improving the paper.