This article was co-authored by Joe Cullen, Technical Marketing Engineer at NVIDIA and Justin Murray, Technical Marketing, VMware.
Large Language Models (LLMs) are an essential component of Generative AI, which is why enterprises are rushing to integrate these models into their own datacenters today. LLMs allow forward-looking organizations to build smarter chatbots, create marketing material, summarize large documents and predict business conditions.
Enterprise Generative AI
ChatGPT is a very powerful tool, and many businesses are using it for serious projects. However, enterprises need the ability to customize foundation models with their own proprietary data, control access to that data, and scale into production. AI foundation models are advanced, low-cost starting points that enterprises can customize to accurately generate responses for their domain-specific use cases. Various skills and focus areas, along with continuous-improvement pipelines, ensure enterprises get the best responses and the model performance needed to deliver AI to their end users.
Early-adopting businesses are finding that, as they use LLMs, their data scientists need to run several different experiments in order to:
(a) Choose the correct compute infrastructure for running model customization, in addition to giving due consideration to inference workloads that will come later (and those platforms will be different)
(b) Choose the most appropriate foundation model and the right dataset for customizing the model
(c) Enrich the LLM to answer questions from an enterprise knowledge base
(d) Iterate and test models in a simple-to-use inference playground
(e) Scale out the production inference infrastructure quickly
The data scientist therefore has many moving parts to deal with, from the versions of Python all the way up to the different types of models they can use today. Industry and academic innovation is happening here, especially in the model space, at a very fast pace. It is hard to keep up, even for the experts!
Because of this landscape, data scientists are being overwhelmed, as they have to re-build or re-factor their LLM platforms and infrastructure almost on a daily basis. When we tested LLMs in-house at VMware, for our own use, we needed to move an entire project from one lab to another, to get more GPU power, for example.
This pattern of changing software versions and changing infrastructure is very natural for data science work today, and it is facilitated by VMware Cloud Foundation. It is far easier and quicker to create VMs, and Kubernetes containers within those VMs, than to make such a sweeping change on bare metal. This is a fast-moving environment: IT needs to provision those GPU-enabled VMs quickly and not wait around for a hardware purchase to complete.
The focus in this article is on two parts: (1) VMware Private AI Platform with NVIDIA AI Software, which helps the user to customize, retrain and deploy their models using NVIDIA’s techniques and (2) Underlying virtualized infrastructure, which makes rapid re-factoring easier for the data scientist and provides a robust, managed inference environment.
Production Ready Software for Generative AI
VMware and NVIDIA have collaborated for several years now on enabling virtualized GPUs, high-speed networking and developing ML tools and platforms for the data scientist to use. An outline of the jointly developed VMware and NVIDIA reference architecture for Gen AI is shown below.
VMware Private AI Platform
VMware Private AI Platform with NVIDIA brings together VMware Cloud Foundation and NVIDIA’s AI software, making it easy to go from Gen AI development to production. VMware continues to drive technical innovations at the infrastructure layer to enable enterprises to customize and deploy LLMs for Generative AI. The key technical recommendations for Generative AI deployment are given in the Baseline Reference Architecture document.
The key VMware infrastructure components for Generative AI should be very familiar to users of vSphere today. For example, one of the vSphere technical requirements for customizing LLMs is to use multiple virtualized GPUs and high-speed networking, described in the baseline reference architecture. LLM customization typically requires multiple full vGPU profiles assigned to the VM, providing more GPU memory for model customization.
NOTE: The details such as the NVIDIA drivers and GPU/network operators are given in the baseline reference architecture.
The NVIDIA AI Enterprise software components used for customizing LLMs include frameworks and tools as well as a Generative AI Knowledge Base reference workflow. Organizations can start their AI journey by using the open, freely available NVIDIA AI frameworks to experiment and pilot. But when they’re ready to move from pilot to production, enterprises can easily transition to a fully managed and secure AI platform with an NVIDIA AI Enterprise subscription. This gives enterprises deploying business-critical AI the assurance of business continuity with NVIDIA Enterprise Support and access to NVIDIA AI experts. The NVIDIA AI Enterprise software suite is fully supported on vSphere. The following sections describe the NVIDIA AI Enterprise tools and software components for Generative AI.
NVIDIA NeMo Framework
NVIDIA NeMo is a game-changer for enterprises looking to leverage generative AI. This end-to-end, cloud-native framework is used to build, customize, and deploy generative AI models anywhere. It includes training and inferencing containers with libraries such as PyTorch Lightning that can be used for p-tuning a foundation model.
NeMo allows enterprises to continuously refine LLMs with techniques such as p-tuning and reinforcement learning from human feedback. This flexibility enables the development of functional skills, a focus on specific domains, and the prevention of inappropriate responses.
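To give a feel for what p-tuning with NeMo looks like in practice, here is a minimal sketch modeled on the NeMo 1.x prompt-learning examples. The class path, config file name and config fields are assumptions for illustration only; consult the NeMo documentation for the exact API in your release.

```python
# Minimal p-tuning sketch (assumes NeMo 1.x prompt-learning APIs; config and paths are illustrative).
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.nlp.models.language_modeling.megatron_gpt_prompt_learning_model import (
    MegatronGPTPromptLearningModel,
)
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy

# Hypothetical config describing the base .nemo model, the p-tuning task and the training data.
cfg = OmegaConf.load("megatron_gpt_prompt_learning_config.yaml")

# One Lightning trainer per worker VM; `devices` maps to the vGPUs assigned to that VM.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=cfg.trainer.devices,
    strategy=NLPDDPStrategy(),
    max_epochs=cfg.trainer.max_epochs,
)

# Build the prompt-learning model around the frozen foundation model and run p-tuning.
model = MegatronGPTPromptLearningModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)
```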
NeMo integrates seamlessly with the NVIDIA Triton Inference Server to accelerate the inference process, delivering cutting-edge accuracy, low latency, and high throughput. As part of the NVIDIA AI Enterprise software suite, NeMo is backed by a team of dedicated NVIDIA experts providing unparalleled support and expertise.
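Once the customized model is hosted behind Triton, applications query it over HTTP or gRPC using the standard Triton client libraries. The sketch below uses the tritonclient Python package; the endpoint, model name and tensor names are placeholders that depend on how the model was exported.

```python
# Query a Triton-hosted LLM endpoint (endpoint, model and tensor names below are placeholders).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="triton.example.internal:8000")

# The exported model defines its own input/output tensor names; "text_input"/"text_output" are illustrative.
prompt = np.array([["Summarize our Q3 support-ticket trends."]], dtype=object)
text_input = httpclient.InferInput("text_input", list(prompt.shape), "BYTES")
text_input.set_data_from_numpy(prompt)

result = client.infer(model_name="llama2_ptuned", inputs=[text_input])
print(result.as_numpy("text_output"))
```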
TensorRT-LLM
Once an LLM has been customized, it can be optimized for low-latency, high-throughput inference using NVIDIA TensorRT-LLM. TensorRT-LLM is a toolkit to assemble optimized solutions to perform LLM inference. Its Python API can be leveraged to define models and compile efficient TensorRT engines for NVIDIA GPUs. Additionally, Python and C++ components can be used to build runtimes to execute those engines. TensorRT-LLM supports multi-GPU and multi-node configurations (through MPI) and also includes Python and C++ backends for the NVIDIA Triton Inference Server to assemble solutions for online LLM serving.
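As a hedged illustration of that Python API, the sketch below uses the high-level LLM class shipped in recent TensorRT-LLM releases to build an engine from a Hugging Face checkpoint and run a quick test generation. The model name is a placeholder, and the exact API surface varies by release.

```python
# High-level TensorRT-LLM sketch: compile an engine from a HF checkpoint and generate.
# (Assumes a recent TensorRT-LLM release that ships the LLM API; the model name is a placeholder.)
from tensorrt_llm import LLM, SamplingParams

# Building the TensorRT engine for the local GPUs happens on first construction.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

prompts = ["What does our travel-expense policy say about airfare class?"]
sampling = SamplingParams(temperature=0.2, max_tokens=128)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```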
NVIDIA Generative AI Knowledge Base Q&A Blueprint
The NVIDIA Generative AI Knowledge Base Questions and Answering AI Blueprint is a reference example of how to use the aforementioned NVIDIA AI Enterprise components to build a Generative AI chatbot. By leveraging these components, the chatbot is able to accurately answer domain-specific questions, based on the enterprise knowledge-base entities.
The NVIDIA Generative AI Knowledge Base Questions and Answering AI Blueprint leverages an existing open-source community foundation model (Llama 2) and performs prompt-tuning (p-tuning). Adapting an existing foundation model is a low-cost approach that enterprises can use to accurately generate responses for their specific use case. Once the model has been p-tuned, it is chained to a vector database using LangChain. This allows multiple LLM applications to talk to the LLM, with answers grounded in real enterprise data sources.
This AI reference workflow contains:
- TensorRT LLM (TRT-LLM) for low latency and high throughput inference for LLMs
- LangChain and a vector database (see the sketch following this list)
- Cloud Native deployable bundle packaged as helm charts
- Guidance on performing training and customization of the AI solution to fit your specific use case (for example, LoRA fine-tuning or p-tuning)
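To make the retrieval piece concrete, the following minimal LangChain sketch indexes enterprise documents in a vector store and retrieves the chunks that ground the LLM's answers. The document path and embedding model are illustrative placeholders; the blueprint itself ships a fully worked pipeline.

```python
# Minimal retrieval sketch with LangChain (document path and embedding model are placeholders).
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1. Load and chunk the enterprise knowledge-base documents.
docs = TextLoader("kb/employee_handbook.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(docs)

# 2. Embed the chunks and build the vector index.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3. Retrieve the chunks most relevant to a user question.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
context_docs = retriever.get_relevant_documents("How many vacation days do new employees receive?")

# The blueprint then passes these retrieved chunks plus the question to the p-tuned LLM
# (served by Triton) through a LangChain chain such as RetrievalQA.
for doc in context_docs:
    print(doc.page_content[:120])
```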
VMware Infrastructure for Gen AI
In keeping with the VMware+NVIDIA AI-Ready Enterprise strategy, VMware continues to drive technical innovations at the infrastructure layer to enable enterprises to train and deploy LLMs for Generative AI. The key technical recommendations for Generative AI deployment are provided within VMware’s Baseline Reference Architecture document and we look at a few of those new features here.
LLM customization and inference pipelines are orchestrated using Kubernetes. This is the orchestration platform of choice for VMware and NVIDIA, since many LLM tools and platforms make use of Kubernetes today. A combination of GPU-capable and non-GPU-capable nodes in your Kubernetes cluster is recommended as well. The following screenshot illustrates a simpler version of a Gen AI cluster, deployed on vSphere and VMware Cloud Foundation (VCF).
Tanzu Kubernetes Clusters (TKCs) provide enterprises with the ability to create LLM workload clusters very quickly and adjust their size and content as needed. This provides the flexibility to accommodate various data science teams, where each team needs separate ML toolkits or platform versions. Each can co-exist on the same hardware, but each team has its own TKC as a self-contained sandbox. With suitable permissions, the devops or administrator user can create a new cluster with one “kubectl apply -f” command, and make changes to the clusters in a very similar way.
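Once such a workload cluster is up, a short script like the one below (a sketch using the Kubernetes Python client, assuming the NVIDIA GPU Operator advertises the nvidia.com/gpu resource) can confirm which worker nodes actually expose GPUs to the scheduler, i.e., the mix of GPU-capable and non-GPU nodes mentioned above.

```python
# List GPU-capable vs. non-GPU nodes in the workload cluster.
# Sketch only; assumes kubeconfig access and that the NVIDIA GPU Operator advertises "nvidia.com/gpu".
from kubernetes import client, config

config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    kind = "GPU-capable" if gpus not in ("0", None) else "non-GPU"
    print(f"{node.metadata.name}: {kind} ({gpus} allocatable GPUs)")
```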
LLM foundation models may not fit within the framebuffer of a single GPU, so LLM customization workloads typically require several GPUs attached to their worker VM. The “VM class” mechanism in vSphere makes this multi-GPU arrangement very easy to set up: a VM can be assigned one, two or more virtual GPUs using NVIDIA vGPU profiles.
In the example screenshot below, an administrator or devops person is choosing a device group which will be added to a VM.
Figure 6 : Device Groups for Multiple GPUs with NVLink/NVSwitch viewed in vSphere 8
This device group can represent 2, 4 or 8 GPUs. Additionally, vSphere can now discover if those GPUs are using NVLink and NVSwitch at the hardware layer. These features make it very easy to construct a powerful VM that can handle very large model sizes.
GPUs which leverage NVSwitch/NVLink have a very high-bandwidth connection directly between all the GPUs. This allows up to 600 GB/s of bidirectional bandwidth on Ampere-class GPUs and up to 900 GB/s of bidirectional bandwidth on Hopper-class (H100) GPUs. These levels of speed are needed when large models are being trained across the full set of GPUs available in HGX-class machines from server vendors. The following diagram further illustrates NVSwitch/NVLink-based GPUs.
Multi-Node Training/Customization of Models
In some cases, data science users will want to distribute model customization across multiple servers, each with its own VMs (referred to as “multi-node training”). This approach is required if a model does not fit within the combined GPU memory of a single server’s collection of GPUs. The infrastructure required for multi-node training involves both GPUs and high-speed networking cards, such as NVIDIA ConnectX-7 and BlueField, which have been tested and validated on VMware vSphere and VMware Cloud Foundation. NVIDIA recommends 200 Gb/s of networking bandwidth between servers/nodes for multi-node training, particularly for east-west traffic, since it carries the model gradients between the training participants. For north-south networking traffic, i.e., inference requests coming in from outside the enterprise, the recommendation is to use BlueField technology, thus offloading many of the security functions from the CPU.
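For context, the skeleton below shows what the distributed side of such a multi-node job typically looks like at the PyTorch level: each worker VM joins a NCCL process group over the high-speed east-west network and wraps its model in DistributedDataParallel. The environment-variable handling follows standard torchrun conventions; in practice, NeMo takes care of this plumbing for you.

```python
# Skeleton of a multi-node training worker (launched on each VM, e.g. via torchrun).
# NCCL traffic between workers rides the high-speed east-west network (e.g. ConnectX-7).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE; MASTER_ADDR points at the rank-0 VM.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # stand-in for the real LLM
    model = DDP(model, device_ids=[local_rank])

    # ... training loop: forward, backward (gradients are all-reduced over NCCL), optimizer step ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```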
The following graphic illustrates a multi-node training compute node with four GPUs, plus ConnectX-7 and BlueField high-speed networking.
For Kubernetes environments on VMware platforms, NVIDIA supplies the Network Operator as part of the NVIDIA AI Enterprise suite. The network operator eases the installation and ongoing management of the networking drivers (e.g., the MOFED, peer-to-peer drivers) onto the relevant Kubernetes nodes. More technical details on that multi-node, distributed setup are given in the Base Reference Architecture document from VMware.
Summary
This blog captures the key technologies from NVIDIA and VMware which together form a solid basis for your machine learning, LLM and generative AI developments. Together, the two companies’ technologies offer a robust solution for LLM customization and model deployment for production inference.
In this article we have reviewed the NVIDIA AI Enterprise suite of software and frameworks for all of the above. These depend on robust and scalable infrastructure to drive them. VMware Cloud Foundation and vSphere together provide a platform capable of the quick deployment and rapid change that data scientists need to do their work effectively. VMware and NVIDIA are partnering to support you in this new and exciting field of generative AI.