This article was co-authored by Joe Cullen, Technical Marketing Engineer at NVIDIA and Justin Murray, Technical Marketing, VMware.
Large Language Models (LLMs) are an essential component of Generative AI, which is why enterprises are rushing to integrate these models into their own datacenters today. LLMs allow forward-looking organizations to build smarter chatbots, create marketing material, summarize large documents and predict business conditions.
Enterprise Generative AI
ChatGPT is a very powerful tool, and many businesses are using it for serious projects. However, enterprises need the ability to customize foundation models with their own proprietary data, control access to that data, and scale into production. AI foundation models are advanced, low-cost starting points that enterprises can customize to accurately generate responses for their domain-specific use cases. Various skills and focus areas, along with continuous-improvement pipelines, ensure enterprises get the best responses and the model performance needed to deliver AI to their end users.
Early-adopting businesses are finding that, as they use LLMs, their data scientists need to run several different experiments in order to:
(a) Choose the correct compute infrastructure for running model customization, in addition to giving due consideration to inference workloads that will come later (and those platforms will be different)
(b) Choose the most appropriate foundation model and the right dataset for customizing the model
(c) Enrich the LLM to answer questions from an enterprise knowledge base
(d) Iterate and test models in a simple-to-use inference playground
(e) Scale out the production inference infrastructure quickly
The data scientist therefore has many moving parts to deal with, from the versions of Python all the way up to the different types of models they can use today. Industry and academic innovation is happening here, especially in the model space, at a very fast pace. It is hard to keep up, even for the experts!
Because of this landscape, data scientists are being overwhelmed, as they have to re-build or re-factor their LLM platforms and infrastructure almost on a daily basis. When we tested LLMs in-house at VMware, for our own use, we needed to move an entire project from one lab to another, to get more GPU power, for example.
This pattern of changing software versions and changing infrastructure is very natural for data science work today, and it is facilitated by VMware Cloud Foundation. It is far easier and quicker to create VMs, and Kubernetes containers within those VMs, than to make such a sweeping change on bare metal. This is a fast-moving environment: IT needs to provision those GPU-enabled VMs quickly and not wait around for a hardware purchase to complete.
The focus in this article is on two parts: (1) VMware Private AI Platform with NVIDIA AI Software, which helps the user to customize, retrain and deploy their models using NVIDIA’s techniques and (2) Underlying virtualized infrastructure, which makes rapid re-factoring easier for the data scientist and provides a robust, managed inference environment.
Production Ready Software for Generative AI
VMware and NVIDIA have collaborated for several years now on enabling virtualized GPUs, high-speed networking and developing ML tools and platforms for the data scientist to use. An outline of the jointly developed VMware and NVIDIA reference architecture for Gen AI is shown below.
VMware Private AI Platform
VMware Private AI Platform with NVIDIA brings together VMware Cloud Foundation and NVIDIA’s AI software, making it easy to go from Gen AI development to production. VMware continues to drive technical innovations at the infrastructure layer to enable enterprises to customize and deploy LLMs for Generative AI. The key technical recommendations for Generative AI deployment are given in the Baseline Reference Architecture document.
The key VMware infrastructure components for Generative AI should be very familiar to users of vSphere today. For example, one of the vSphere technical requirements for customizing LLMs is to use multiple virtualized GPUs and high-speed networking, described in the baseline reference architecture. LLM customization typically requires multiple full vGPU profiles assigned to the VM, providing more GPU memory for model customization.
NOTE: The details such as the NVIDIA drivers and GPU/network operators are given in the baseline reference architecture.
The NVIDIA AI Enterprise software components used for customizing LLMs include frameworks and tools as well as a Generative AI Knowledge Base reference workflow. Organizations can start their AI journey by using the open, freely available NVIDIA AI frameworks to experiment and pilot. But when they’re ready to move from pilot to production, enterprises can easily transition to a fully managed and secure AI platform with an NVIDIA AI Enterprise subscription. This gives enterprises deploying business-critical AI the assurance of business continuity with NVIDIA Enterprise Support and access to NVIDIA AI experts. The NVIDIA AI Enterprise software suite is fully supported on vSphere. The following sections describe the NVIDIA AI Enterprise tools and software components for Generative AI.
NVIDIA NeMo Framework
NVIDIA NeMo is a game-changer for enterprises looking to leverage generative AI. This end-to-end, cloud-native framework is used to build, customize, and deploy generative AI models anywhere. It includes training and inferencing containers with libraries such as PyTorch Lightning that can be used for p-tuning a foundation model.
NeMo allows enterprises to continuously refine LLMs with techniques such as p-tuning and reinforcement learning from human feedback. This flexibility enables the development of functional skills, a focus on specific domains, and the prevention of inappropriate responses.
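To give a feel for what p-tuning with NeMo looks like in practice, here is a minimal sketch modeled on the NeMo 1.x prompt-learning examples. The class path, config file name and config fields are assumptions for illustration only; consult the NeMo documentation for the exact API in your release.

```python
# Minimal p-tuning sketch (assumes NeMo 1.x prompt-learning APIs; config and paths are illustrative).
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.collections.nlp.models.language_modeling.megatron_gpt_prompt_learning_model import (
    MegatronGPTPromptLearningModel,
)
from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy

# Hypothetical config describing the base .nemo model, the p-tuning task and the training data.
cfg = OmegaConf.load("megatron_gpt_prompt_learning_config.yaml")

# One Lightning trainer per worker VM; `devices` maps to the vGPUs assigned to that VM.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=cfg.trainer.devices,
    strategy=NLPDDPStrategy(),
    max_epochs=cfg.trainer.max_epochs,
)

# Build the prompt-learning model around the frozen foundation model and run p-tuning.
model = MegatronGPTPromptLearningModel(cfg=cfg.model, trainer=trainer)
trainer.fit(model)
```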
NeMo integrates seamlessly with the NVIDIA Triton Inference Server to accelerate the inference process, delivering cutting-edge accuracy, low latency, and high throughput. As part of the NVIDIA AI Enterprise software suite, NeMo is backed by a team of dedicated NVIDIA experts providing unparalleled support and expertise.
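Once the customized model is hosted behind Triton, applications query it over HTTP or gRPC using the standard Triton client libraries. The sketch below uses the tritonclient Python package; the endpoint, model name and tensor names are placeholders that depend on how the model was exported.

```python
# Query a Triton-hosted LLM endpoint (endpoint, model and tensor names below are placeholders).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="triton.example.internal:8000")

# The exported model defines its own input/output tensor names; "text_input"/"text_output" are illustrative.
prompt = np.array([["Summarize our Q3 support-ticket trends."]], dtype=object)
text_input = httpclient.InferInput("text_input", list(prompt.shape), "BYTES")
text_input.set_data_from_numpy(prompt)

result = client.infer(model_name="llama2_ptuned", inputs=[text_input])
print(result.as_numpy("text_output"))
```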
TensorRT-LLM
Once an LLM has been customized, it can be optimized for low-latency, high-throughput inference using NVIDIA TensorRT-LLM. TensorRT-LLM is a toolkit to assemble optimized solutions to perform LLM inference. Its Python API can be leveraged to define models and compile efficient TensorRT engines for NVIDIA GPUs. Additionally, Python and C++ components can be used to build runtimes to execute those engines. TensorRT-LLM supports multi-GPU and multi-node configurations (through MPI) and also includes Python and C++ backends for the NVIDIA Triton Inference Server to assemble solutions for online LLM serving.
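As a hedged illustration of that Python API, the sketch below uses the high-level LLM class shipped in recent TensorRT-LLM releases to build an engine from a Hugging Face checkpoint and run a quick test generation. The model name is a placeholder, and the exact API surface varies by release.

```python
# High-level TensorRT-LLM sketch: compile an engine from a HF checkpoint and generate.
# (Assumes a recent TensorRT-LLM release that ships the LLM API; the model name is a placeholder.)
from tensorrt_llm import LLM, SamplingParams

# Building the TensorRT engine for the local GPUs happens on first construction.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

prompts = ["What does our travel-expense policy say about airfare class?"]
sampling = SamplingParams(temperature=0.2, max_tokens=128)

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```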
NVIDIA Generative AI Knowledge Base Q&A Blueprint
The NVIDIA Generative AI Knowledge Base Questions and Answering AI Blueprint is a reference example of how to use the aforementioned NVIDIA AI Enterprise components to build a Generative AI chatbot. By leveraging these components, the chatbot is able to accurately answer domain-specific questions, based on the enterprise knowledge-base entities.
The NVIDIA Generative AI Knowledge Base Questions and Answering AI Blueprint leverages an existing open-source community foundation model (Llama 2) and performs prompt-tuning (p-tuning). Adapting an existing foundation model is a low-cost approach that enterprises can use to accurately generate responses for their specific use case. Once the model has been p-tuned, it is chained to a vector database using LangChain. This allows multiple LLM applications to talk to the LLM, with answers grounded in real enterprise data sources.
This AI reference workflow contains:
- TensorRT LLM (TRT-LLM) for low latency and high throughput inference for LLMs
- LangChain and a vector database (see the sketch following this list)
- Cloud Native deployable bundle packaged as helm charts
- Guidance on performing training and customization of the AI solution to fit your specific use case (for example, LoRA fine-tuning or p-tuning)
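To make the retrieval piece concrete, the following minimal LangChain sketch indexes enterprise documents in a vector store and retrieves the chunks that ground the LLM's answers. The document path and embedding model are illustrative placeholders; the blueprint itself ships a fully worked pipeline.

```python
# Minimal retrieval sketch with LangChain (document path and embedding model are placeholders).
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1. Load and chunk the enterprise knowledge-base documents.
docs = TextLoader("kb/employee_handbook.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64).split_documents(docs)

# 2. Embed the chunks and build the vector index.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# 3. Retrieve the chunks most relevant to a user question.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
context_docs = retriever.get_relevant_documents("How many vacation days do new employees receive?")

# The blueprint then passes these retrieved chunks plus the question to the p-tuned LLM
# (served by Triton) through a LangChain chain such as RetrievalQA.
for doc in context_docs:
    print(doc.page_content[:120])
```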
VMware Infrastructure for Gen AI
In keeping with the VMware+NVIDIA AI-Ready Enterprise strategy, VMware continues to drive technical innovations at the infrastructure layer to enable enterprises to train and deploy LLMs for Generative AI. The key technical recommendations for Generative AI deployment are provided within VMware’s Baseline Reference Architecture document and we look at a few of those new features here.
LLM customization and inference pipelines are orchestrated using Kubernetes. This is the orchestration platform of choice for VMware and NVIDIA, since many LLM tools and platforms make use of Kubernetes today. A combination of GPU-capable and non-GPU-capable nodes in your Kubernetes cluster is recommended as well. The following screenshot illustrates a simpler version of a Gen AI cluster, deployed on vSphere and VMware Cloud Foundation (VCF).
Tanzu Kubernetes Clusters (TKCs) provide enterprises with the ability to create LLM workload clusters very quickly and adjust their size and content as needed. This provides the flexibility to accommodate various data science teams, where each team needs separate ML toolkits or platform versions. Each can co-exist on the same hardware, but each team has its own TKC as a self-contained sandbox. With suitable permissions, the devops or administrator user can create a new cluster with one “kubectl apply -f” command, and make changes to the clusters in a very similar way.
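Once such a workload cluster is up, a short script like the one below (a sketch using the Kubernetes Python client, assuming the NVIDIA GPU Operator advertises the nvidia.com/gpu resource) can confirm which worker nodes actually expose GPUs to the scheduler, i.e., the mix of GPU-capable and non-GPU nodes mentioned above.

```python
# List GPU-capable vs. non-GPU nodes in the workload cluster.
# Sketch only; assumes kubeconfig access and that the NVIDIA GPU Operator advertises "nvidia.com/gpu".
from kubernetes import client, config

config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    kind = "GPU-capable" if gpus not in ("0", None) else "non-GPU"
    print(f"{node.metadata.name}: {kind} ({gpus} allocatable GPUs)")
```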
LLM foundation models may not fit within the framebuffer of a single GPU, so LLM customization workloads typically require several GPUs attached to their worker VM. The “VM class” mechanism in vSphere makes this multi-GPU arrangement very easy to set up: a VM can be assigned one, two or more virtual GPUs using NVIDIA vGPU profiles.
In the example screenshot below, an administrator or devops person is choosing a device group which will be added to a VM.
Figure 6 : Device Groups for Multiple GPUs with NVLink/NVSwitch viewed in vSphere 8
This device group can represent 2, 4 or 8 GPUs. Additionally, vSphere can now discover if those GPUs are using NVLink and NVSwitch at the hardware layer. These features make it very easy to construct a powerful VM that can handle very large model sizes.
GPUs which leverage NVSwitch/NVLink have a very high-bandwidth connection directly between all the GPUs. This allows up to 600 GB/s of bidirectional bandwidth on Ampere-class GPUs and up to 900 GB/s of bidirectional bandwidth on Hopper-class (H100) GPUs. These levels of speed are needed when large models are being trained across the full set of GPUs available in HGX-class machines from server vendors. The following diagram further illustrates NVSwitch/NVLink-based GPUs.
Multi-Node Training/Customization of Models
In some cases, data science users will want to distribute model customization across multiple servers, each with its own VMs (referred to as “multi-node training”). This approach is required if a model does not fit within the combined GPU memory of a single server’s collection of GPUs. The infrastructure required for multi-node training involves both GPUs and high-speed networking cards, such as NVIDIA ConnectX-7 and BlueField, which have been tested and validated on VMware vSphere and VMware Cloud Foundation. NVIDIA recommends 200 Gb/s of networking bandwidth between servers/nodes for multi-node training, particularly for east-west traffic, since it carries the model gradients between the training participants. For north-south networking traffic, i.e., inference requests coming in from outside the enterprise, the recommendation is to use BlueField technology, thus offloading many of the security functions from the CPU.
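For context, the skeleton below shows what the distributed side of such a multi-node job typically looks like at the PyTorch level: each worker VM joins a NCCL process group over the high-speed east-west network and wraps its model in DistributedDataParallel. The environment-variable handling follows standard torchrun conventions; in practice, NeMo takes care of this plumbing for you.

```python
# Skeleton of a multi-node training worker (launched on each VM, e.g. via torchrun).
# NCCL traffic between workers rides the high-speed east-west network (e.g. ConnectX-7).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE; MASTER_ADDR points at the rank-0 VM.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)  # stand-in for the real LLM
    model = DDP(model, device_ids=[local_rank])

    # ... training loop: forward, backward (gradients are all-reduced over NCCL), optimizer step ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```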
The following graphic illustrates a multi-node training compute node with four GPUs, plus ConnectX-7 and BlueField high-speed networking.
For Kubernetes environments on VMware platforms, NVIDIA supplies the Network Operator as part of the NVIDIA AI Enterprise suite. The network operator eases the installation and ongoing management of the networking drivers (e.g., the MOFED, peer-to-peer drivers) onto the relevant Kubernetes nodes. More technical details on that multi-node, distributed setup are given in the Base Reference Architecture document from VMware.
Summary
This blog captures the key technologies from NVIDIA and VMware which together form a solid basis for your machine learning, LLM and generative AI developments. Together, the two companies’ technologies offer a robust solution for LLM customization and model deployment for production inference.
In this article we have reviewed the NVIDIA AI Enterprise suite of software and frameworks for all of the above. These depend on robust and scalable infrastructure to drive them. VMware Cloud Foundation and vSphere together provide a platform capable of the quick deployment and rapid change that data scientists need to do their work effectively. VMware and NVIDIA are partnering to support you in this new and exciting field of generative AI.