Some of you may have experienced the power of very large machine learning models recently by trying the hugely popular ChatGPT from OpenAI. This chatbot is backed by an ML model, based on GPT-3 or later, with billions of parameters. A model of that size takes many GPUs to train and is a taste of what is coming to the enterprise in AI/ML. Data scientists in enterprises are exploring models that may not yet be at that huge scale, but they are certainly seeing increasingly large models, as measured by the number of model parameters, demanding more acceleration and more GPU power.
In a previous post, we talked about the concept of “vendor device groups” in VMware vSphere 8 for presenting a set of two to four NVLink-connected GPUs on a server to a VM. One important use case for this is large ML model training, but it can also be put to other uses, such as analytics in HPC. In that article, we saw the NVLink bridge hardware used for those connections, as shown here.
In this article, we take the device group concept to the next level by exploring sets of eight GPUs that are connected using NVSwitch technology and presenting them, as one item, to a VM.
NVIDIA NVSwitch hardware comes either as a chip or as a separate unit that can stand independently outside of a server, as a component of a rack, for example. In this article, we will talk about the use of NVSwitch technology within one server.
Here is an internal view of a server that has eight GPUs in it, with six NVSwitch devices linking them together.
This multiple-GPU setup is becoming more common now in servers that NVIDIA supports in the “HGX” form factor. With HGX, NVIDIA supplies the baseboard that holds eight A100 or H100 GPUs and the six NVSwitches linking them together. Here, each GPU in the system has a very high-bandwidth link to every other GPU, far surpassing the bandwidth of the PCIe bus to which they still connect. This is represented as shown below.
The NVLink ports on each GPU are used to connect it to the NVSwitch devices rather than directly to other GPUs, as is done with NVLink bridge hardware.
With NVSwitch, data is routed directly from any one GPU to any other GPU in the setup. With NVLink on NVIDIA Hopper architecture GPUs, a pair of GPUs can exchange 450 GB/s in each direction, giving a total bidirectional bandwidth of 900 GB/s. On A100 GPUs, the unidirectional bandwidth is 300 GB/s, giving 600 GB/s of total bidirectional bandwidth.
To make it easier for the administrator to use this kind of setup, in VMware vSphere 8 Update 1 the full set of eight GPUs, and known subsets of the eight, are presented as device groups when we choose to add a PCIe device to a VM. The set of device groups is seen in the vSphere Client interface below. We used the earlier A100 40GB model in our lab tests here.
The vSphere Client user chooses among the device groups shown above to give the VM the appropriate amount of GPU power. The device groups shown here offer two, four, or eight full 40c-profile vGPUs to the VM, all of which communicate over the NVSwitch. The number after the word “Nvidia” in the Name field indicates the number of GPUs in that device group. Note that the string after the “@” in the device group Name field is a vGPU profile denoting the full memory of that GPU. This means that the benefits of the vGPU approach apply here, including correct placement of the VM onto the most suitable host.
These groups of GPUs are automatically discovered by the NVIDIA vGPU host driver software that is installed into vSphere. Once that host driver software is running, the administrator can assign one of the allowed device groups to a VM at configuration time.
In the VM’s guest operating system, the nvidia-smi tool shows all the GPUs that are available to it. The example below shows a setup where four GPUs were allocated as one device group to a VM. Since they were allocated as a device group, we know that NVSwitch technology connects them together.
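As a quick check inside the guest, a short script along the following lines can list those GPUs and print the interconnect topology. This is only a minimal sketch, assuming the NVIDIA guest driver (and therefore nvidia-smi) is installed in the VM and that Python is available; the exact output varies with the driver version and vGPU configuration.

```python
# Minimal sketch: list the GPUs visible to the guest OS and show the
# interconnect topology reported by nvidia-smi.
# Assumes the NVIDIA guest driver (and therefore nvidia-smi) is installed.
import subprocess

def run(cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Index, name, and memory of every GPU the VM can see.
print(run(["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv"]))

# Topology matrix; NVLink connections between the GPUs in the device group
# appear as NV-series entries in this output.
print(run(["nvidia-smi", "topo", "-m"]))
```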
The application developer can now proceed to make use of all four of these GPUs for training their model in TensorFlow, PyTorch, or another framework of their choosing.
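As an illustration, here is a minimal PyTorch sketch of a single training step that spreads a toy model across all of the GPUs the VM sees. It is only a sketch, assuming PyTorch with CUDA support is installed in the guest; a real training job would more likely use DistributedDataParallel or a framework-level distribution strategy.

```python
# Minimal sketch: one training step across all GPUs visible to the VM.
# Assumes PyTorch with CUDA support is installed in the guest OS.
import torch
import torch.nn as nn

num_gpus = torch.cuda.device_count()
print(f"GPUs visible to this VM: {num_gpus}")

# A toy model; DataParallel replicates it across all visible GPUs and splits
# each input batch among them. Inter-GPU copies can take advantage of the
# NVLink/NVSwitch fabric when peer-to-peer access is available.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
model = nn.DataParallel(model).cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data, just to exercise the multi-GPU path end to end.
inputs = torch.randn(256, 1024).cuda()
labels = torch.randint(0, 10, (256,)).cuda()

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)
loss.backward()
optimizer.step()
print(f"loss after one step: {loss.item():.4f}")
```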
Summary
High-end machine learning models can span multiple physical GPUs because of their growing memory needs and their demand for more compute power. VMware and NVIDIA now enable data scientists to train their models across up to eight GPUs in a single VM at once, with those GPUs communicating over high-bandwidth NVSwitch/NVLink connections. This demonstrates the joint commitment of VMware and NVIDIA to serving the needs of the most demanding machine learning and high-performance computing applications.