Data scientists and machine learning developers are building and working with very large language models (LLMs) these days, and those models have ever-greater GPU memory needs. Many of these larger ML applications need more than one NVIDIA GPU device on the vSphere servers on which they run, or they may need separate GPUs to communicate with each other over the local network. This is often done to expand the overall GPU framebuffer memory available to the model, though there are other reasons as well.
Servers now exist on the market with eight or more physical GPUs in them, and that number of GPUs per server will likely grow over time. With vSphere 8, you can add up to 8 virtual GPUs (vGPUs) to one VM. You can go beyond that with Passthrough (DirectPath I/O) mode, but our focus in this article is on vGPU, as that gives us vMotion capability and more control over the VM’s consumption. The ability to use multiple full physical GPUs for ML, represented as vGPUs in vSphere, is why customers now ask us more frequently about deploying multiple GPUs on a server and how best to configure them. The expanded vGPU capability per VM, along with the new concept of “device groups” in vSphere 8, was implemented to address this customer need.
Figure 1: A simple example PCIe layout of NIC and GPU devices within a server
We start by looking at the ways in which multiple devices in servers are physically connected today. The physical device layout at the PCIe bus level (and at the interconnect layer) can vary widely from one host server to another. GPUs or other devices can be located far from each other in the PCIe hierarchy, or they can be close to one another. Communication between such devices is influenced heavily by their proximity to each other. This presents choices as to where vSphere Distributed Resource Scheduler (DRS) should place a VM at boot-up time. For this reason, choosing the most suitable physical server for your VM to run on becomes very important. vSphere 8 gives the vSphere Client user much more automation and control over that choice of destination host, through the new concept of device groups.
An example physical server layout with two CPU sockets is shown in Figure 1 above. NVIDIA NVLINK (a high-bandwidth interconnect between GPUs) is shown on the lower left, linking two GPUs. You can read more detail on VMware’s performance tests using GPUs with NVLINK here. The short summary is that NVLINK benefits your performance significantly when multiple GPUs are used together.
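If you already have GPU-enabled VMs running, you can confirm from inside the guest whether NVLINK is actually active between the assigned GPUs. The short sketch below is just a minimal wrapper around the NVIDIA driver's nvidia-smi tool; it assumes the NVIDIA guest driver (and therefore nvidia-smi) is installed where it runs.

```python
# Minimal sketch: confirm NVLINK is active between the GPUs assigned to a VM.
# Assumes the NVIDIA guest driver (and its nvidia-smi tool) is installed.
import subprocess

def nvlink_status() -> str:
    # 'nvidia-smi nvlink --status' lists each GPU's NVLINK lanes and their
    # per-link bandwidth; GPUs with no NVLINK connectivity report none.
    result = subprocess.run(
        ["nvidia-smi", "nvlink", "--status"],
        capture_output=True, text=True, check=True
    )
    return result.stdout

if __name__ == "__main__":
    print(nvlink_status())
```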
The PCIe switch setup on the right-hand side of Figure 1 is also becoming more common in converged architectures, because it allows faster GPU-to-NIC communication. The GPU-to-GPU connection over NVLINK represents scaling up your GPU power to larger models within a server, while the GPU-to-NIC connection is geared toward scaling out across multiple servers, for distributed training for example.
The decision on initial placement of a VM onto a particular ESXi host is made by vSphere DRS, with minimal need for administrator intervention. In this article, we look at how DRS in vSphere 8 makes its VM placement decisions when device groups are specified.
Up to now, the vSphere Client user has had very limited control over DRS’ decisions on initial VM placement onto a suitable host server. Device groups in vSphere 8 give the user visibility into a collection of PCIe devices that are paired with each other at the hardware level, either using NVLINK or through a common PCIe switch.
This vSphere 8 device groups feature allows a set of PCIe devices to be presented to a VM as one unit for use in VM placement decisions by DRS. By adding a device group to the VM, the VM creator establishes a contract that tells the DRS placement algorithm what it needs to find in the hardware to satisfy this VM’s needs. Two examples of separate device groups in vSphere 8 are shown in the familiar “Add a PCIe device to a VM” dialog in the vSphere Client below.
Figure 2: Two examples of different device groups in the vSphere 8 Client
A device group in vSphere 8 can be composed of a pair of GPUs that have one or more NVLINKs between them, allowing more efficient communication between them than over the PCIe bus. Notice that the GPUs are represented within the device group names using their vGPU profiles. This dual-GPU device group, discovered automatically in the hardware by the NVIDIA host vGPU driver, is shown as the first entry in the vSphere Client’s Device Selection dialog in Figure 2 above. The “2@” in the device group name signifies two physical GPUs, represented as vGPUs. Note that the vGPU profile used in the device group name is a full-memory, time-sliced profile. A subset of the GPU memory is not allowed in a device group specification, and Multi-Instance GPU (MIG) profiles are not supported for device groups in vSphere 8. For now, the assumption is that if you are using a GPU in a device group, then you are allocating all of that device’s capacity to this VM.
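As a rough illustration of that naming convention, the sketch below simply splits a "count@profile" style group name into its GPU count and vGPU profile. The sample name is hypothetical; the exact names you see are generated by the NVIDIA host driver and shown in your own vSphere Client.

```python
# Illustrative sketch only: split a "count@profile" style device group name,
# as described above, into its parts. The sample name is hypothetical; the
# real names come from the NVIDIA host driver via the vSphere Client.
def parse_device_group(name: str) -> tuple[int, str]:
    count_str, profile = name.split("@", 1)
    return int(count_str), profile

count, profile = parse_device_group("2@grid_a100-40c")  # hypothetical name
print(f"{count} physical GPUs, each using the full-memory profile '{profile}'")
```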
A device group can also be made up of four physical GPUs on a host server (again, represented as vGPUs) that all have NVLINK connections between them. That example is not shown in the screen above. Figure 3 outlines the two forms of NVLINK connection we have mentioned. In the second example, on the right below, each GPU is connected over NVLINKs to the three other GPUs on the same server.
Figure 3: NVLINK connections between two and four physical GPUs in a server
Separately, a device group can be composed of a GPU device and a Network Interface Card that share the same PCIe Switch for communicating together, as seen in the second entry in the vSphere Client list above.
The idea is that you want those sets of devices, contained in a group, to be allocated together to your VM, rather than being assigned a random available GPU or network card from anywhere in your vSphere cluster. Device groups may only contain “full profile” time-sliced vGPUs; there is no partial GPU capability in a device group in vSphere 8.
How Device Groups Differ from Earlier Methods
In vSphere 7 and earlier, you chose to add a PCIe device to your VM, such as a GPU or a network card like an NVIDIA ConnectX-5 or ConnectX-6. These devices can also be seen from within Kubernetes pods that run in VMs, but let’s focus on the VMs to begin with. In vSphere 7, the system administrator had to identify particular GPUs by their PCIe bus identity and discover by hand that they had NVLINK connections, in order to assign them together as a pair to one VM. This choice of linked devices is made much simpler for the vSphere 8 user, with much more intelligence about these device pairings available from the hardware.
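For reference, that manual discovery was typically done with the NVIDIA driver's own tooling. The sketch below, which assumes nvidia-smi is available where it runs (on the host or in a GPU-enabled VM), shows the kind of queries an administrator would run to map PCIe bus identities and NVLINK connections by hand.

```python
# Minimal sketch of the manual, pre-vSphere-8 style discovery described above.
# Assumes nvidia-smi (from the NVIDIA driver) is available where this runs.
import subprocess

def run(cmd: list[str]) -> str:
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Map each GPU index to its PCIe bus identity.
print(run(["nvidia-smi", "--query-gpu=index,name,pci.bus_id", "--format=csv"]))

# Print the interconnect matrix: entries such as NV1/NV2 mean the two GPUs
# share NVLINK lanes, while PIX/PXB/SYS indicate PCIe-only paths.
print(run(["nvidia-smi", "topo", "-m"]))
```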
Before vSphere 8.0, you chose to “Add a PCIe device” to your VM at configuration time in the vSphere Client. At that point, the guest OS in the VM is not booted and the VM is powered off. You can add GPUs, one at a time, either in Passthrough mode or in vGPU mode, up to a maximum limit.
For vSphere 8.0, this limit is 8 vGPUs (and thus 8 full physical GPUs) allocated to one VM.
You can also add a single SR-IOV virtual function (VF) from a ConnectX-6 card in Passthrough mode to your VM. You choose the particular device in the “Edit Settings – Add New Device – PCIe Device” dialog in the vSphere Client. Notice, however, that we say “the particular device”, indicating just one device to be added at a time. That single-device addition capability is enhanced in vSphere 8 with the addition of device groups, using the same dialog sequence. So you don’t have to change the way you operate today to get to device groups.
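For completeness, that single-device addition can also be scripted. The sketch below uses the pyVmomi SDK to attach one vGPU profile to a powered-off VM; the profile name and the way you obtain the VM object are placeholders for your environment, and this covers only the single-device case, not device groups.

```python
# Hedged sketch: attach a single vGPU profile to a powered-off VM via pyVmomi.
# The vGPU profile name below is a placeholder; look up 'vm' with your own
# pyVmomi connection code. Device groups themselves are chosen in the Client.
from pyVmomi import vim

def add_vgpu(vm: vim.VirtualMachine, profile: str = "grid_a100-40c") -> None:
    backing = vim.vm.device.VirtualPCIPassthrough.VmiopBackingInfo(vgpu=profile)
    device = vim.vm.device.VirtualPCIPassthrough(backing=backing)
    dev_spec = vim.vm.device.VirtualDeviceSpec(
        operation=vim.vm.device.VirtualDeviceSpec.Operation.add,
        device=device,
    )
    spec = vim.vm.ConfigSpec(deviceChange=[dev_spec])
    vm.ReconfigVM_Task(spec)  # returns a task; wait on it in real code
```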
With vSphere 8.0, we can allocate two individual GPUs to a VM separately, in vGPU mode. But the problem is that on a server with, for example, four GPU devices, which GPUs are chosen to add to my VM? If I were to choose one GPU that happens to have no NVLINK connection to my second GPU, then I would have a potentially sub-optimal setup for my application. Prior to vSphere 8.0, I had to do a lot of manual device discovery to exercise control over this NVLINKed GPU decision. In the simpler cases, DRS chose the device for me, and I could only reference one GPU/NIC device at a time. That changes in 8.0 to handle more demanding scenarios.
vSphere 8.0 with Device Groups
Figure 4: Device groups for (a) two GPUs with NVLink and (b) a GPU and a NIC on the same PCIe switch
The device group capability in vSphere 8 solves this problem of pairing devices for best performance, both as they get assigned to a VM and as the VM gets placed onto appropriate hardware at boot-up time. The two kinds of device group above are discovered in the hardware automatically by vSphere and presented as options when you add a PCIe device to your VM. The GPU-and-NIC device group is shown as the second entry in the device selection dialog in Figure 2 above.
You can think of this as
(a) scaling up the power of your VM to two or more physically linked GPUs. This is the NVLINK case between two or four GPUs on one host, where NVLINK gives dedicated data bandwidth between the GPUs,
and
(b) scaling out your application to multiple host servers, where the GPU-to-NIC communication within each host travels over the fastest path inside that host server to get data in and out of it. That is, the GPU and the NIC are in a device group, which means the physical GPU and NIC card are connected to the same PCIe switch, as sketched in the example below.
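To make the scale-out case concrete, here is a minimal sketch of the kind of multi-node training job that benefits from the GPU and NIC sharing a PCIe switch: each GPU-enabled VM runs one worker, and the NCCL backend moves gradient traffic between hosts over the NIC. This is generic PyTorch distributed code, not anything specific to vSphere, and it assumes a launcher such as torchrun sets the usual rank and rendezvous environment variables.

```python
# Generic sketch of a distributed training worker (one per GPU-enabled VM).
# Gradient all-reduce traffic between hosts flows GPU -> NIC -> network, which
# is where the GPU-and-NIC-on-one-PCIe-switch device group helps.
# Assumes torchrun (or similar) sets RANK, WORLD_SIZE and MASTER_ADDR/PORT.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    dist.init_process_group(backend="nccl")   # NCCL uses the NIC for inter-host traffic
    model = torch.nn.Linear(1024, 1024).cuda()
    ddp_model = DDP(model)                    # wraps the model for gradient all-reduce
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    for _ in range(10):                       # stand-in training loop
        inputs = torch.randn(32, 1024, device="cuda")
        loss = ddp_model(inputs).sum()
        optimizer.zero_grad()
        loss.backward()                       # all-reduce happens here, over the NIC
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```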
User Interface for Adding a Device Group to a VM
The mechanism for expressing a VM’s need for a device group is very familiar to those who have added single PCIe devices to a VM using the vSphere Client in the past. In the “Edit Settings – Add New Device – Other Devices – PCIe Device” section of the vSphere Client, we are presented with a list of the available individual devices and device groups on the servers in that cluster. These device groups are discovered by the NVIDIA GPU Manager (the NVIDIA host driver installed in the vSphere kernel) and presented to vCenter and the vSphere Client for display.
From there, we can choose from the list of devices and device groups to specify for the VM. This list can include the applicable vGPU profiles (which refer to the GPU model) and SR-IOV virtual functions, along with the device groups seen earlier. The vSphere Client user assigns one or more of the device group entries to the VM.
Virtual Machine Placement
DRS in vSphere is responsible for VM placement onto a particular host. DRS operates at the vSphere cluster level and decides which hosts are candidates for hosting a VM based on resource availability and a set of compatibility checks.
As part of its compatibility checks, DRS determines whether a host can fulfill the contract expressed by the VM when it requests a single device or a device group of a particular kind. When those checks are done, DRS places the VM onto one of the qualifying hosts. If no suitable device group can be found in the server hardware within the cluster to fulfill the request, then the VM will not be started.
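Conceptually, the device-group part of that compatibility check behaves like a filter over the hosts in the cluster. The sketch below is purely illustrative of that idea, not VMware's DRS implementation, and the data structures and group names in it are invented for the example.

```python
# Purely illustrative sketch of the placement concept described above;
# this is NOT VMware's DRS algorithm, and the data structures are invented.
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    free_device_groups: set[str] = field(default_factory=set)  # groups with all members free

@dataclass
class VM:
    name: str
    required_device_groups: list[str] = field(default_factory=list)

def compatible_hosts(vm: VM, cluster: list[Host]) -> list[Host]:
    # A host is a candidate only if it can satisfy every device group the VM asks for.
    return [h for h in cluster
            if all(g in h.free_device_groups for g in vm.required_device_groups)]

cluster = [
    Host("esx-01", {"2@grid_a100-40c"}),              # hypothetical group names
    Host("esx-02", {"2@grid_a100-40c", "gpu+nic"}),
]
vm = VM("training-vm", ["2@grid_a100-40c"])
candidates = compatible_hosts(vm, cluster)
print([h.name for h in candidates] or "VM will not be started")
```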
vMotion and Device Groups
When a vMotion of a VM with an associated device group is done, DRS again examines the proposed target host to determine whether it can fulfill the device group requirement. If that test succeeds for the available hardware on the destination server, then the vMotion proceeds.
Summary
Users of the vSphere 8.0 Client now have the ability to express precisely which device groups in the server hardware their VMs need on a host. By placing the VM onto a hardware configuration that is ideal for its needs, as expressed in the device group contract, vSphere delivers the best-performing infrastructure to support the application within its VMs.