
New Tuning Settings for DRS and vMotion for VMs with Virtual GPUs in vSphere 8.0 Update 2

This article gives a technical preview of new features in vSphere DRS and vMotion that are available in vSphere 8 Update 2.

When a vMotion happens on a VM, the VM’s dirtied memory (RAM) pages are copied over to the destination machine in phases. One of the final phases in that process causes the VM to go through a period called the “stun time”, which is ideally not noticed by the guest operating system or the application running on it. At that point, the guest operating system is stopped for a short period while the final memory page copy takes place. vSphere always works with the VM to minimize that stun time. We describe some further enhancements to that process here in the context of VMs that have vGPUs assigned to them.

When a virtual GPU (vGPU) is assigned to a VM, that VM can also be subject to vMotion, for a variety of reasons. Remember that the GPU operates asynchronously from the CPU. The GPU offloads many math calculations and manages data of its own that is separate from the CPU’s memory. In the vSphere Client for vSphere 8, you can assign multiple vGPU profiles to one VM, each representing either a full physical GPU or part of one. It is not uncommon today to see VMs with four or more full-memory vGPU profiles assigned to them (i.e. fully occupying four physical GPUs).

A common reason for a vMotion of any VM is the need to place the ESXi host server into maintenance mode, for a system upgrade or a new VIB/driver installation, for example. This evacuation of VMs from a host, using vMotion, happens automatically when you place the host into maintenance mode, and that now includes VMs that have vGPU profiles attached to them. You can find more information on automated migration when entering maintenance mode in this Knowledge Base article. There can be other reasons for a vMotion to occur too, such as load balancing a cluster. More details on that are here.

The physical GPU device that backs the vGPU profile associated with the VM has separate memory hardware of its own, called “framebuffer” memory. Some GPU models have 80 GB or more of framebuffer memory today, such as the Hopper H100. Newer models of GPU have even more framebuffer memory than that. A VM on a host may have 2, 4 or 8 such full GPUs mapped to it through the vGPU and device group mechanisms that are mentioned here. Here is a view of a VM that has two full A100 40GB devices associated with it, through the vGPU profiles. These are time-sliced profiles that allow multiple GPUs to be assigned. A VM on vSphere 8 Update 2 can have up to 16 vGPU profiles (representing full physical GPUs) assigned to it. This is to accommodate very large training jobs, such as those done with large language models.

[Screenshot: a VM with two full A100 40GB devices assigned through vGPU profiles]

When a vMotion happens, the contents of the GPU’s framebuffer memory must be copied from the source GPU to the GPU on the destination host machine. We can therefore have a lot more active memory contents to deal with than just the VM’s own RAM. With this large amount of GPU framebuffer memory to be copied over to a new host, the bandwidth of our network connection between the two hosts becomes critical.

As a first enhancement in vSphere 8 Update 2 to help with this data transfer, DRS now estimates, when it first places the VM on a host, the amount of stun time this VM would need. The estimate depends on the user-supplied vMotion network bandwidth (10 GigE or 100 GigE in the examples here) and on the size of the framebuffer memory that is allocated to the VM. The higher vMotion network bandwidth used in the right-hand example below can dramatically reduce the vMotion stun time for your vGPU-aware VM.
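To make the dependence on bandwidth and framebuffer size concrete, here is a minimal back-of-the-envelope sketch. It is not the formula DRS uses, which the article does not describe. The function name `estimate_transfer_seconds` and the `efficiency` parameter are hypothetical, and the sketch assumes decimal gigabytes and counts only the raw time to push the framebuffer contents across the link.

```python
def estimate_transfer_seconds(framebuffer_gb: float,
                              link_gbps: float,
                              efficiency: float = 1.0) -> float:
    """Rough lower bound on the time to copy GPU framebuffer memory
    over the vMotion network.

    Assumptions (not from DRS): decimal GB, and `efficiency` is the
    fraction of nominal link bandwidth actually usable (1.0 = all of it).
    """
    if framebuffer_gb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid framebuffer size, bandwidth or efficiency")
    bits_to_copy = framebuffer_gb * 8 * 1e9          # GB -> bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_copy / usable_bits_per_second


# Two full 40 GB profiles (80 GB of framebuffer) on the two example networks.
for gbps in (10, 100):
    seconds = estimate_transfer_seconds(80, gbps)
    print(f"80 GB over {gbps} GigE: about {seconds:.1f} s")
```

Even this simplified arithmetic shows a tenfold difference between the two networks, which is why the vMotion network bandwidth matters so much for VMs with large vGPU profiles.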

[Screenshot: DRS stun time estimates at two vMotion network bandwidths, with the higher bandwidth in the right-hand example]

Adjusting the Stun Time for a VM

You can now also set the vMotion Stun Time Limit that your workload can tolerate on a VM of this type, i.e., a VM that has one or more vGPU profiles assigned to it. That setting appears at the lower left of the “Edit Settings” dialog for a VM in the vSphere Client below. When the vSphere administrator raises this setting, the vSphere vMotion algorithms will allow a stun time longer than the default of 100 seconds, giving the vMotion a better chance of copying everything over to the destination machine. Getting this value right may take some experimentation in testing.

vSphere will also calculate its own estimate of the amount of vMotion stun time needed, based on the memory sizes of your vGPU profiles, and let you know if you have allocated too little time. This is seen in the yellow colored area on the right side below:
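The comparison behind that warning can be sketched as follows. This is an illustration only: vSphere’s own estimate is computed internally and is not documented in this article, so the names `StunCheck` and `check_stun_limit` are hypothetical and the estimate here is just total framebuffer size divided by link bandwidth (decimal GB assumed).

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StunCheck:
    estimate_s: float   # simplified estimate of the stun time needed
    limit_s: float      # configured vMotion Stun Time Limit

    @property
    def sufficient(self) -> bool:
        return self.limit_s >= self.estimate_s


def check_stun_limit(vgpu_framebuffer_gb: Sequence[float],
                     link_gbps: float,
                     limit_s: float = 100.0) -> StunCheck:
    """Compare a configured limit with a simplified estimate.

    Assumption: estimate = total framebuffer bits / link bits per second.
    The default limit of 100 seconds comes from the article.
    """
    if link_gbps <= 0:
        raise ValueError("link bandwidth must be positive")
    total_gb = sum(vgpu_framebuffer_gb)
    return StunCheck(estimate_s=total_gb * 8 / link_gbps, limit_s=limit_s)


# Four full 80 GB profiles with the default limit, on two networks.
for gbps in (10, 100):
    result = check_stun_limit([80, 80, 80, 80], gbps)
    status = "ok" if result.sufficient else "limit too low"
    print(f"{gbps} GigE: need about {result.estimate_s:.1f} s, "
          f"limit {result.limit_s:.0f} s -> {status}")
```

Under these assumptions the default limit is too low on the 10 GigE network, so the administrator would either raise the limit or move to a faster vMotion network.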

[Screenshot: the vMotion Stun Time Limit in the VM “Edit Settings” dialog (lower left) and the vSphere warning about an insufficient limit (right, in yellow)]

Tuning Options for DRS Activity

We have also added a set of new, advanced DRS settings to give you much more control over the DRS behavior with your vGPU-aware VMs and clusters.

Due to the extended vMotion downtimes for vGPU VMs, vCenter has until now set Load Balancer recommendations for these VMs to Manual Remediation. However, when a cluster is properly configured and has a sufficient vMotion network, VMs with smaller vGPU profiles may migrate freely within the default 100 second timeout. If the stun time introduced by the vGPUs in your VM is acceptable, you can enable automatic DRS Load Balancing and Maintenance Mode evacuations with vSphere 8.0 Update 2. Note that DRS currently does not take the load on the GPU itself into account when deciding to move a VM from one host to another for load balancing. It considers only CPU and main memory consumption.

1. Consolidating “Smaller” VMs to Hosts

VgpuVmConsolidation = 1

When set to 1, this advanced setting causes DRS to tightly pack VMs with fewer full-memory vGPU profiles (e.g. 1 vGPU) onto hosts that have extra compatible GPU capacity. This leaves more GPU capacity free on other hosts and so allows VMs with multiple full-memory vGPU devices to power on. This scenario is described in this blog article.
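The effect of consolidation can be illustrated with a toy placement model. This is not the DRS algorithm, which the article does not detail. It is a simple tightest-fit heuristic contrasted with a spreading heuristic, and the function name `place_vms` is hypothetical.

```python
from typing import List, Sequence


def place_vms(vm_gpu_counts: Sequence[int],
              host_free_gpus: Sequence[int],
              consolidate: bool) -> List[int]:
    """Place VMs one at a time and return the free GPUs left per host.

    consolidate=True  -> pick the fitting host with the FEWEST free GPUs
                         (tight packing).
    consolidate=False -> pick the fitting host with the MOST free GPUs
                         (spreading).
    """
    free = list(host_free_gpus)
    for need in vm_gpu_counts:
        candidates = [i for i, f in enumerate(free) if f >= need]
        if not candidates:
            raise RuntimeError(f"no host can fit a VM needing {need} GPU(s)")
        chooser = min if consolidate else max
        host = chooser(candidates, key=lambda i: free[i])
        free[host] -= need
    return free


# Four single-vGPU VMs on two hosts with four GPUs each.
small_vms = [1, 1, 1, 1]
print("packed:", place_vms(small_vms, [4, 4], consolidate=True))
print("spread:", place_vms(small_vms, [4, 4], consolidate=False))
```

In the packed case one host keeps all four GPUs free, so a VM that needs four full-memory vGPU profiles can still power on. In the spread case each host has only two GPUs free and that VM cannot be placed.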

2. DRS Load Balancer and vMotion

LBMaxVmotionPerHost = 1

This advanced setting, when set to 1 on a DRS cluster, causes DRS to allow one consolidation vMotion per host on any one DRS scan. Its purpose is to reduce the number of vMotions the Load Balancer conducts to achieve its consolidation goals.

3. Checks on vMotion Stun Time

PassthroughRequireDrs = 1

This advanced setting, when set to 1, performs a series of checks to ensure that the VM can be moved using a vMotion, and that its stun time during vMotion fits within the estimate calculated by DRS, at VM power-on time.

4. DRS Load Balancing and Maintenance Mode VM Evacuation

PassthroughDrsAutomation = 1

This option, when set to 1, enables automatic DRS Load Balancing and Maintenance Mode evacuations of VMs with vSphere 8.0 U2, *provided* the VM stun time introduced by the vGPUs is acceptable, according to DRS’ calculations of the required stun time.

5. Progressive De-fragmentation of GPU Resources

PassthroughForceDrsAutomation = 1

When this option is set to 1 on a DRS cluster, it will allow for Maintenance Mode Evacuation without consideration of the vMotion Stun Time Limit (100 seconds by default).

PassthroughDrsAutomation and PassthroughForceDrsAutomation can be used in conjunction with Host Maintenance Mode to progressively de-fragment GPU resources in a Cluster.
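One way to read the interaction of these two options is as a small decision function. The sketch below encodes only what the descriptions above state; it is an interpretation and not DRS source code. The function name `may_auto_migrate` and the reason strings are hypothetical.

```python
def may_auto_migrate(reason: str,
                     estimated_stun_s: float,
                     stun_limit_s: float = 100.0,
                     drs_automation: bool = False,
                     force_automation: bool = False) -> bool:
    """Return True if a vGPU VM would be migrated automatically.

    drs_automation   models PassthroughDrsAutomation = 1
    force_automation models PassthroughForceDrsAutomation = 1
    reason is "load_balance" or "maintenance_mode" (hypothetical labels).
    """
    if reason not in ("load_balance", "maintenance_mode"):
        raise ValueError(f"unknown migration reason: {reason!r}")
    # Forced automation: maintenance mode evacuation ignores the limit.
    if force_automation and reason == "maintenance_mode":
        return True
    # Regular automation: only when the estimated stun time fits the limit.
    return drs_automation and estimated_stun_s <= stun_limit_s


# A VM estimated at 250 s of stun time against the default 100 s limit.
print(may_auto_migrate("maintenance_mode", 250, drs_automation=True))
print(may_auto_migrate("maintenance_mode", 250, drs_automation=True,
                       force_automation=True))
```

Under this reading, the first call declines the automatic move because the estimate exceeds the limit, while the second allows the evacuation because the forced option bypasses the limit for Maintenance Mode.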

These new advanced options may be used together to give full DRS automation in vSphere 8 Update 2.

[Screenshot: the new DRS advanced options]

For more details on these options, see KB article 88271 and KB article 66813.
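For reference, the five options discussed above can be collected as one set of key/value pairs. The sketch below only gathers and formats them; how they are applied to a cluster (vSphere Client, an API client, or a script) is outside what this article covers. The helper name `to_option_values` and the string-valued output shape are assumptions for illustration.

```python
from typing import Dict, List

# Option names and values exactly as listed in this article.
DRS_VGPU_OPTIONS: Dict[str, int] = {
    "VgpuVmConsolidation": 1,
    "LBMaxVmotionPerHost": 1,
    "PassthroughRequireDrs": 1,
    "PassthroughDrsAutomation": 1,
    "PassthroughForceDrsAutomation": 1,
}


def to_option_values(options: Dict[str, int]) -> List[Dict[str, str]]:
    """Turn the mapping into a list of {"key": ..., "value": ...} entries,
    sorted by key, with values as strings (an assumed, generic shape)."""
    return [{"key": key, "value": str(value)}
            for key, value in sorted(options.items())]


for entry in to_option_values(DRS_VGPU_OPTIONS):
    print(f'{entry["key"]} = {entry["value"]}')
```

Keeping the set in one place makes it easier to review which options are enabled on a cluster before turning on full automation.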

Summary

With vSphere 8.0 Update 2, the system administrator can now see advice from DRS that longer vMotion stun times are needed for VMs that have vGPU profiles assigned to them – especially those vGPU profiles that capture larger GPU memory allocations.

With that guidance from vSphere, the administrator can make an educated decision on how long a vMotion stun time they are willing to tolerate for any particular VM. This allows larger GPU-consuming jobs, such as a machine learning training job, to continue without interruption even when a vMotion is needed. There is also a new set of advanced options that give you more control over how DRS manages vGPU-aware virtual machines, helping to automate more processes around these GPU workloads on vSphere 8 Update 2.