By Hari Sivaraman, Uday Kurkure, and Lan Vu
NVIDIA Ampere-based GPUs [1, 2] are the latest generation of GPUs from NVIDIA. NVIDIA Ampere GPUs on VMware vSphere 7 Update 2 (or later) can be shared among VMs in one of two modes: VMware’s virtual GPU (vGPU) mode or NVIDIA’s multi-instance GPU (MIG) mode. NVIDIA vGPU software is included in the NVIDIA AI Enterprise suite, which is certified for VMware vSphere.
In vGPU mode, the memory on the GPU is statically partitioned, but the compute capability is time-shared among the VMs that share the GPU. In this mode, when a VM is running on the GPU, it “owns” all the compute capability of the GPU but only has access to its share of GPU memory.
In MIG mode, the memory and computational capability are statically partitioned. When a VM uses a GPU in MIG mode, it can only access the memory assigned to it and only use the computational cores assigned to it. So, even if the remaining computational cores (that is, the cores not assigned to this VM) in the GPU are idle, the VM cannot use those idle cores.
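The difference between the two modes can be written down as a small data model. This is only an illustrative sketch: the names `GpuShare`, `vgpu_share`, and `mig_share` are hypothetical, and the figure of seven compute slices per GPU is an assumption based on the A100 MIG profiles discussed later in this post.

```python
from dataclasses import dataclass

# Assumption: the GPU exposes seven MIG compute slices (as on an A100).
TOTAL_COMPUTE_SLICES = 7


@dataclass(frozen=True)
class GpuShare:
    """What one VM can use while its work is running on the GPU."""
    memory_gb: float         # statically partitioned in both modes
    compute_fraction: float  # fraction of the GPU's cores the VM can use


def vgpu_share(profile_memory_gb: float) -> GpuShare:
    # Time-shared compute: while the VM is scheduled, it owns every core.
    return GpuShare(memory_gb=profile_memory_gb, compute_fraction=1.0)


def mig_share(profile_memory_gb: float, compute_slices: int) -> GpuShare:
    # Statically partitioned compute: cores outside the VM's slices are
    # out of reach even when they are idle.
    return GpuShare(
        memory_gb=profile_memory_gb,
        compute_fraction=compute_slices / TOTAL_COMPUTE_SLICES,
    )
```

With a 5 GB profile, `vgpu_share(5)` reports a compute fraction of 1.0 while the VM is scheduled, whereas `mig_share(5, 1)` reports one seventh of the cores at all times.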
Regardless of which mode a VM uses to execute its workload, the computational results will be the same. The only difference is in performance, measured as wall-clock time. The two modes differ in how they share the GPU's cores: vGPU mode time-shares the computational cores, whereas MIG mode statically partitions them. This difference raises the question of which mode delivers the best performance (that is, the lowest run time) for a given workload. We attempt to answer this question in a series of blogs, of which this is the first part.
We ran unit tests (see table 1) inspired by the classic CUDA matrix multiplication example, with different ratios of computation to data transfer from host to GPU and back, to determine which workloads perform better in MIG mode and which perform better in vGPU mode. Our tests used ten matrices, each of size 1000×1000.
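To give a feel for the size of this workload, the arithmetic can be sketched as follows. The element size is an assumption (single-precision floats), since the post does not state the data type, and the operation count uses the usual approximation of 2n³ floating-point operations for a dense n×n matrix multiplication.

```python
N = 1000               # matrix dimension, from the text
NUM_MATRICES = 10      # number of matrices, from the text
BYTES_PER_ELEMENT = 4  # assumption: single-precision (32-bit) floats


def matrix_bytes(n: int = N, bytes_per_element: int = BYTES_PER_ELEMENT) -> int:
    """Bytes moved when one n x n matrix is copied to or from the GPU."""
    return n * n * bytes_per_element


def matmul_flops(n: int = N) -> int:
    """Approximate floating-point operations for one dense n x n multiply."""
    return 2 * n ** 3


print(matrix_bytes())                 # 4000000 bytes, about 4 MB per matrix
print(NUM_MATRICES * matrix_bytes())  # 40000000 bytes, about 40 MB in total
print(matmul_flops())                 # 2000000000, about 2 GFLOP per multiply
```

Under these assumptions, each multiplication performs roughly 500 floating-point operations per byte of one transferred matrix, which is why the balance between transfers and kernels can be varied widely in the tests.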
The results presented in this blog show that workloads that execute heavy, large computational CUDA kernels with no interruptions for data transfers or CPU computations show better performance in MIG mode than in vGPU mode. But when the CUDA kernels are interspersed with data transfers or interruptions to execute CPU computations, vGPU mode offers better performance. So, the workload characteristics determine which GPU mode, vGPU or MIG, will deliver the best performance. In subsequent blogs, we’ll present results from real-world applications to show that the conclusions reached using unit tests apply to actual workloads.
Test Environment, Code, and Results
The tests were run on a Dell R740 (two Intel Xeon Gold 6140 CPUs, 768 GB RAM, SSD storage) with two A100 GPUs. One GPU was configured in vGPU mode and the second in MIG mode. Pseudocode for the first test is shown in figure 1 (above).
The test was run with different ratios of data transfer to compute: 0% data transfer, 10% data transfer, 20% data transfer, and so on in steps of ten up to 100% data transfer. For each ratio, we varied the number of VMs running concurrently from one to the maximum allowed for that vGPU or MIG profile. For example, with MIG profile a100-1-5c (which denotes a profile with one compute slice and 5 GB GPU memory), a maximum of seven VMs can share the GPU, so we varied the number of VMs from one through seven. MIG profile a100-2-10c supports a maximum of three VMs, so for this profile we varied the number of VMs from one through three.
The run time for executing CUDA computations in this test with 10% data transfer and 90% compute activity is shown in figure 2 (below). From the figure, we can see that when data transfers are interspersed with computations and the ratio of data transfers is 10%, vGPU mode offers better performance for the CUDA computations.
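The sweep described above can be enumerated with a few lines of Python. This is a sketch rather than the harness we used: the function name `test_configs` is hypothetical, and only the two MIG profiles named in the text are listed, with the VM limits stated there.

```python
from itertools import product

# Maximum concurrent VMs per MIG profile, as stated in the text.
MAX_VMS = {"a100-1-5c": 7, "a100-2-10c": 3}

# Data-transfer share of the workload: 0%, 10%, ..., 100%.
TRANSFER_PERCENTS = range(0, 101, 10)


def test_configs(max_vms=MAX_VMS):
    """Yield (profile, transfer_percent, vm_count) for every run in the sweep."""
    for profile, limit in max_vms.items():
        for pct, vms in product(TRANSFER_PERCENTS, range(1, limit + 1)):
            yield profile, pct, vms


configs = list(test_configs())
print(len(configs))  # 11 ratios * (7 + 3) VM counts = 110
```

Adding further profiles to `MAX_VMS` extends the sweep without changing the loop.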
In figure 3 (below), we show results from this test with a ratio of 50% data transfers to 50% compute. The graphs show the run time for the entire test and for the computations only. In figures 2 and 3, we can see that vGPU outperforms MIG mode when data transfers are interspersed with computation at data-transfer ratios of 10% and 50%. The data at other ratios (not shown here) show the same pattern: vGPU outperforms MIG mode whenever CUDA computations are interspersed with data transfers, regardless of the ratio of data transfers to CUDA computation.
In a second set of tests, we changed the code so that all the data transfer operations were completed before we ran any CUDA computations (see the pseudocode in figure 4, above). For this type of workload, MIG mode outperforms vGPU with up to about three VMs running concurrently. With more than three VMs running concurrently, there is a crossover point, and vGPU mode delivers better performance at higher consolidation (that is, with more than three VMs). Figure 5 shows data from this test. The data is truncated at four VMs to highlight the crossover point.
In a final set of tests, we changed the code so that only CUDA computations were executed. The pseudocode is shown in figure 6, below.
We ran this test with all vGPU and MIG profiles. The results in figure 7 (below) show that MIG mode offers better performance at almost all configurations and profiles with this workload.
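The three workload shapes used in these tests can be summarized as sequences of operations. The sketch below is a simplified illustration, not the pseudocode from the figures: the function names are hypothetical, each data transfer is modeled as a single generic "transfer" step, and no GPU work is actually performed.

```python
def interleaved(num_matrices: int) -> list:
    """First test: each data transfer is followed by a CUDA computation."""
    ops = []
    for i in range(num_matrices):
        ops += [("transfer", i), ("kernel", i)]
    return ops


def aggregated(num_matrices: int) -> list:
    """Second test: every data transfer completes before any computation."""
    idx = range(num_matrices)
    return [("transfer", i) for i in idx] + [("kernel", i) for i in idx]


def compute_only(num_matrices: int) -> list:
    """Final test: CUDA computations only, with no data transfers."""
    return [("kernel", i) for i in range(num_matrices)]


print(interleaved(2))   # [('transfer', 0), ('kernel', 0), ('transfer', 1), ('kernel', 1)]
print(aggregated(2))    # [('transfer', 0), ('transfer', 1), ('kernel', 0), ('kernel', 1)]
print(compute_only(2))  # [('kernel', 0), ('kernel', 1)]
```

The only difference between the first two shapes is ordering: the same work is done, but in the interleaved case the GPU's cores are repeatedly left waiting on transfers between kernels.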
Conclusion
NVIDIA A100 and A30 Tensor Core GPUs on VMware vSphere support sharing a GPU among many VMs using two modes: vGPU and MIG (A30 GPUs will be supported in an upcoming release of vSphere). In vGPU mode, memory is statically partitioned, but the CUDA computational cores are time-shared. In MIG mode, memory and the CUDA computational cores are statically partitioned. This difference in how CUDA computational cores are partitioned can cause differences in the performance achieved using vGPU mode compared to MIG mode for the exact same workload.
In this blog, we presented data from unit tests to show which workloads would show the best performance in each mode. vGPU mode shows the best performance, measured as wall-clock time to complete the task, for workloads with data transfers and/or CPU computations interspersed with CUDA computations. MIG mode shows the best performance for workloads that execute heavy, large CUDA kernels with little or no interruption for data transfers or CPU computations. For workloads with aggregated data transfers and aggregated CUDA computations, MIG mode shows the best performance for two or fewer VMs running concurrently, whereas the vGPU mode shows the best performance with three or more VMs running concurrently.
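The conclusions above can be restated as a small lookup. This is a sketch of the unit-test findings on our test bed, not a general sizing rule: the function name `suggested_mode` and the pattern labels are hypothetical, and the threshold for aggregated workloads follows the VM counts stated in this conclusion.

```python
def suggested_mode(pattern: str, concurrent_vms: int) -> str:
    """Return the mode that performed best in the unit tests for a workload shape.

    pattern is one of:
      "interleaved"  - kernels interspersed with data transfers or CPU work
      "aggregated"   - all data transfers grouped, then all kernels
      "compute_only" - heavy CUDA kernels with little or no interruption
    """
    if concurrent_vms < 1:
        raise ValueError("concurrent_vms must be at least 1")
    if pattern == "interleaved":
        return "vGPU"
    if pattern == "compute_only":
        return "MIG"
    if pattern == "aggregated":
        # MIG for two or fewer concurrent VMs, vGPU for three or more.
        return "MIG" if concurrent_vms <= 2 else "vGPU"
    raise ValueError(f"unknown pattern: {pattern!r}")


print(suggested_mode("interleaved", 4))  # vGPU
print(suggested_mode("aggregated", 2))   # MIG
print(suggested_mode("aggregated", 3))   # vGPU
```

Real applications rarely fit one shape exactly, so the result is best treated as a starting hypothesis to be confirmed by measurement.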
References
- [1] NVIDIA Ampere Architecture
https://www.nvidia.com/en-us/data-center/ampere-architecture/
- [2] NVIDIA Ampere GA102 GPU Architecture
https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf