Sharing GPU for Machine Learning/Deep Learning on VMware vSphere with NVIDIA GRID: Why is it needed? And How to share GPU?

By Lan Vu, Uday Kurkure, and Hari Sivaraman

Data scientists can dedicate a GPU on vSphere to a single virtual machine for their modeling work when they need to, and certain heavier machine learning workloads may well require that dedicated approach. However, many ML workloads and user types do not use a dedicated GPU continuously at its maximum capacity. This presents an opportunity for more than one virtual machine or user to share a physical GPU. This article explores the performance of such a shared-GPU setup, supported by the NVIDIA GRID product on vSphere, and presents performance test results showing that sharing is a feasible approach. It also describes the other technical reasons for sharing a GPU among multiple VMs and gives best practices for determining how to share a GPU.

VMware vSphere supports NVIDIA GRID technology for multiple types of workloads. This technology virtualizes GPUs via a mediated passthrough mechanism. Initially, NVIDIA GRID supported GPU virtualization for graphics workloads only. Since the introduction of the Pascal GPU architecture, however, NVIDIA GRID has supported GPU virtualization for both graphics and CUDA/machine learning workloads. With this support, multiple VMs running GPU-accelerated workloads like machine learning/deep learning (ML/DL) based on TensorFlow, Keras, Caffe, Theano, Torch, and others can share a single GPU by using a vGPU provided by GRID. This brings benefits in multiple use cases that we discuss in this post.

Each vGPU is allocated a dedicated amount of GPU memory. A vGPU profile specifies how much device memory each vGPU has and the maximum number of vGPUs per physical GPU. For example, if you choose the P40-1q vGPU profile for a Pascal P40 GPU, each vGPU gets 1 GB of device memory, so you can have up to 24 VMs with vGPUs because the P40 has a total of 24 GB of device memory. More information about virtualized GPUs on vSphere can be found in our previous blog post.
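The profile arithmetic above can be sketched in a few lines of Python. This is only an illustration of the memory-based limit: the function name is hypothetical, and the per-vGPU sizes other than 1 GB are example values rather than a list of supported profiles, so check NVIDIA's documentation for the profiles actually available on your GPU.

```python
def max_vgpus_per_gpu(total_memory_gb: int, vgpu_memory_gb: int) -> int:
    """Upper bound on vGPUs per physical GPU, set by device memory alone."""
    if total_memory_gb <= 0 or vgpu_memory_gb <= 0:
        raise ValueError("memory sizes must be positive")
    # Each vGPU holds a dedicated slice of device memory, so the limit is
    # the number of whole slices that fit on the physical GPU.
    return total_memory_gb // vgpu_memory_gb


P40_MEMORY_GB = 24  # total device memory of the P40, as stated above

# 1 GB corresponds to the P40-1q profile used in this post; the other
# sizes are illustrative assumptions.
for vgpu_memory_gb in (1, 2, 4, 8):
    count = max_vgpus_per_gpu(P40_MEMORY_GB, vgpu_memory_gb)
    print(f"{vgpu_memory_gb} GB per vGPU -> up to {count} vGPUs")
```

Memory only bounds how many vGPUs can exist on a GPU. How many you should actually assign depends on how heavily the workloads use the GPU, which the rest of this post addresses.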

Figure 1: NVIDIA GRID vGPU

Why do we need to share GPUs?

Sharing GPUs can help increase system consolidation and resource utilization and reduce the deployment costs of ML/DL workloads. GPU-accelerated ML/DL workloads include training and inference tasks, and their GPU usage patterns are different. Training workloads are mostly run by data scientists and machine learning engineers during the research and development phase of an application. Because model training is just one of many tasks in ML application development, each user's need for GPUs is usually irregular. For example, data scientists do not spend the whole workday just training models because they have other things to do, like checking and answering emails, attending meetings, researching and developing new ML algorithms, collecting and cleaning data, and so on. Hence, sharing GPUs among a group of users helps increase GPU utilization without giving up much of the performance benefit of the GPU.

To illustrate this training scenario, we conducted an experiment in which 3 VMs (or 3 users) used vGPUs to share a single NVIDIA P40 GPU, and each VM ran the same ML/DL training workload at a different time. The ML workloads inside VM1 and VM2 were started at times t1 and t2, so that about 25% of the GPU execution time of VM1 and VM2 overlapped. VM3 ran its workload at t3, and it was the only GPU-based workload running in that timeframe. Figure 2 depicts this use case, in which the black dashed arrows indicate VMs accessing the GPU concurrently. If you run your applications inside containers, please also check out our previous blog post on running container-based applications inside a VM.

Figure 2: A use case of running multiple ML jobs on VMs with vGPUs

In our experiments, we used CentOS VMs, each with a P40-1q vGPU profile, 12 vCPUs, 60 GB memory, and 96 GB disk, and ran TensorFlow-based training loads on those VMs. The loads included complex language modeling using a recurrent neural network (RNN) with 1500 long short-term memory (LSTM) units per layer on the Penn Treebank (PTB) dataset [1, 2], and handwriting recognition using a convolutional neural network (CNN) on the MNIST dataset [3]. We ran the experiment on a Dell PowerEdge R740 with dual 18-core Intel Xeon Gold 6140 sockets and an NVIDIA Pascal P40 GPU.

Figure 3 and Figure 4 show the normalized training time of VM1, VM2, and VM3. VM1 and VM2 see a performance impact of 16%–23%, while VM3 sees no performance impact. In this experiment, we used the Best Effort scheduler of GRID, which means VM3 could fully utilize the GPU time during its application execution.

Figure 3: Training time of Language Modeling

Figure 4: Training time of Handwriting Recognition

For inference workloads, the performance characteristics can vary based on how frequently the GPU-based applications are used in the production environment. Less GPU-intensive workloads allow more apps running inside VMs to share a single GPU. For example, a GPU-accelerated database app and other ML/DL apps can share the same GPUs on the same vSphere host as long as their performance requirements are still met.

How many vGPUs per physical GPU is good?

Whether to share a GPU among ML/DL workloads running on multiple VMs, and how many VMs to place on each physical GPU, depends on the GPU usage of the ML applications. When users or applications do not use the GPU very frequently, as shown in the previous example, sharing the GPU can bring huge benefits because it significantly reduces hardware, operation, and management costs. In this case, you can assign more vGPUs per physical GPU. If your workloads use the GPU intensively and require continuous access to it, sharing can still bring some benefits because the execution of a GPU-based application includes CPU time, GPU time, I/O time, and so on, and sharing a GPU helps fill the gaps when applications spend time on the CPU or I/O. However, in this case, you need to assign fewer vGPUs per physical GPU.

To determine how many VMs with vGPUs to place on each physical GPU, you can base the decision on your evaluation of the usage frequency or the GPU utilization history of the applications. In the case of GRID GPUs on vSphere, you can monitor GPU utilization by using the nvidia-smi command on the vSphere hypervisor.
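As a rough sketch of what evaluating the utilization history could look like, the Python below summarizes utilization samples in the CSV form that nvidia-smi emits when run with `--query-gpu=timestamp,utilization.gpu --format=csv,noheader,nounits`. The sample values, the function name, and the 10% "busy" threshold are made-up assumptions for illustration, not measurements from our experiments.

```python
import csv
import io

# Made-up samples in the format "timestamp, GPU utilization in percent".
SAMPLE_LOG = """\
2018/01/01 10:00:00.000, 0
2018/01/01 10:00:05.000, 85
2018/01/01 10:00:10.000, 90
2018/01/01 10:00:15.000, 5
2018/01/01 10:00:20.000, 0
2018/01/01 10:00:25.000, 60
2018/01/01 10:00:30.000, 0
2018/01/01 10:00:35.000, 0
"""


def summarize_utilization(csv_text, busy_threshold=10):
    """Return mean, peak, and busy fraction of GPU utilization samples."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    # Column 0 is the timestamp, column 1 the utilization percentage.
    samples = [int(row[1]) for row in reader if row]
    if not samples:
        raise ValueError("no utilization samples found")
    busy = sum(1 for value in samples if value >= busy_threshold)
    return {
        "mean": sum(samples) / len(samples),
        "peak": max(samples),
        "busy_fraction": busy / len(samples),
    }


print(summarize_utilization(SAMPLE_LOG))
```

A low mean and a small busy fraction suggest that the GPU sits idle most of the time and can host more vGPUs, while sustained high utilization points toward fewer vGPUs per GPU.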

We also evaluated the performance of ML/DL workloads in the worst case, when all VMs use the GPU at the same time. To do this, we ran the same MNIST handwriting recognition training on multiple VMs concurrently, with their vGPUs sharing a single Pascal P40 GPU. Each VM had a P40-1q vGPU.

The experiment in this scenario is depicted in Figure 5 with the number of concurrent VMs in our test ranging from 1 to 24 VMs.

Figure 5: Running multiple ML jobs on VMs with vGPUs concurrently

Figure 6 presents the normalized training time of this experiment. As the number of concurrent ML jobs increases, the training time of each job also increases because the jobs share a single GPU. However, the training time grows more slowly than the number of VMs. For example, when 24 VMs run concurrently, the execution time increases by at most 17 times rather than 24 times or more. This means that even in the worst case, where all VMs use the GPU at the same time, we still see the benefits of GPU sharing. Please note that in the typical training use case mentioned earlier, not all users or applications use the GPU 24/7. If they do, you can simply reduce the number of vGPUs per GPU until the expected performance and consolidation are reached.
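To make the 24-VM figure concrete, the following Python sketch compares running the jobs concurrently with running them one after another on the same GPU. The function name is hypothetical, and the calculation assumes that every job has the same standalone training time, as in this experiment.

```python
def aggregate_speedup(num_vms: int, slowdown: float) -> float:
    """Throughput gain of concurrent jobs over running them back to back.

    Running num_vms identical jobs sequentially takes num_vms units of
    time. Running them concurrently takes `slowdown` units, because each
    job finishes `slowdown` times later than it would alone.
    """
    if num_vms <= 0 or slowdown <= 0:
        raise ValueError("num_vms and slowdown must be positive")
    return num_vms / slowdown


# 24 concurrent VMs with an execution time at most 17 times longer.
gain = aggregate_speedup(24, 17)
print(f"Aggregate throughput gain: {gain:.2f}x")
```

A result above 1 means the shared GPU finishes the whole batch of jobs sooner than a dedicated GPU processing them sequentially would.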

Figure 6: Training time with different numbers of VMs

vGPU scheduling

When all VMs with GPU loads run concurrently, the NVIDIA GRID manager schedules the jobs onto the GPU based on time slicing. NVIDIA GRID supports three vGPU scheduling options: Best Effort, Equal Share, and Fixed Share. The choice of a vGPU scheduling option depends on the use case. The Best Effort scheduler allocates GPU time to VMs in a round-robin fashion. In the above experiments, we used the Best Effort scheduler. In some circumstances, a VM running a GPU-intensive application may affect the performance of GPU-lightweight applications running in other VMs. To avoid such a performance impact and ensure quality of service (QoS), you can switch to the Equal Share or Fixed Share scheduler. The Equal Share scheduler ensures an equal share of GPU time for each powered-on VM. The Fixed Share scheduler gives a fixed share of GPU time to a VM based on the vGPU profile that is associated with each VM on the physical GPU.

For performance comparison, we ran the MNIST handwriting recognition training load using two schedulers, Best Effort and Equal Share, with different numbers of VMs.

Figure 7 presents the normalized training time and Figure 8 presents GPU utilization. As the number of VMs increases, Best Effort shows better performance because when a VM does not use its time slice, that time slice is assigned to another VM that needs the GPU. With Equal Share, in contrast, that time slice is always reserved for the VM even if it is not using the GPU at that moment. Therefore, the Best Effort scheduler has better GPU utilization, as shown in Figure 8.
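The difference between the two policies can be illustrated with a toy time-slice simulation in Python. This is a deliberately simplified model and not NVIDIA's actual scheduler: the function name, the policy strings, and the per-VM demands (counted in time slices) are assumptions made for this sketch.

```python
def simulate(demands, policy):
    """Simulate round-robin time slicing over VMs.

    demands: GPU work per VM, counted in time slices.
    policy: "best_effort" skips VMs with no pending work;
            "equal_share" keeps their slices reserved (and idle).
    Returns (makespan_in_slices, gpu_utilization).
    """
    if policy not in ("best_effort", "equal_share"):
        raise ValueError(f"unknown policy: {policy}")
    remaining = list(demands)
    elapsed = 0   # time slices that have passed so far
    makespan = 0  # slice at which the last job finished
    while any(remaining):
        for vm in range(len(remaining)):
            if remaining[vm] > 0:
                remaining[vm] -= 1
                elapsed += 1
                if remaining[vm] == 0:
                    makespan = elapsed
            elif policy == "equal_share":
                elapsed += 1  # reserved slice passes unused
            # best_effort: the idle VM is skipped at no cost
    utilization = sum(demands) / makespan if makespan else 0.0
    return makespan, utilization


demands = [4, 1, 1]  # one heavy job and two light ones (made-up values)
for policy in ("best_effort", "equal_share"):
    makespan, utilization = simulate(demands, policy)
    print(f"{policy}: {makespan} slices, utilization {utilization:.0%}")
```

In this model the heavy job finishes later under Equal Share because the slices reserved for the two finished VMs go unused. Those reserved slices are also what keep a GPU-intensive VM from taking GPU time away from lighter ones.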

Figure 7: Training time of Best Effort vs. Equal Share

Figure 8: GPU utilization of Best Effort vs. Equal Share

Takeaways

Sharing a GPU among VMs using NVIDIA GRID can help increase the consolidation of VMs with vGPUs and reduce hardware, operation, and management costs.
The performance impact of sharing a GPU is small in typical use cases, when users use the GPU infrequently.
The choice of how many vGPUs to assign per GPU should be based on the real ML/DL load. For infrequent and lightweight GPU workloads, you can assign multiple vGPUs per GPU. For workloads that use the GPU frequently, you should lower the number of vGPUs per GPU until the performance requirement is met.

Acknowledgments

We would like to thank Aravind Bappanadu, Juan Garcia-Rovetta, Bruce Herndon, Don Sullivan, Charu Chaubal, Mohan Potheri, Gina Rosenthal, Justin Murray, and Ziv Kalmanovich for their support of this work, and Julie Brodeur for her help in reviewing this blog post and for her recommendations.

References

[1] Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals, “Recurrent Neural Network Regularization,” arXiv:1409.2329, 2014.

[2] Ann Taylor, Mitchell Marcus, Beatrice Santorini, “The Penn Treebank: An Overview,” in Treebanks: The State of the Art in Syntactically Annotated Corpora, ed. Anne Abeille, Kluwer, 2003.

[3] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, 86(11):2278-2324, November 1998.