
Sharing GPU for Machine Learning/Deep Learning on VMware vSphere with NVIDIA GRID: Why is it needed? And How to share GPU?

By Lan Vu, Uday Kurkure, and Hari Sivaraman 

Data scientists can, when they need to, use GPUs on vSphere that are dedicated to a single virtual machine for their modeling work. Certain heavier machine learning workloads may well require that dedicated approach. However, many ML workloads and user types do not use a dedicated GPU continuously at its maximum capacity. This presents an opportunity to share a physical GPU among more than one virtual machine/user. This article explores the performance of such a shared-GPU setup, supported by the NVIDIA GRID product on vSphere, and presents performance test results showing that sharing is a feasible approach. Other technical reasons for sharing a GPU among multiple VMs are also described, along with best practices for determining how a GPU may be shared.

VMware vSphere supports NVIDIA GRID technology for multiple types of workloads. This technology virtualizes GPUs via a mediated passthrough mechanism. Initially, NVIDIA GRID supported GPU virtualization for graphics workloads only. But since the introduction of the Pascal GPU architecture, NVIDIA GRID has supported GPU virtualization for both graphics and CUDA/machine learning workloads. With this support, multiple VMs running GPU-accelerated workloads like machine learning/deep learning (ML/DL) based on TensorFlow, Keras, Caffe, Theano, Torch, and others can share a single GPU by using a vGPU provided by GRID. This brings benefits in multiple use cases that we discuss in this post.

Each vGPU is allocated a dedicated amount of GPU memory, and a vGPU profile specifies how much device memory each vGPU has and the maximum number of vGPUs per physical GPU. For example, if you choose the P40-1q vGPU profile for a Pascal P40 GPU, you can have up to 24 VMs with vGPU because the P40 has a total of 24 GB of device memory. More information about virtualized GPUs on vSphere can be found in our previous blog post.
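As a simple illustration of how the profile's frame buffer bounds consolidation, here is a small Python sketch. The P40-2q and P40-4q entries follow the vGPU profile naming convention (the number is the per-vGPU frame buffer in GB) and are included only as an assumption for illustration; only the P40-1q example comes from the text above.

```python
# Minimal sketch: how a vGPU profile's frame-buffer size bounds the number of
# VMs that can share one physical GPU. Only the P40-1q case is from the post;
# the other profile sizes are assumed from the naming convention.

PHYSICAL_GPU_MEMORY_GB = 24          # NVIDIA Pascal P40 device memory

VGPU_PROFILES_GB = {                 # per-vGPU frame buffer
    "P40-1q": 1,
    "P40-2q": 2,                     # assumed
    "P40-4q": 4,                     # assumed
}

def max_vgpus(profile: str) -> int:
    """Upper bound on vGPUs (and therefore VMs) per physical GPU."""
    return PHYSICAL_GPU_MEMORY_GB // VGPU_PROFILES_GB[profile]

for profile in VGPU_PROFILES_GB:
    print(f"{profile}: up to {max_vgpus(profile)} VMs per P40")
```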

Figure 1: NVIDIA GRID vGPU 

Why do we need to share GPUs?

Sharing GPUs can help increase system consolidation and resource utilization, and reduce the deployment costs of ML/DL workloads. GPU-accelerated ML/DL workloads include training and inference tasks, and their GPU usage patterns differ. Training workloads are mostly run by data scientists and machine learning engineers during the research and development phase of an application. Because model training is just one of many tasks in ML application development, each user's need for GPUs is usually irregular. For example, a data scientist does not spend the whole workday training models; he/she also has other things to do, like checking and answering emails, attending meetings, researching and developing new ML algorithms, collecting and cleaning data, and so on. Hence, sharing GPUs among a group of users helps increase GPU utilization without giving up much of the performance benefit of the GPU.

To illustrate this scenario of using a GPU for training, we conducted an experiment in which 3 VMs (or 3 users) shared a single NVIDIA P40 GPU through vGPU, and each VM ran the same ML/DL training workload at a different time. The ML workloads inside VM1 and VM2 were run at times t1 and t2, so that about 25% of the GPU execution time of VM1 and VM2 overlapped. VM3 ran its workload at t3, and it was the only GPU-based workload running in that timeframe. Figure 2 depicts this use case; the black dashed arrows indicate when VMs access the GPU concurrently. If you run your applications inside containers, please also check out our previous blog post on running container-based applications inside a VM.

Figure 2: A use case of running multiple ML jobs on VMs with vGPUs

In our experiments, we used CentOS VMs with the P40-1q vGPU profile, 12 vCPUs, 60 GB of memory, and a 96 GB disk, and ran TensorFlow-based training loads on those VMs: complex language modeling using a recurrent neural network (RNN) with 1,500 long short-term memory (LSTM) units per layer on the Penn Treebank (PTB) dataset [1, 2], and handwriting recognition using a convolutional neural network (CNN) on the MNIST dataset [3]. We ran the experiment on a Dell PowerEdge R740 with two 18-core Intel Xeon Gold 6140 processors and an NVIDIA Pascal P40 GPU.
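The exact benchmark code is not part of this post; the following minimal TensorFlow/Keras sketch shows the kind of MNIST CNN training job each VM ran, with layer sizes, batch size, and epoch count chosen only for illustration.

```python
# Minimal TensorFlow/Keras sketch of an MNIST handwriting-recognition CNN,
# comparable in spirit to the training workload described above. Layer sizes,
# batch size, and epoch count are illustrative choices, not the benchmark's.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Inside each VM, this training loop runs on the vGPU exposed by GRID.
model.fit(x_train, y_train, batch_size=128, epochs=5,
          validation_data=(x_test, y_test))
```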

Figure 3 and Figure 4 show the normalized training time of VM1, VM2, and VM3: VM1 and VM2 see a performance impact of 16%–23%, while VM3 sees no performance impact. In this experiment, we used the Best Effort scheduler of GRID, which means VM3 fully utilized the GPU during its application's execution.

Figure 3: Training time of Language Modeling

Figure 4: Training time of Handwriting Recognition 

For inference workloads, the performance characteristics can vary based on how frequently the GPU-based applications are used in the production environment. Less intensive GPU workloads allow more apps, running inside VMs, to share a single GPU. For example, a GPU-accelerated database app and other ML/DL apps can share the same GPUs on the same vSphere host as long as their performance requirements are still met.

How many vGPUs per physical GPU are appropriate?

The decision to share a GPU among ML/DL workloads running on multiple VMs, and how many VMs to place per physical GPU, depends on the GPU usage of the ML applications. When users or applications do not use the GPU very frequently, as in the previous example, sharing the GPU can bring huge benefits because it significantly reduces hardware, operation, and management costs. In this case, you can assign more vGPUs per physical GPU. If your workloads use the GPU intensively and require continuous access to it, sharing can still bring some benefit because GPU-based application execution includes CPU time, GPU time, I/O time, and so on, and sharing a GPU helps fill the gaps when applications spend time on CPU or I/O. In this case, however, you need to assign fewer vGPUs per physical GPU.

To determine how many VMs with vGPU to place per physical GPU, you can base the decision on the usage frequency or the GPU utilization history of the applications. In the case of a GRID GPU on vSphere, you can monitor GPU utilization by using the nvidia-smi command on the vSphere hypervisor.
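As a rough illustration, the following Python sketch polls nvidia-smi for utilization samples. On ESXi you would typically run nvidia-smi directly in the host shell; the wrapper, sampling interval, and sample count here are arbitrary choices for illustration.

```python
# Minimal sketch: periodically sample GPU utilization and memory use with
# nvidia-smi to gauge how busy a GPU really is before deciding how many
# vGPUs to place on it.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=timestamp,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader"]

def sample(interval_s: int = 60, samples: int = 10):
    for _ in range(samples):
        out = subprocess.check_output(QUERY, text=True).strip()
        print(out)   # one CSV line per GPU, e.g. "<time>, 35 %, 2048 MiB, 24449 MiB"
        time.sleep(interval_s)

if __name__ == "__main__":
    sample()
```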

We also evaluated the performance of ML/DL workloads in the worst case, when all VMs use the GPU at the same time. To do this, we ran the same MNIST handwriting recognition training concurrently on multiple VMs, each with a vGPU, sharing a single Pascal P40 GPU. Each VM had a P40-1q vGPU.

The experiment is depicted in Figure 5, with the number of concurrent VMs in our test ranging from 1 to 24.

Figure 5: Running multiple ML jobs on VMs with vGPUs concurrently 

Figure 6 presents the normalized training time for this experiment. As the number of concurrent ML jobs increases, the training time of each job also increases because they share a single GPU. However, the training time does not grow as fast as the number of VMs. For example, with 24 VMs running concurrently, the execution time increases at most 17 times instead of 24 times or more. This means that even in the worst case, where all VMs use the GPU at the same time, we still see the benefits of GPU sharing. Please note that in the typical training use case mentioned earlier, not all users or applications use the GPU 24/7. If they do, you can simply reduce the number of vGPUs per GPU until the expected performance and consolidation are reached.

Figure 6: Training time with different numbers of VMs
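To make that worst-case benefit concrete, here is the back-of-the-envelope arithmetic implied by the numbers quoted above (24 concurrent VMs, at most a 17x increase in per-job training time); it is only a restatement of those figures.

```python
# Back-of-the-envelope arithmetic for the worst case quoted above:
# 24 VMs sharing one GPU, each job slowed down by at most ~17x.
vms = 24
observed_slowdown = 17          # worst-case per-job slowdown from Figure 6

# If there were no benefit to sharing, the slowdown would equal the VM count.
# Aggregate throughput relative to a single dedicated-GPU VM:
aggregate_throughput = vms / observed_slowdown
print(f"~{aggregate_throughput:.1f}x the work of one dedicated-GPU VM")  # ~1.4x
```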

vGPU scheduling

When all VMs with GPU loads run concurrently, the NVIDIA GRID manager schedules the jobs onto the GPU based on time slicing. NVIDIA GRID supports three vGPU scheduling options: Best Effort, Equal Share, and Fixed Share. The choice of scheduling option depends on the use case. The Best Effort scheduler allocates GPU time to VMs in a round-robin fashion; we used it in the experiments above. In some circumstances, a VM running a GPU-intensive application may affect the performance of a GPU-lightweight application running in another VM. To avoid such a performance impact and ensure quality of service (QoS), you can switch to the Equal Share or Fixed Share scheduler. The Equal Share scheduler ensures an equal share of GPU time for each powered-on VM. The Fixed Share scheduler gives each VM a fixed share of GPU time based on the vGPU profile associated with the VMs on that physical GPU.

For a performance comparison, we ran the MNIST handwriting recognition training load under two schedulers, Best Effort and Equal Share, for different numbers of VMs.

Figure 7 presents the normalized training time and Figure 8 presents GPU utilization. As the number of VMs increases, Best Effort shows better performance because when a VM does not use its time slice, that time slice is assigned to another VM that needs the GPU. With Equal Share, by contrast, the time slice remains reserved for the VM even if it is not utilizing the GPU at that moment. As a result, the Best Effort scheduler also achieves better GPU utilization, as shown in Figure 8.

Figure 7: Training time of Best Effort vs. Equal Share

Figure 8: GPU utilization of Best Effort vs. Equal Share 

Takeaways

  • Sharing a GPU among VMs using NVIDIA GRID can help increase the consolidation of VMs with vGPU and reduce hardware, operation, and management costs. 
  • The performance impact of sharing a GPU is small in typical use cases where the GPU is used infrequently by users. 
  • The choice of how many vGPUs per GPU should be based on the real ML/DL load. For infrequent and lightweight GPU workloads, you can assign multiple vGPUs per GPU. For workloads that use the GPU frequently, lower the number of vGPUs per GPU until the performance requirement is met.

Acknowledgments

We would like to thank Aravind Bappanadu, Juan Garcia-Rovetta, Bruce Herndon, Don Sullivan, Charu Chaubal, Mohan Potheri, Gina Rosenthal, Justin Murray, Ziv Kalmanovich for their support of this work and thank Julie Brodeur for her help in reviewing and recommendations for this blog post.

References

[1] Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals, “Recurrent Neural Network Regularization,” arXiv:1409.2329, 2014.

[2] Ann Taylor, Mitchell Marcus, Beatrice Santorini, “The Penn Treebank: An Overview,” in Treebanks: The State of the Art in Syntactically Annotated Corpora, ed. Anne Abeillé, Kluwer, 2003.

[3] Yann LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 86(11):2278–2324, November 1998.


VMware Speedily Resolves Customer Issues in vSAN Performance Using AI

We in VMware’s Performance team create and maintain various tools to help troubleshoot customer issues. Among these is a new tool that uses artificial intelligence to pinpoint storage problems in vast amounts of log data far more quickly: what used to take us days now takes seconds. PerfPsychic analyzes storage system performance and finds performance bottlenecks using deep learning algorithms.

Let’s examine the benefit that the artificial intelligence (AI) models in PerfPsychic bring when we troubleshoot vSAN performance issues. It takes our trained AI module less than 1 second to analyze a vSAN log and pinpoint performance bottlenecks, at an accuracy rate of more than 91%. In contrast, when analyzed manually, a vSAN service request (SR) ticket takes a seasoned performance engineer about one week to deescalate, with durations ranging from 3 to 14 days. AI also wins over traditional analysis algorithms, raising the accuracy rate from around 80% to more than 90%.

Architecture

There are two operation modes in the AI module: an offline training mode and a real-time prediction mode. In training mode, sets of training data labeled with their performance issues are automatically fed to all potential convolutional neural network (CNN) [1] structures, which we train repeatedly on GPU-enabled servers. We train thousands of models at a time and promote the one that achieves the best accuracy to the real-time system. In prediction mode, unlabeled user data are sent to the model chosen in the training stage, which provides a prediction of the root cause (the faulty component).

As shown in Figure 1, data in both training and prediction modes are first sent to a data preparation module (Queried Data Preparation), where they are formatted for later stages. The data path then diverges. Let’s first follow the dashed line, the path of labeled training data. They are sent to the deep learning training module (DL Model Training) to train an ensemble of thousands of CNNs generated from our carefully designed structures. After going through all the training data thousands of times, and after the training accuracy rate has converged to a stable value, the trained CNNs compete with each other in the deep learning model selection module (DL Model Selection), where they are asked to predict the root causes of testing data the models have never seen before. Their predictions are compared to the real root causes, labeled by human engineers, to calculate the testing accuracy rate. Finally, we provide an ensemble of models (Trained DL Model) that achieve the best testing accuracy to the real-time prediction system.

Figure 1: Deep Learning Module Workflow
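PerfPsychic’s training code is not shown in this post; purely as an illustration of the “train many candidates, keep the most accurate” loop described above, here is a minimal TensorFlow/Keras sketch. The candidate architectures, input shape, data loaders, and hyperparameters are hypothetical placeholders, not PerfPsychic’s own.

```python
# Minimal sketch of a "train many candidate CNNs, keep the best" selection loop.
# Architectures, input shape, and hyperparameters are illustrative placeholders.
import tensorflow as tf

def build_candidate(num_filters: int, dense_units: int, num_classes: int):
    return tf.keras.Sequential([
        tf.keras.layers.Conv1D(num_filters, 3, activation="relu",
                               input_shape=(256, 8)),   # (time steps, counters) - assumed
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(dense_units, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

def select_best_model(x_train, y_train, x_test, y_test, num_classes):
    """Train each candidate, then keep the one with the best test accuracy."""
    best_model, best_acc = None, 0.0
    for num_filters in (16, 32, 64):
        for dense_units in (32, 64):
            model = build_candidate(num_filters, dense_units, num_classes)
            model.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])
            model.fit(x_train, y_train, epochs=20, batch_size=64, verbose=0)
            _, acc = model.evaluate(x_test, y_test, verbose=0)
            if acc > best_acc:
                best_model, best_acc = model, acc
    return best_model, best_acc
```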

This training process is both time consuming and resource hungry, so it is carried out offline on servers equipped with powerful GPUs. In contrast, prediction mode is relatively lightweight and is suitable for real-time applications.

Following the solid line in Figure 1 for prediction mode, the unlabeled, normalized user data are sent to our carefully picked models, and the root cause (Performance Exception) is predicted with only a small amount of computation. The prediction is returned to the upper layer, such as our interactive analytic web UI, automated analysis, or proactive analysis applications. The web UI also provides a means of manually validating the prediction, which automatically triggers the next round of model training. This completes the feedback loop and ensures our models continue to learn from human feedback.
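To make the prediction path and feedback loop concrete, here is a minimal sketch; the component labels, function names, and feedback store are hypothetical placeholders, not PerfPsychic’s actual interfaces.

```python
# Minimal sketch of the real-time prediction path and the human-feedback loop
# described above. All names here are hypothetical placeholders.
import numpy as np
import tensorflow as tf

COMPONENTS = ["vsan", "esxi", "physical_network", "virtual_network"]  # illustrative

def predict_root_cause(model: tf.keras.Model, normalized_sample: np.ndarray) -> str:
    """Return the predicted faulty component for one prepared log sample."""
    probs = model.predict(normalized_sample[None, ...], verbose=0)[0]
    return COMPONENTS[int(np.argmax(probs))]

def record_feedback(sample: np.ndarray, engineer_label: str, feedback_store: list):
    """Store a human-validated label so it joins the next training round."""
    feedback_store.append((sample, engineer_label))
```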

AI Wins Over Manual Debugging

Diagnosing performance problems in a software-defined datacenter (SDDC) is difficult due to both the scale of the systems and the scale of the data. The scale of the software and hardware systems results in complicated behaviors that are not only workload-dependent but also interfere with each other, so pinpointing a root cause requires thorough examination of the entire datacenter. And because of the scale of data collected across a datacenter, this analysis requires significant human effort, takes an extremely long time, and is prone to error. Take vSAN for example: dealing with performance-related escalations typically requires cross-departmental efforts examining vSAN stacks, ESXi stacks, and physical/virtual network stacks. In some cases, it has taken many engineers months to pinpoint problems outside of the VMware stack, such as physical network misconfigurations. On average, it takes one week to deescalate a client’s service request ticket with many experienced engineers working together.

PerfPsychic is designed to address the challenges we have faced and to make performance diagnostics more scalable. PerfPsychic builds on a data infrastructure that is at least 10 times faster and 100 times more scalable than the existing one. It provides an end-to-end interactive analytic UI that lets users perform the majority of the analysis in one place. The analysis results are then immediately fed back to the deep learning pipeline in the backend, which produces diagnostic models that detect a faulty component more accurately as more feedback is collected. These models mostly take only a few hours to train and can detect faulty components in a given dataset in a few milliseconds, with accuracy comparable to rules that took us months to tune manually.

AI Wins Over Traditional Algorithms

To prove the effectiveness of our AI approach, we tested it against traditional machine learning algorithms.

First, we created two datasets: training data and testing data, as summarized in Table 1.

Table 1: Training and Testing Data Properties

Training data are generated from our simulated environment: a simple 4-node hybrid vSAN setup. We manually insert performance errors into this environment to collect training data with accurate labels. In the example of a network issue, we simulate packet drops by having the vmkernel drop one receiving packet at the VMK TCP/IP layer for every N packets, which mimics the behavior of packet drops in the physical network; we vary N to produce enough data points for training. Although this does not perfectly reproduce what happens in a customer environment, it is still a best practice since it is the only cost-effective way to get a large volume of labeled data that are clean and accurate.

The testing data, in contrast to the training data, come entirely from customer escalations, which have very different system configurations in many respects (number of hosts, types and numbers of disks, workloads, and so on). In our testing data, 78.1% of the data are labeled with performance issues. Note that a “performance issue” refers to a specific component in the system that is causing the performance problem in the dataset. We define “accuracy” as the percentage of components from the testing datasets to which the model assigns the correct label (“issue” or “no issue”).

With the same training data, we trained one CNN and four popular machine learning models: Support Vector Machine (SVM) [2], Logistic Classification (LOG) [3], Multi-layer Perceptron Neural Network (MLP) [4], and Multinomial Naïve Bayes (MNB) [5]. We then tested the five models against the testing dataset. To quantify model performance, we calculate accuracy as the number of correctly labeled components divided by the total number of components in the testing datasets.
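The four baselines have standard implementations in scikit-learn; purely for illustration, the comparison on the same split might be sketched as follows. Data loading, feature preparation, and hyperparameters are placeholders, not PerfPsychic’s actual pipeline.

```python
# Minimal sketch of comparing the four traditional baselines named above on
# the same training/testing split. Inputs and hyperparameters are placeholders.
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def compare_baselines(x_train, y_train, x_test, y_test):
    baselines = {
        "SVM": SVC(),
        "LOG": LogisticRegression(max_iter=1000),
        "MLP": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
        "MNB": MultinomialNB(),      # expects non-negative feature values
    }
    for name, model in baselines.items():
        model.fit(x_train, y_train)
        acc = accuracy_score(y_test, model.predict(x_test))
        print(f"{name}: accuracy = {acc:.3f}")
```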

Finally, we compared the accuracy rates achieved by each model, which are shown in Figure 2. The results reveal that AI is a clear winner, with 91% accuracy.

Figure 2: Analytic Algorithm Accuracy Comparison

Acknowledgments

We appreciate the assistance and feedback from Chien-Chia Chen, Amitabha Banerjee, and Xiaobo Huang. We are also grateful for the support of our manager, Rajesh Somasundaran. Lastly, we thank Julie Brodeur for her help in reviewing and making recommendations for this blog post.

References

  1. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, abs/1409.4842, 2014.
  2. A. J. Smola, B. Schölkopf, “A Tutorial on Support Vector Regression,” Statistics and Computing, Volume 14, Issue 3, August 2004, pp. 199–222.
  3. C. Bishop, “Pattern Recognition and Machine Learning,” Chapter 4.3.4.
  4. D. E. Rumelhart, G. E. Hinton, R. J. Williams, “Learning representations by back-propagating errors,” http://www.iro.umontreal.ca/~pift6266/A06/refs/backprop_old.pdf.
  5. H. Zhang, “The optimality of Naive Bayes,” Proc. FLAIRS, 2004, http://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf.