
Tag Archives: deep learning

Sharing GPUs for Machine Learning/Deep Learning on VMware vSphere with NVIDIA GRID: Why Is It Needed, and How Do You Share a GPU?

By Lan Vu, Uday Kurkure, and Hari Sivaraman 

Data scientists may use GPUs on vSphere that are dedicated to a single virtual machine for their modeling work, and certain heavier machine learning workloads may well require that dedicated approach. However, many ML workloads and user types do not use a dedicated GPU continuously at its maximum capacity. This presents an opportunity for shared use of a physical GPU by more than one virtual machine/user. This article explores the performance of such a shared-GPU setup, supported by the NVIDIA GRID product on vSphere, and presents performance test results showing that sharing is a feasible approach. It also describes other technical reasons for sharing a GPU among multiple VMs and gives best practices for deciding how the sharing may be done.

VMware vSphere supports NVIDIA GRID technology for multiple types of workloads. This technology virtualizes GPUs via a mediated passthrough mechanism. Initially, NVIDIA GRID supported GPU virtualization for graphics workloads only, but since the introduction of the Pascal GPU architecture, it has supported GPU virtualization for both graphics and CUDA/machine learning workloads. With this support, multiple VMs running GPU-accelerated workloads like machine learning/deep learning (ML/DL) based on TensorFlow, Keras, Caffe, Theano, Torch, and others can share a single GPU by using a vGPU provided by GRID. This brings benefits in multiple use cases that we discuss in this post.

Each vGPU is allocated a dedicated amount of GPU memory, and a vGPU profile specifies how much device memory each vGPU gets as well as the maximum number of vGPUs per physical GPU. For example, if you choose the P40-1q vGPU profile for the Pascal P40 GPU, you can have up to 24 VMs with vGPUs because the P40 has a total of 24 GB of device memory. More information about virtualized GPUs on vSphere can be found in our previous blog post.

Figure 1: NVIDIA GRID vGPU 
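As a back-of-the-envelope illustration of how a profile's frame-buffer size determines VM density, the short sketch below divides the P40's 24 GB of device memory by the per-vGPU frame buffer. The profile list is only an example; check the NVIDIA GRID documentation for the profiles supported by your GPU and driver.

```python
# Back-of-the-envelope sketch: how many vGPU-enabled VMs fit on one
# physical GPU for a given vGPU profile. The profile names and frame-
# buffer sizes below follow the P40 example in the text; consult the
# NVIDIA GRID documentation for the authoritative list.

P40_DEVICE_MEMORY_GB = 24

# frame buffer (GB) assigned to each vGPU under a few example profiles
profiles = {
    "P40-1q": 1,
    "P40-2q": 2,
    "P40-4q": 4,
    "P40-24q": 24,
}

for name, frame_buffer_gb in profiles.items():
    max_vgpus = P40_DEVICE_MEMORY_GB // frame_buffer_gb
    print(f"{name}: up to {max_vgpus} vGPU-enabled VM(s) per P40")
```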

Why do we need to share GPUs?

Sharing GPUs can help increase system consolidation and resource utilization, and reduce the deployment costs of ML/DL workloads. GPU-accelerated ML/DL workloads include training and inference tasks, and their GPU usage patterns differ. Training workloads are mostly run by data scientists and machine learning engineers during the research and development phase of an application. Because model training is just one of many tasks in ML application development, each user's need for the GPU is usually irregular. For example, a data scientist does not spend the whole workday just training models; he or she also checks and answers email, attends meetings, researches and develops new ML algorithms, collects and cleans data, and so on. Hence, sharing GPUs among a group of users helps increase GPU utilization without giving up much of the performance benefit of the GPU.

To illustrate this scenario of using the GPU for training, we conducted an experiment in which 3 VMs (3 users) used vGPUs to share a single NVIDIA P40 GPU, and each VM ran the same ML/DL training workload at a different time. The ML workloads inside VM1 and VM2 were run at times t1 and t2, so that about 25% of the GPU execution time of VM1 and VM2 overlapped. VM3 ran its workload at t3 and was the only GPU-based workload running in that timeframe. Figure 2 depicts this use case, in which the black dashed arrows indicate when VMs access the GPU concurrently. If you run your applications inside containers, please also check out our previous blog post on running container-based applications inside a VM.

Figure 2: A use case of running multiple ML jobs on VMs with vGPUs

In our experiments, we used CentOS VMs with the P40-1q vGPU profile, 12 vCPUs, 60 GB of memory, and a 96 GB disk, and ran TensorFlow-based training loads on those VMs: complex language modeling using a recurrent neural network (RNN) with 1500 long short-term memory (LSTM) units per layer on the Penn Treebank (PTB) dataset [1, 2], and handwriting recognition using a convolutional neural network (CNN) on the MNIST dataset [3]. We ran the experiment on a Dell PowerEdge R740 with two 18-core Intel Xeon Gold 6140 processors and an NVIDIA Pascal P40 GPU.
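For readers who want a feel for this kind of workload, the sketch below shows a minimal TensorFlow/Keras CNN trained on MNIST. It is an illustrative stand-in, not the exact benchmark code we ran; the model shape and hyperparameters are our own choices, and it assumes TensorFlow is installed in the VM.

```python
# Minimal MNIST CNN training sketch (illustrative stand-in for the
# handwriting-recognition workload described in the text).
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0   # shape (N, 28, 28, 1)
x_test = x_test[..., None].astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training runs on the (v)GPU if one is visible to TensorFlow.
model.fit(x_train, y_train, batch_size=128, epochs=5,
          validation_data=(x_test, y_test))
```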

Figure 3 and Figure 4 show the normalized training times of VM1, VM2, and VM3, in which VM1 and VM2 see a performance impact of 16%–23%, while VM3 sees no impact. In this experiment, we used GRID's Best Effort scheduler, which means VM3 fully utilized the GPU during its application execution.

Figure 3: Training time of Language Modeling

Figure 4: Training time of Handwriting Recognition 

For inference workloads, the performance characteristics can vary based on how frequently the GPU-based applications are used in the production environment. Less intensive GPU workloads allow more apps running inside VMs to share a single GPU. For example, a GPU-accelerated database app and other ML/DL apps can share the same GPUs on the same vSphere host if their performance requirements are still met.

How many vGPUs per physical GPU is a good number?

The decision to share a GPU among ML/DL workloads running on multiple VMs, and how many VMs to place per physical GPU, depends on the GPU usage of the ML applications. When users or applications do not use the GPU very frequently, as shown in the previous example, sharing the GPU can bring huge benefits because it significantly reduces hardware, operation, and management costs. In this case, you can assign more vGPUs per physical GPU. If your workloads use the GPU intensively and require continuous access to it, sharing can still bring some benefit, because GPU-based application execution includes CPU time, GPU time, I/O time, and so on, and sharing a GPU helps fill the gaps when applications spend time on CPU or I/O. However, in this case, you need to assign fewer vGPUs per physical GPU.

To determine how many VMs with vGPUs to place per physical GPU, base the decision on your evaluation of usage frequency or on the GPU utilization history of the applications. In the case of GRID GPUs on vSphere, you can monitor GPU utilization by running the nvidia-smi command on the vSphere hypervisor.
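As a starting point for such an evaluation, a small script like the following can periodically sample GPU utilization through nvidia-smi. This is a minimal sketch that assumes nvidia-smi is on the PATH and supports the standard --query-gpu interface; on a GRID host you may prefer the vGPU-specific nvidia-smi queries, so verify the available fields against your driver version.

```python
# Sketch: sample GPU utilization over time with nvidia-smi so you can
# judge how often the GPU is actually busy before deciding on vGPU density.
# Assumes nvidia-smi is on the PATH; available fields vary by driver version.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=timestamp,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

def sample(interval_s=60, samples=60):
    for _ in range(samples):
        out = subprocess.check_output(QUERY, text=True).strip()
        print(out)   # e.g. "2018/09/01 10:00:00.000, 37, 2150"
        time.sleep(interval_s)

if __name__ == "__main__":
    sample()
```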

We also evaluated the performance of ML/DL workloads in the worst case, when all VMs use the GPU at the same time. To do this, we ran the same MNIST handwriting-recognition training on multiple VMs, each with a vGPU, concurrently sharing a single Pascal P40 GPU. Each VM had a P40-1q vGPU profile.

The experiment in this scenario is depicted in Figure 5 with the number of concurrent VMs in our test ranging from 1 to 24 VMs.  

Figure 5: Running multiple ML jobs on VMs with vGPUs concurrently 

Figure 6 presents the normalized training time of this experiment. As the number of concurrent ML jobs increases, the training time of each job also increases because they share a single GPU. However, the increase in training time is not as fast as the increase in the number of VMs. For example, when 24 VMs run concurrently, the execution time increases at most 17 times instead of 24 times or more. This means that even in the worst case, where all VMs use the GPU at the same time, we still see the benefits of GPU sharing. Please note that, in the typical training use case mentioned earlier, not all users or applications use the GPU 24/7. If they do, you can simply reduce the number of vGPUs per GPU until the expected performance and consolidation are reached.

Figure 6: Training time with different numbers of VMs

vGPU scheduling

When all VMs with GPU loads run concurrently, the NVIDIA GRID manager schedules the jobs on the GPU using time slicing. NVIDIA GRID supports three vGPU scheduling options: Best Effort, Equal Share, and Fixed Share, and the choice among them depends on the use case. The Best Effort scheduler allocates GPU time to VMs in a round-robin fashion; we used it in the experiments above. In some circumstances, a VM running a GPU-intensive application may affect the performance of a GPU-lightweight application running in another VM. To avoid such a performance impact and ensure quality of service (QoS), you can switch to the Equal Share or Fixed Share scheduler. The Equal Share scheduler gives each powered-on VM an equal share of GPU time. The Fixed Share scheduler gives each VM a fixed share of GPU time based on the vGPU profile associated with the VMs on that physical GPU.
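The difference between Equal Share and Fixed Share can be illustrated with simple arithmetic. The sketch below is our own simplification of the scheduler descriptions above, not NVIDIA's implementation: it computes each VM's nominal share of GPU time under the two policies for the P40-1q example.

```python
# Simplified illustration of nominal GPU-time shares per VM (not the
# actual NVIDIA scheduler implementation). Equal Share divides GPU time
# among the powered-on vGPU VMs; Fixed Share divides it by the maximum
# number of vGPUs the profile allows on the physical GPU.

def equal_share(powered_on_vms):
    return 1.0 / powered_on_vms

def fixed_share(max_vgpus_per_gpu):
    return 1.0 / max_vgpus_per_gpu

# Example: P40-1q profile (up to 24 vGPUs) with only 6 VMs powered on.
print(f"Equal Share per VM: {equal_share(6):.1%}")   # ~16.7%
print(f"Fixed Share per VM: {fixed_share(24):.1%}")  #  ~4.2%, regardless of
                                                     #  how many VMs are on
```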

For a performance comparison, we ran the MNIST handwriting-recognition training load under the Best Effort and Equal Share schedulers for different numbers of VMs.

Figure 7 presents the normalized training time, and Figure 8 presents GPU utilization. As the number of VMs increases, Best Effort shows better performance because when a VM does not use its time slice, that time slice is assigned to another VM that needs the GPU. With Equal Share, that time slice remains reserved for the VM even when it does not use the GPU at that moment. Therefore, the Best Effort scheduler has better GPU utilization, as shown in Figure 8.

Figure 7: Training time of Best Effort vs. Equal Share

Figure 8: GPU utilization of Best Effort vs. Equal Share 

Takeaways

  • Sharing a GPU among VMs using NVIDIA GRID can help increase the consolidation of VMs with vGPUs and reduce hardware, operation, and management costs. 
  • The performance impact of sharing a GPU is small in typical use cases, where the GPU is used infrequently by users. 
  • Choose the number of vGPUs per GPU based on the real ML/DL load. For infrequent and lightweight GPU workloads, you can assign multiple vGPUs per GPU. For workloads that use the GPU frequently, lower the number of vGPUs per GPU until the performance requirement is met.  

Acknowledgments

We would like to thank Aravind Bappanadu, Juan Garcia-Rovetta, Bruce Herndon, Don Sullivan, Charu Chaubal, Mohan Potheri, Gina Rosenthal, Justin Murray, and Ziv Kalmanovich for their support of this work, and we thank Julie Brodeur for her help in reviewing and making recommendations for this blog post.

References

[1] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals, "Recurrent Neural Network Regularization," arXiv:1409.2329, 2014. 

[2] Ann Taylor, Mitchell Marcus, and Beatrice Santorini, "The Penn Treebank: An Overview," in Treebanks: The State of the Art in Syntactically Annotated Corpora, ed. Anne Abeille, Kluwer, 2003.  

[3] Yann LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 86(11):2278-2324, November 1998.  

 

 

VMware’s AI-based Performance Tool Can Improve Itself Automatically

PerfPsychic, our AI-based performance analysis tool, improves its accuracy rate from 21% to 91% with more data and training when debugging vSAN performance issues. Even better, PerfPsychic can continuously improve itself, and the tuning procedure is automated. Let's examine how we achieve this in the following sections.

How to Improve AI Model Accuracy

Three elements have a huge impact on the training results of deep learning models: the amount of high-quality training data, reasonably configured hyperparameters that control the training process, and sufficient but acceptable training time. In the following examples, we use the same training and testing datasets that we presented in our previous blog.

Amount of Training Data

A key goal for PerfPsychic is to prove the effectiveness of our deep learning pipeline, so we start by gradually adding more labeled data to the training dataset to demonstrate how our models learn from more labeled data and improve their accuracy over time. Figure 1 shows the results: we start with only 20% of the training dataset and gradually label 20% more each time. There is a clear trend that, as more properly labeled data is added, our model learns and improves its accuracy without any further human intervention. The accuracy improves from around 50% when we have only about 1,000 data points to 91% when we have the full set of 5,275 data points. Such accuracy is as good as a programmatic analytic rule that took us three months to tune manually.

Figure 1. Accuracy improvement over larger training datasets

Training Hyperparameters

We next vary several other CNN hyperparameters to show how they were selected for our models, changing only one hyperparameter at a time and training 1,000 CNNs with each configuration. We first vary the number of training iterations, that is, how many times we go through the training dataset. If the number of iterations is too small, the CNNs cannot be trained adequately; if it is too large, training takes much longer and may end up overfitting to the training data. As shown in Figure 2, between 50 and 75 iterations is the best range, with 75 iterations achieving the best accuracy of 91%. A sketch of this kind of sweep follows Figure 2.

Figure 2. Number of training iterations vs. accuracy
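The sweep itself is straightforward to script. The sketch below uses a small scikit-learn classifier on a toy dataset as a stand-in for our CNNs and vSAN data; it only illustrates the pattern of varying one hyperparameter (the iteration count) at a time and recording test accuracy.

```python
# Sketch of sweeping one hyperparameter (training iterations) at a time
# and recording test accuracy. A small scikit-learn model on a toy
# dataset stands in for the CNNs and vSAN data described in the post.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for iterations in [10, 25, 50, 75, 100, 150]:
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=iterations,
                        random_state=0)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"{iterations:4d} iterations -> test accuracy {acc:.3f}")
```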

We next vary the step size, which is the granularity of the search for the best model. In practice, with too small a step size, the optimization is so slow that it cannot reach the optimal point in a limited time; with too large a step size, we risk stepping past optimal points. Figure 3 shows that between 5e-3 and 7.5e-3 the model produces good accuracy, with 5e-3 predicting 91% of the labels correctly.

Figure 3. Step size vs. accuracy

Finally, we evaluate the impact of the issue rate of the training data on accuracy. The issue rate is the percentage of the training data that represents performance issues. In an ideal training set, all labels should be equally represented to avoid overfitting; a biased dataset generally produces overfitting models that can barely achieve high accuracy. Figure 4 below shows that when the training data have an issue rate under 20% (that is, under 20% of the components are faulty), the model basically overfits to the "no-issue" data points and predicts that all components are issue-free. Because our testing data have 21.9% of components without issues, accuracy stays at 21.9%. In contrast, when the issue rate is over 80%, the model simply treats all components as faulty and thus achieves 78.1% accuracy. This explains why it is important to ensure every label is equally represented, and why we keep the issue/no-issue ratio of our data between 40% and 60%. A sketch of this kind of balancing follows Figure 4.

Figure 4. Impact of issue rate
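One simple way to keep the issue rate inside that 40%–60% band is to downsample the over-represented class before training. The sketch below is our own illustration; the label values and record layout are hypothetical.

```python
# Sketch: keep the "issue" rate of the training set near a target ratio
# by downsampling the over-represented class. The labels and record
# layout here are hypothetical; substitute your own data structures.
import random

def balance(samples, target_issue_rate=0.5, seed=0):
    rng = random.Random(seed)
    issue = [s for s in samples if s["label"] == "issue"]
    no_issue = [s for s in samples if s["label"] == "no_issue"]
    # Downsample whichever class exceeds the target ratio.
    if len(issue) / len(samples) > target_issue_rate:
        keep = int(len(no_issue) * target_issue_rate / (1 - target_issue_rate))
        issue = rng.sample(issue, keep)
    else:
        keep = int(len(issue) * (1 - target_issue_rate) / target_issue_rate)
        no_issue = rng.sample(no_issue, keep)
    balanced = issue + no_issue
    rng.shuffle(balanced)
    return balanced
```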

Training Duration

Training time is also an important factor in a practical deep learning pipeline design. Because we train thousands of CNN models, spending one second longer to train a model means a whole training phase takes 1,000 seconds longer. Figure 5 below shows training time versus data size and number of iterations. Both factors show a linear trend: with more data and more iterations, training takes linearly longer. Fortunately, we know from the study above that more than 75 iterations does not help accuracy. By limiting the number of iterations, we can complete a whole phase of training in less than 9 hours. Again, once the off-line training is done, the model can perform real-time prediction in just a few milliseconds; the training time simply affects how often and how fast the models can pick up new feedback from product experts.

Figure 5. Effect of data size and iteration on training time

Automation

The model-selection procedure is fully automated. Thousands of models with different hyperparameter settings are trained in parallel on our GPU-enabled servers. The trained models then compete with each other by analyzing our prepared testing data, and we pick the model with the highest accuracy, put it into PerfPsychic, and use it for online analysis. We also keep a record of the parameters of the winning models and use them as initial setups in future training rounds, so our models keep evolving.
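A minimal sketch of this selection loop is shown below. The train_and_score function is a placeholder for the real CNN training and testing code, and the hyperparameter names and ranges are illustrative only.

```python
# Sketch of the automated model-selection loop: train many candidate
# configurations in parallel, score each on held-out testing data, keep
# the winner, and reuse its hyperparameters to seed the next round.
# train_and_score is a placeholder for the real CNN training/testing code;
# hyperparameter names and ranges are illustrative.
import json
import random
from concurrent.futures import ProcessPoolExecutor

def train_and_score(config):
    """Placeholder: train a model with `config`, return test accuracy."""
    random.seed(str(config))
    return random.uniform(0.2, 0.95)          # stand-in for real accuracy

def candidate_configs(n=1000, seed=0):
    rng = random.Random(seed)
    return [{"iterations": rng.choice([50, 75, 100]),
             "step_size": rng.choice([2.5e-3, 5e-3, 7.5e-3])}
            for _ in range(n)]

if __name__ == "__main__":
    configs = candidate_configs()
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(train_and_score, configs))
    best_config, best_score = max(zip(configs, scores), key=lambda cs: cs[1])
    print(f"best accuracy {best_score:.3f} with {best_config}")
    with open("winning_config.json", "w") as f:  # seed for the next round
        json.dump(best_config, f)
```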

PerfPsychic in Application

PerfPsychic is not only a research project, but also a widely used internal performance analysis tool. It is now used to automatically analyze vSAN performance bugs filed in Bugzilla.

PerfPsychic automatically detects new vSAN performance bugs submitted to Bugzilla and extracts the usable data logs from the bug attachments. It then analyzes the logs with the trained models. Finally, the analysis results, including performance enhancement suggestions, are emailed to the bug submitter and the vSAN developer group.

Below is part of an email received yesterday that gives performance tuning advice on a vSAN bug. Internal information is hidden.

Figure 6. Part of email generated by PerfPsychic to offer performance improvement suggestions

 

VMware Speedily Resolves Customer Issues in vSAN Performance Using AI

We in VMware's Performance team create and maintain various tools to help troubleshoot customer issues. One of the newest uses artificial intelligence to determine storage problems from vast log data much more quickly: what used to take us days now takes seconds. PerfPsychic analyzes storage system performance and finds performance bottlenecks using deep learning algorithms.

Let's examine the benefit that the artificial intelligence (AI) models in PerfPsychic bring when we troubleshoot vSAN performance issues. It takes our trained AI module less than 1 second to analyze a vSAN log and pinpoint performance bottlenecks at an accuracy rate of more than 91%. In contrast, when analyzed manually, an SR ticket on vSAN takes a seasoned performance engineer about one week to de-escalate, with durations ranging from 3 to 14 days. Moreover, AI also wins over traditional analysis algorithms by raising the accuracy rate from around 80% to more than 90%.

Architecture

There are two operation modes in the AI module: an off-line training mode and a real-time prediction mode. In training mode, sets of training data labeled with their performance issues are automatically fed to all potential convolutional neural network (CNN) [1] structures, which we train repeatedly on GPU-enabled servers. We train thousands of models at a time and deploy the one that achieves the best accuracy to the real-time system. In prediction mode, unlabeled user data are sent to the model chosen in the training stage, which provides a prediction of the root cause (the faulty component).

As shown in Figure 1, data in both training and prediction modes are first sent to a data preparation module (Queried Data Preparation), where they are formatted for later stages. The data path then diverges. Let's first follow the dashed line for the path of labeled training data. They are sent to the deep learning training module (DL Model Training) to train an ensemble of thousands of CNNs generated from our carefully designed structures. After going through the training data thousands of times and letting the training accuracy converge to a stable value, the trained CNNs compete with each other in the deep learning model selection module (DL Model Selection), where they are asked to predict the root causes of testing data that the models have never seen before. Their predictions are compared to the real root causes, which are labeled by human engineers, to calculate the testing accuracy rate. Finally, we provide the ensemble of models (Trained DL Model) that achieves the best testing accuracy to the real-time prediction system.

Figure 1: Deep Learning Module Workflow

As you might expect, this training process is both time consuming and resource hungry, so it is carried out off-line on servers equipped with powerful GPUs. In contrast, prediction mode is relatively lightweight and can serve real-time applications.

Following the solid line in Figure 1 for prediction mode, the unlabeled, normalized user data are sent to our carefully picked models, and the root cause (Performance Exception) is predicted with a small amount of computation. The prediction is returned to the upper layer, such as our interactive analytic web UI or the automatic and proactive analysis applications. As part of the interactive analytics, our web UI also provides a means of manually validating the prediction, which automatically triggers the next round of model training. This completes the feedback loop and ensures our models continue to learn from human feedback.

AI Wins Over Manual Debugging

Diagnosing performance problems in a software-defined datacenter (SDDC) is difficult due to both the scale of the systems and the scale of the data. The scale of the software and hardware systems results in complicated behaviors that are not only workload-dependent but also interfere with each other, so pinpointing a root cause requires a thorough examination of the entire datacenter. However, due to the scale of the data collected across a datacenter, this analysis process requires a great deal of human effort, takes an extremely long time, and is prone to errors. Take vSAN for example: dealing with performance-related escalations typically requires cross-departmental efforts examining the vSAN stack, the ESXi stack, and the physical/virtual network stacks. In some cases, it took many engineers months to pinpoint problems outside of the VMware stack, such as physical network misconfigurations. On average, it takes one week to de-escalate a client's service request ticket with the effort of many experienced engineers working together.

PerfPsychic is designed to address the challenges we have faced and to make performance diagnostics more scalable. It builds on a data infrastructure that is at least 10 times faster and 100 times more scalable than the existing one, and it provides an end-to-end interactive analytic UI that allows users to perform the majority of the analysis in one place. The analysis results are then immediately fed back to the deep learning pipeline in the backend, which produces diagnostic models that detect faulty components more accurately as more feedback is collected. These models mostly take only a few hours to train, and they can detect faulty components in a given dataset in a few milliseconds, with accuracy comparable to rules that took us months to tune manually.

AI Wins Over Traditional Algorithms

To prove the effectiveness of our AI approach, we tested it against traditional machine learning algorithms.

First, we created two datasets: training data and testing data, as summarized in Table 1.

Table 1: Training and Testing Data Property

The training data are generated from our simulated environment: a simple 4-node hybrid vSAN setup. We manually insert performance errors into our test environment to collect training data with accurate labels. In the example of a network issue, we simulate packet drops by having the vmkernel drop a received packet at the VMK TCP/IP layer for every N packets, which mimics the behavior of packet drops in the physical network; we vary N to produce enough data points for training. Although this does not 100% reproduce what happens in a customer environment, it is still a best practice because it is the only cost-effective way to get a large volume of labeled data that are clean and accurate.

The testing data, in contrast to the training data, are all from customer escalations, which have very different system configurations in many aspects (number of hosts, types and number of disks, workloads, and so on). In our testing data, 78.1% of the data are labeled with performance issues. Note that a "performance issue" refers to a specific component in the system that is causing the performance problem in the dataset. We define "accuracy" as the percentage of components from the testing datasets to which the CNN model assigns the correct label ("issue" or "no issue").

With the same training data, we trained one CNN and four popular machine learning models: Support Vector Machine (SVM) [2], Logistic Classification (LOG) [3], Multi-layer Perceptron Neural Network (MLP) [4], and Multinomial Naïve Bayes (MNB) [5]. We then tested the five models against the testing dataset. To quantify model performance, we calculate accuracy as follows.
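Based on the definition given earlier, accuracy is computed as:

\[
\text{accuracy} = \frac{\text{number of components labeled correctly ("issue" or "no issue")}}{\text{total number of components in the testing datasets}} \times 100\%
\]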

Finally, we compared the accuracy rates achieved by each model, shown in Figure 2. The results reveal that AI is the clear winner, with 91% accuracy.

Figure 2: Analytic Algorithm Accuracy Comparison

Acknowledgments

We appreciate the assistance and feedback from Chien-Chia Chen, Amitabha Banerjee, and Xiaobo Huang. We are also grateful for the support of our manager, Rajesh Somasundaran. Lastly, we thank Julie Brodeur for her help in reviewing and making recommendations for this blog post.

References

  1. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, abs/1409.4842, 2014.
  2. A. J. Smola and B. Schölkopf, "A Tutorial on Support Vector Regression," Statistics and Computing, Volume 14, Issue 3, August 2004, pp. 199-222.
  3. C. Bishop, "Pattern Recognition and Machine Learning," Chapter 4.3.4.
  4. D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," http://www.iro.umontreal.ca/~pift6266/A06/refs/backprop_old.pdf.
  5. H. Zhang, "The optimality of Naive Bayes," Proc. FLAIRS, 2004, http://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf.

Machine Learning on VMware vSphere 6 with NVIDIA GPUs

by Uday Kurkure, Lan Vu, and Hari Sivaraman

Machine learning is an exciting area of technology that allows computers to learn without being explicitly programmed, that is, in somewhat the way a person might learn. This technology is increasingly applied in many areas like health science, finance, and intelligent systems, among others.

In recent years, the emergence of deep learning and the improvement of accelerators like GPUs have driven tremendous adoption of machine learning applications across broader and deeper aspects of our lives. Application areas include facial recognition in images, medical diagnosis from MRIs, robotics, automobile safety, and text and speech recognition.

Machine learning workloads have also become a critical part of cloud computing. For cloud environments based on vSphere, you can even deploy a machine learning workload yourself using GPUs via VMware DirectPath I/O or vGPU technology.

GPUs reduce the time it takes for a machine learning or deep learning algorithm to learn (known as the training time) from hours to minutes. In a series of blogs, we will present the performance results of running machine learning benchmarks on VMware vSphere using NVIDIA GPUs.

This is episode 1.

Episode 1: Performance Results of Machine Learning with DirectPath I/O and NVIDIA GPUs

In this episode, we present the performance results of running machine learning benchmarks on VMware vSphere with NVIDIA GPUs in DirectPath I/O mode and in GRID virtual GPU (vGPU) mode.
