By Uday Kurkure, Lan Vu, and Hari Sivaraman
Machine learning is an exciting area of technology that allows computers to learn without being explicitly programmed, that is, in the way a person might learn. It is increasingly applied in areas such as health science, finance, and intelligent systems.
In recent years, the emergence of deep learning and improvements in accelerators like GPUs have driven broad adoption of machine learning applications across many aspects of our lives. Application areas include facial recognition in images, medical diagnosis from MRIs, robotics, automobile safety, and text and speech recognition.
Machine learning workloads have also become a critical part of cloud computing. In cloud environments based on vSphere, you can deploy a machine learning workload yourself using GPUs via VMware DirectPath I/O or vGPU technology.
GPUs reduce the time it takes for a machine learning or deep learning algorithm to learn (known as the training time) from hours to minutes. In this series of blog posts, we present the performance results of running machine learning benchmarks on VMware vSphere using NVIDIA GPUs.
This is episode 1. Also see:
- Episode 2: Machine Learning on vSphere 6 with NVIDIA GPUs
- Episode 3: Performance Comparison of Native GPU to Virtualized GPU and Scalability of Virtualized GPUs for Machine Learning
Episode 1: Performance Results of Machine Learning with DirectPath I/O and NVIDIA GPUs
In this episode, we present the performance results of running machine learning benchmarks on VMware vSphere with NVIDIA GPUs in DirectPath I/O mode and in GRID virtual GPU (vGPU) mode.
Training Time Reduction from Hours to Minutes
Training time is the performance metric used in supervised machine learning: it is the amount of time a computer takes to learn how to solve the given problem. In supervised machine learning, the computer is given labeled training data, that is, data in which the answer can be found, and it infers a model from that data.
Our first machine learning benchmark is a simple demo model in the TensorFlow library. The model classifies handwritten digits from the MNIST dataset. Each digit is a handwritten number that is centered within a consistently sized grayscale bitmap. The MNIST database of handwritten digits contains 60,000 training examples and has a test set of 10,000 examples.
First, we compare training times for the model using two different virtual machine configurations:
- NVIDIA GRID Configuration (vg1c12m60GB): 1 vGPU, 12 vCPUs, 60GB memory, 96GB of SSD storage, CentOS 7.2
- No GPU configuration (g0c12m60GB): No GPU, 12 vCPUs, 60GB memory, 96GB of SSD storage, CentOS 7.2
MNIST | vg1c12m60GB 1 vGPU (secs) | g0c12m60GB No GPU (secs) |
Normalized Training Time (wrt vg1c12) | 1.0 | 10.06 |
CPU Utilization | 8% | 43% |
The above table shows that the vGPU reduces training time by about 10 times. CPU utilization also drops by about 5 times, from 43% to 8%. See the graphs below.
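The ratios quoted above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not the tooling used for the benchmark: the helper name `normalize` is hypothetical, and the inputs are the normalized training times and CPU utilization figures from the table.

```python
def normalize(values, baseline_key):
    """Divide every value by the baseline configuration's value."""
    baseline = values[baseline_key]
    return {name: value / baseline for name, value in values.items()}

# Normalized training times from the table above (vGPU run = 1.0).
training_time = {"vg1c12m60GB": 1.0, "g0c12m60GB": 10.06}
# Reported CPU utilization, in percent.
cpu_utilization = {"vg1c12m60GB": 8.0, "g0c12m60GB": 43.0}

time_ratio = normalize(training_time, "vg1c12m60GB")["g0c12m60GB"]
cpu_ratio = normalize(cpu_utilization, "vg1c12m60GB")["g0c12m60GB"]

print(f"Training takes {time_ratio:.2f}x as long without a GPU")
print(f"CPU utilization is {cpu_ratio:.1f}x as high without a GPU")
```

The CPU utilization ratio works out to 43 / 8, or roughly 5.4, which is the "about 5 times" figure in the text.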
Scaling from One GPU to Four GPUs
This machine learning benchmark is made up of two components:
- The convolutional neural network model provided in the TensorFlow library.
- The CIFAR-10 dataset, which consists of RGB images of 32×32 pixels labeled in 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
We use the metric of images per second (images/sec) to compare the different configurations as we scale from a single GPU to 4 GPUs. The metric of images/second denotes the number of images processed per second in training the model.
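As an illustration of the images/sec metric, the sketch below times a number of training steps and divides the images processed by the elapsed time. It is an assumption-laden example, not the benchmark's own measurement code: `images_per_second` and `dummy_train_step` are hypothetical names, and the dummy step sleeps instead of training a model.

```python
import time

def images_per_second(train_step, batch_size, num_steps):
    """Run `train_step` `num_steps` times and return images processed per second."""
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step()
    elapsed = time.perf_counter() - start
    return (batch_size * num_steps) / elapsed

def dummy_train_step():
    # Stand-in for one training iteration on a batch; sleeps instead of computing.
    time.sleep(0.001)

rate = images_per_second(dummy_train_step, batch_size=128, num_steps=20)
print(f"{rate:.0f} images/sec")
```

In a real run, `train_step` would execute one training iteration on a batch of `batch_size` images.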
Our host has two NVIDIA M60 cards. Each card has 2 GPUs. We present the performance results for scaling up from 1 GPU to 4 GPUs.
You can configure the GPUs in two modes:
- DirectPath I/O passthrough: In this mode, the host can be configured to have 1 to 4 GPUs in a DirectPath I/O passthrough mode. A virtual machine running on the host will have access to 1 to 4 GPUs in passthrough mode.
- GRID vGPU mode: For machine learning workloads, each VM should be configured with the highest vGPU profile. Since we have M60 GPUs, we configured the VMs with vGPU type M60-8q. The M60-8q profile implies one VM per GPU.
DirectPath I/O
First we focus on DirectPath I/O passthrough mode as we scale from 1 GPU to 4 GPUs.
CIFAR-10 | g1c48m60GB | g2c48m60GB | g4c48m60GB |
 | 1 GPU | 2 GPUs | 4 GPUs |
Normalized Images/sec in Thousands (w.r.t. 1 GPU) | 1.0 | 2.04 | 3.74 |
CPU Utilization | 25% | 44% | 71% |
As the above table shows, the number of images processed per second improves almost linearly with the number of GPUs on the host. With the 1-GPU result normalized to 1,000 images/sec, we expect 2 GPUs to handle about double that, which the graph shows (2,040 images/sec). Next, we see that 4 GPUs can handle nearly 4,000 images/sec (3,740 images/sec).
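One way to quantify "almost linearly" is scaling efficiency: the measured speedup divided by the number of GPUs. The sketch below applies that to the normalized values in the table; the function name `scaling_efficiency` is hypothetical, and the metric itself is our framing rather than something reported in the benchmark.

```python
# Normalized images/sec from the DirectPath I/O table (1 GPU = 1.0).
normalized_throughput = {1: 1.0, 2: 2.04, 4: 3.74}

def scaling_efficiency(throughput_by_gpus):
    """Return measured speedup divided by GPU count for each configuration."""
    return {gpus: speedup / gpus for gpus, speedup in throughput_by_gpus.items()}

efficiency = scaling_efficiency(normalized_throughput)
for gpus, eff in efficiency.items():
    print(f"{gpus} GPU(s): {eff:.1%} of ideal linear scaling")
```

With these inputs, 2 GPUs come out at about 102% of ideal and 4 GPUs at about 93.5%.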
Host CPU utilization also increases linearly, as shown in the following graph.
Single GPU DirectPath I/O vs GRID vGPU mode
Now, we present a comparison of performance results for DirectPath I/O and GRID vGPU mode.
Since each VM can have only one vGPU in GRID vGPU mode, we first compare the 1-GPU configuration in DirectPath I/O mode with GRID vGPU mode.
MNIST | g1c48m60GB | vg1c48m60GB |
(Lower Is Better) | DirectPath I/O | GRID vGPU |
Normalized Training Times | 1.0 | 1.05 |
CIFAR-10 | g1c48m60GB | vg1c48m60GB |
(Higher Is Better) | DirectPath I/O | GRID vGPU |
Normalized Images/sec | 1.0 | 0.87 |
The above tables show that the 1-GPU configurations in DirectPath I/O mode and GRID vGPU mode are close in performance: in GRID vGPU mode, training time is 5% longer on MNIST and throughput is 13% lower on CIFAR-10. We suggest you use GRID vGPU mode because it offers the benefits of virtualization.
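Because one table is "lower is better" and the other is "higher is better", the gap between the two modes has to be computed in opposite directions. The following sketch shows that arithmetic on the normalized values above; `relative_gap` is a hypothetical helper name.

```python
def relative_gap(value, baseline, lower_is_better):
    """Return how much worse `value` is than `baseline`, as a fraction of baseline."""
    if lower_is_better:
        return (value - baseline) / baseline
    return (baseline - value) / baseline

# Normalized results from the tables above (DirectPath I/O = 1.0).
mnist_gap = relative_gap(1.05, 1.0, lower_is_better=True)    # training time
cifar_gap = relative_gap(0.87, 1.0, lower_is_better=False)   # images/sec

print(f"MNIST: vGPU training time is {mnist_gap:.0%} longer")
print(f"CIFAR-10: vGPU throughput is {cifar_gap:.0%} lower")
```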
Multi-GPU DirectPath I/O vs Multi-VM DirectPath I/O vs Multi-VMs in GRID vGPU mode
Now we move on to multi-GPU performance results for DirectPath I/O and GRID vGPU mode. In DirectPath I/O mode, a VM can be configured with all the GPUs on the host. In our case, we configured the VM with 4 GPUs. In GRID vGPU mode, each VM can have at most 1 vGPU. Therefore, we compare the results of 4 VMs, each running the same job, with those of one VM using 4 GPUs in DirectPath I/O mode.
CIFAR-10 | g4c48m60GB | g1c12m16GB (4 VMs) | vg1c12m16GB (4 VMs) |
 | DirectPath I/O | DirectPath I/O (4 VMs) | GRID vGPU (4 VMs) |
Normalized Images/Sec (Higher Is Better) | 1.0 | 0.98 | 0.92 |
CPU Utilization | 71% | 68% | 69% |
The multi-GPU DirectPath I/O mode configuration performs better. If your workload requires low latency or a short training time, you should use multi-GPU DirectPath I/O mode. However, other virtual machines will not be able to use the GPUs on the host at the same time. If you can tolerate longer latencies or training times, we recommend using a 1-GPU configuration in GRID vGPU mode, which enables the benefits of virtualization: flexibility and elasticity.
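For the multi-VM columns, a natural way to obtain a single normalized figure is to sum the images/sec of the individual VMs and divide by the rate of the single 4-GPU VM. The post does not spell out how the 4-VM numbers were combined, so treat this as an assumption. The function name and the raw rates below are hypothetical, chosen only to show the arithmetic, and are not measured results.

```python
def aggregate_normalized(per_vm_rates, baseline_rate):
    """Sum per-VM images/sec and normalize to the single multi-GPU VM's rate."""
    return sum(per_vm_rates) / baseline_rate

# Hypothetical raw numbers for illustration only (not benchmark data).
baseline = 1000.0                        # one VM with 4 GPUs, images/sec
per_vm = [230.0, 235.0, 240.0, 225.0]    # four VMs with 1 GPU each, images/sec

normalized = aggregate_normalized(per_vm, baseline)
print(f"Aggregate normalized throughput: {normalized:.2f}")
```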
Takeaways
- GPUs bring the training times of machine learning algorithms from hours to minutes.
- You can use NVIDIA GPUs in two modes in the VMware vSphere environment for machine learning applications:
  - DirectPath I/O passthrough mode
  - GRID vGPU mode
- You should use GRID vGPU mode with the highest vGPU profile. The highest vGPU profile implies 1 VM/GPU, thus giving the virtual machine full access to the entire GPU.
- For a 1-GPU configuration, the performance of the machine learning applications in GRID vGPU mode is comparable to DirectPath I/O.
- For the shortest training time, you should use a multi-GPU configuration in DirectPath I/O mode.
- For running multiple machine learning jobs simultaneously, you should use GRID vGPU mode. This configuration offers a higher consolidation of virtual machines and leverages the flexibility and elasticity benefits of VMware virtualization.
Go to Machine Learning on vSphere 6 with NVIDIA GPUs – Episode 2.
References
- https://developer.nvidia.com/deep-learning
- https://pubs.vmware.com/horizon-7-view/index.jsp?topic=%2Fcom.vmware.horizonview.linuxdesktops.doc%2FGUID-5C0C3670-15D0-4892-87D6-7B7F1E4B5119.html
- http://yann.lecun.com/exdb/mnist/
- https://www.tensorflow.org
- https://www.cs.toronto.edu/~kriz/cifar.html
Configuration Details
Host Configuration
Model | Dell PowerEdge R730 |
Processor Type | Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz |
CPU Cores | 24 CPUs, each @ 2.499GHz |
Processor Sockets | 2 |
Cores per Socket | 12 |
Logical Processors | 48 |
Hyperthreading | Active |
Memory | 768GB |
Storage | Local SSD (1.5TB), Storage Arrays, Local Hard Disks |
GPUs | 2x NVIDIA Tesla M60 |
Software Configuration
ESXi | 6.0.0, 3500742 |
Guest OS | CentOS Linux release 7.2.1511 (Core) |
CUDA Driver | 7.5 |
CUDA Runtime | 7.5 |
VM Configurations
VM | vCPUs | Memory | Storage | GPUs | Guest OS | Mode |
g0xc12m60GB | 12 vCPUs | 60GB | 1x96GB (SSD) | 0 | CentOS 7.2 | No GPU |
g1xc12m60GB | 12 vCPUs | 60GB | 1x96GB (SSD) | 1 | CentOS 7.2 | DirectPath I/O |
g2xc48m60GB | 48 vCPUs | 60GB | 1x96GB (SSD) | 2 | CentOS 7.2 | DirectPath I/O |
g4xc48m60GB | 48 vCPUs | 60GB | 1x96GB (SSD) | 4 | CentOS 7.2 | DirectPath I/O |
vg1xc12m60GB | 12 vCPUs | 60GB | 1x96GB (SSD) | 1 | CentOS 7.2 | GRID vGPU |
g1c12m16GB | 12 vCPUs | 16GB | 1x96GB (SSD) | 1 | CentOS 7.2 | DirectPath I/O |
vg1c12m16GB | 12 vCPUs | 16GB | 1x96GB (SSD) | 1 | CentOS 7.2 | GRID vGPU |