
Performance Comparison of Containerized Machine Learning Applications Running Natively with Nvidia vGPUs vs. in a VM – Episode 4

This article is by Hari Sivaraman, Uday Kurkure, and Lan Vu from the Performance Engineering team at VMware.

Performance Comparison of Containerized Machine Learning Applications

Docker containers [6] are rapidly becoming a popular environment in which to run different applications, including those in machine learning [1, 2, 3]. NVIDIA supports Docker containers with their own Docker engine utility, nvidia-docker [7], which is specialized to run applications that use NVIDIA GPUs.

The nvidia-docker container for machine learning includes the application and the machine learning framework (for example, TensorFlow [5]) but, importantly, it does not include the GPU driver or the CUDA toolkit.

Docker containers are hardware agnostic, so when an application uses specialized hardware like an NVIDIA GPU that needs kernel modules and user-level libraries, the container cannot include the required drivers; they live outside the container.

One workaround is to install the driver inside the container and map its devices upon launch. This workaround is not portable, since the driver versions inside the container need to match those in the native operating system.

The nvidia-docker engine utility provides an alternate mechanism that mounts the user-mode components at launch, but this requires you to install the driver and CUDA in the native operating system before launch. Both approaches have drawbacks, but the latter is clearly preferable.
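To make this concrete, here is a minimal sketch of how such a container might be launched from Python once the driver and CUDA are installed on the host; the image tag, data path, and training script are illustrative placeholders, not the exact artifacts used in our tests.

    # Sketch: launching a GPU-enabled TensorFlow container via the nvidia-docker
    # wrapper. The image tag, mounted data path, and script name are hypothetical
    # placeholders. The host (or VM) must already have the NVIDIA/vGPU driver and
    # CUDA installed; nvidia-docker mounts the matching user-mode components into
    # the container at launch.
    import subprocess

    IMAGE = "nvcr.io/nvidia/tensorflow:latest"        # placeholder image tag
    cmd = [
        "nvidia-docker", "run", "--rm",
        "-v", "/data/mnist:/workspace/data",          # host dataset made visible in the container
        IMAGE,
        "python", "/workspace/convolutional.py",      # placeholder MNIST training script
    ]
    subprocess.run(cmd, check=True)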

In this episode of our series of blogs [8, 9, 10] on machine learning in vSphere using GPUs, we compare the performance of MNIST [4] running in a container on CentOS executing natively against MNIST running in a container inside a CentOS VM on vSphere. Based on our experiments, we demonstrate that running containers in a virtualized environment, such as a CentOS VM on vSphere, incurs no performance penalty, while benefiting from the tremendous management capabilities offered by the VMware vSphere platform.

Experiment Configuration and Methodology

We used MNIST [4] to compare the performance of containers running natively with containers running inside a VM. The configuration of the VM and the vSphere server we used for the “virtualized container” is shown in Table 1. The configuration of the physical machine used to run the container natively is shown in Table 2.

vSphere 6.0.0, build 3500742
Nvidia vGPU driver 367.53
Guest OS CentOS Linux release 7.4.1708 (Core)
CUDA driver 8.0
CUDA runtime 7.5
Docker 17.09-ce-rc2

Table 1. Configuration of VM used to run the nvidia-docker container

Nvidia driver 384.98
Operating system CentOS Linux release 7.4.1708 (Core)
CUDA driver 8.0
CUDA runtime 7.5
Docker 17.09-ce-rc2

Table 2. Configuration of physical machine used to run the nvidia-docker container

The server configuration we used is shown in Table 3 below. In our experiments, we used the NVIDIA M60 GPU in vGPU mode only; we did not use the DirectPath I/O mode. In the scenario in which we ran the container inside the VM, we first installed the NVIDIA vGPU drivers in vSphere and inside the VM, then we installed CUDA (driver 8.0 with runtime version 7.5), followed by Docker and nvidia-docker [7]. In the case where we ran the container natively, we installed the NVIDIA driver in CentOS running natively, followed by CUDA (driver 8.0 with runtime version 7.5), Docker, and finally nvidia-docker [7]. In both scenarios we ran MNIST, and we measured the wall-clock run time for training.
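The wall-clock measurement itself is straightforward; the sketch below shows the kind of timing wrapper we mean, with the training entry point stubbed out as a placeholder.

    # Sketch of the wall-clock measurement: record elapsed real time around the
    # full MNIST training run. train_mnist() is a stub standing in for the actual
    # training entry point executed inside the container.
    import time

    def train_mnist():
        # Placeholder for the containerized MNIST training job.
        pass

    start = time.time()
    train_mnist()
    elapsed = int(time.time() - start)

    minutes, seconds = divmod(elapsed, 60)
    print(f"MNIST training wall-clock time: {minutes} minutes {seconds} seconds")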

 Figure 1. Testbed configuration for comparison of the performance of containers running natively vs. running in a VM

Model Dell PowerEdge R730
Processor type Intel® Xeon® CPU E5-2680 v3 @ 2.50GHz
CPU cores 24 CPUs, each @ 2.5GHz
Processor sockets 2
Cores per socket 14
Logical processors 48
Hyperthreading Active
Memory 768GB
Storage Local SSD (1.5TB), Storage Arrays, Local Hard Disks
GPUs 2x M60 Tesla

Table 3. Server configuration

Results

The measured wall-clock run times for MNIST are shown in Table 4 for the two scenarios we tested:

  1. Running in an nvidia-docker container in CentOS running natively.
  2. Running in an nvidia-docker container inside a CentOS VM on vSphere.

From the data, we can clearly see that there is no measurable performance penalty for running a container inside a VM as compared to running it natively.

Configuration Run time for MNIST as measured by a wall clock
Nvidia-docker container in CentOS running natively 44 minutes 53 seconds
Nvidia-docker container running in a CentOS VM on vSphere 44 minutes 57 seconds

Table 4. Comparison of the run time for MNIST running in a container on native CentOS vs. in a container in virtualized CentOS

Takeaways

  • Based on the results shown in Table 4, it is clear that there is no measurable performance impact due to running a containerized application in a virtual environment as opposed to running it natively. So, from a performance perspective, there is no penalty for using a virtualized environment.
  • It is important to note that since containers do not include the GPU driver or the CUDA environment, both of these components need to be installed separately. It is in this respect that a virtualized environment offers a superior user experience: an nvidia-docker container in CentOS running natively requires that any existing GPU and CUDA drivers be removed if their versions do not match those required by the container. Uninstalling and re-installing the correct drivers is often a challenging and time-consuming task. In a virtualized environment, however, you can create, in advance, a repository of CentOS VMs with different vGPU and CUDA drivers. When you need to run an application in an nvidia-docker container, you simply clone the VM with the correct drivers, load the container, and run it with no performance penalty. In such a scenario, running in a virtualized environment does not require you to uninstall and re-install drivers, which saves both time and considerable frustration. The driver problem becomes considerably more difficult in a native environment when there are multiple container users on the system: either all the containers must be migrated to use the new drivers, or the user who needs a new driver has to wait until all the other users are done before a system administrator can upgrade the GPU drivers on the native CentOS.

Future Work

In this blog, we presented the performance results of running MNIST in a single container. We plan to run MNIST in multiple containers running concurrently in both a virtualized environment and on CentOS executing natively, and report the measured run times. This will provide a comparison of the performance as we scale up the number of containers.

References

  1. Google Cloud Platform: Cloud AI. https://cloud.google.com/products/machine-learning/
  2. Wikipedia: Deep Learning. https://en.wikipedia.org/wiki/Deep_learning
  3. NVIDIA GPUs – The Engine of Deep Learning. https://developer.nvidia.com/deep-learning
  4. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
  5. TensorFlow: An Open-Source Software Library for Machine Intelligence. https://www.tensorflow.org
  6. Wikipedia: Operating-System-Level Virtualization. https://en.wikipedia.org/wiki/Operating-system-level_virtualization
  7. NVIDIA Docker: GPU Server Application Deployment Made Easy. https://devblogs.nvidia.com/parallelforall/nvidia-docker-gpu-server-application-deployment-made-easy/
  8. Episode 1: Performance Results of Machine Learning with DirectPath I/O and GRID vGPU. https://blogs.vmware.com/performance/2016/10/machine-learning-vsphere-nvidia-gpus.html
  9. Episode 2: Machine Learning on vSphere 6 with NVIDIA GPUs. https://blogs.vmware.com/performance/2017/03/machine-learning-vsphere-6-5-nvidia-gpus-episode-2.html
  10. Episode 3: Performance Comparison of Native GPU to Virtualized GPU and Scalability of Virtualized GPUs for Machine Learning. https://blogs.vmware.com/performance/2017/10/episode-3-performance-comparison-native-gpu-virtualized-gpu-scalability-virtualized-gpus-machine-learning.html 

Episode 3: Performance Comparison of Native GPU to Virtualized GPU and Scalability of Virtualized GPUs for Machine Learning

In our third episode of machine learning performance with vSphere 6.x, we look at the virtual GPU vs. the physical GPU. In addition, we extend the performance results of machine learning workloads using VMware DirectPath I/O (passthrough) vs. NVIDIA GRID vGPU that were partially addressed in previous episodes.

Machine Learning with Virtualized GPUs

Despite virtualization benefits such as reduced administration costs, more efficient resource utilization, energy savings, and security, performance is one of the biggest concerns that keeps high performance computing (HPC) users from choosing virtualization as the solution for deploying HPC applications. However, with the constant evolution of virtualization technologies, the performance gap between bare metal and virtualization has almost disappeared, and in some use cases virtualized applications can achieve better performance than bare metal because of the intelligent and highly optimized resource utilization of hypervisors. For example, a prior study [1] shows that vector machine applications running on a virtualized cluster of 10 servers have a better execution time than running on bare metal.

Virtual GPU vs. Physical GPU

To understand the performance impact of virtualization on machine learning with GPUs, we used a complex language-modeling application: predicting the next word given a history of previous words, using a recurrent neural network (RNN) with 1500 Long Short-Term Memory (LSTM) units per layer, trained on the Penn Treebank (PTB) dataset [2, 3], which has:

  • 929,000 training words
  • 73,000 validation words
  • 82,000 test words
  • 10,000 vocabulary words
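For readers who want a concrete picture of the model, the sketch below builds a language model of roughly this shape with the Keras API: an embedding layer, two LSTM layers of 1500 units each, and logits over the 10,000-word vocabulary. It is a simplified stand-in for the TensorFlow PTB benchmark we actually ran, and the hyperparameters other than the layer width and vocabulary size are illustrative.

    # Simplified sketch of a PTB-style language model: embedding, two LSTM layers
    # with 1500 units each, and logits over the 10,000-word vocabulary. Dropout,
    # optimizer, and sequence handling are illustrative; the original benchmark
    # used the TensorFlow PTB tutorial implementation.
    import tensorflow as tf

    VOCAB_SIZE = 10000    # PTB vocabulary
    HIDDEN_UNITS = 1500   # LSTM units per layer (the "large" PTB configuration)

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(VOCAB_SIZE, HIDDEN_UNITS),
        tf.keras.layers.LSTM(HIDDEN_UNITS, return_sequences=True),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.LSTM(HIDDEN_UNITS, return_sequences=True),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(VOCAB_SIZE),   # next-word logits at every position
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )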

We tested three cases:

  • A physical GPU installed on bare metal (this is the “native” configuration)
  • A DirectPath I/O GPU inside a VM on vSphere 6
  • A GRID vGPU (that is, an M60-8Q vGPU profile with 8GB memory) inside a VM on vSphere 6

The VM in the last two cases has 12 virtual CPUs (vCPUs), 60GB RAM, and 96GB SSD storage.

The benchmark was implemented using TensorFlow [4], which was also used for the implementation of the other machine learning benchmarks in our experiments. We used CUDA 7.5, cuDNN 5.1, and CentOS 7.2 for both the native and guest operating systems. These test cases were run on a Dell PowerEdge R730 server with two 12-core Intel Xeon E5-2680 v3 processors at 2.50 GHz (24 physical cores, 48 logical processors with hyperthreading enabled), 768GB of memory, and an SSD (1.5TB). This server also had two NVIDIA Tesla M60 cards (each with two GPUs) for a total of 4 GPUs; each GPU had 2048 CUDA cores and 8GB of memory, supported 36 H.264 1080p30 video streams, and could host 1–32 GRID vGPUs with memory profiles ranging from 512MB to 8GB. This experimental setup was used for all tests presented in this blog (Figure 1, below).

Figure 1. Testbed configurations for native GPU vs. virtual GPU comparison
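Before running the benchmarks in this testbed, it is worth confirming inside the guest that the GPU (whether passed through with DirectPath I/O or presented as a vGPU profile) is visible to the framework. The check below uses the current TensorFlow API rather than the older TensorFlow/CUDA 7.5 stack from our setup.

    # Sanity check that the DirectPath I/O GPU or GRID vGPU is visible inside the
    # VM. This uses the modern TensorFlow API; the original experiments ran an
    # older TensorFlow build against CUDA 7.5 / cuDNN 5.1.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    print("Visible GPUs:", gpus)
    if not gpus:
        raise RuntimeError("No GPU visible in the guest; check the vGPU profile "
                           "or DirectPath I/O configuration.")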

The results in Figure 2 (below) show the execution times of DirectPath I/O and GRID vGPU relative to the native GPU. Virtualization introduces only about a 4% overhead, and the performance of DirectPath I/O and GRID vGPU is similar. These results are consistent with prior studies of virtual GPU performance with passthrough, where the overheads in most cases are less than 5% [5, 6].

Figure 2. DirectPath I/O and NVIDIA GRID vs. native GPU

GPU vs. CPU in a Virtualized Environment

One important benefit of using GPUs is that they shorten the long training times of machine learning tasks, which has boosted the results of AI research and development in recent years. In many cases, GPUs reduce execution times from weeks or days to hours or minutes. We illustrate this benefit in Figure 3 (below), which shows the training time with and without a vGPU for two applications:

  • RNN with PTB (described earlier)
  • CNN with MNIST—a handwriting recognizer that uses a convolution neural network (CNN) on the MNIST dataset [7].

From the results, we see that the training time for RNN on PTB with the CPU was 7.9 times higher than the vGPU training time (Figure 3-a). The training time for CNN on MNIST with the CPU was 10.1 times higher than the vGPU training time (Figure 3-b). The VM used in this test had 1 vGPU, 12 vCPUs, 60GB of memory, and 96GB of SSD storage; the test setup was otherwise similar to that of the experiment above.

Figure 3. Normalized training time of PTB, MNIST with and without vGPU
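The comparison behind Figure 3 amounts to timing the same training loop with the model pinned to the CPU and then to the GPU. A minimal sketch is below; the training body is a placeholder for the CNN/MNIST or RNN/PTB code.

    # Sketch of the CPU-vs-GPU comparison: time the same training routine with
    # operations pinned to each device. build_and_train() is a placeholder for
    # the MNIST CNN or PTB RNN training code.
    import time
    import tensorflow as tf

    def build_and_train():
        # Placeholder: build the model and run a fixed number of training steps.
        pass

    for device in ("/CPU:0", "/GPU:0"):
        with tf.device(device):
            start = time.time()
            build_and_train()
            print(f"{device}: {time.time() - start:.1f} s")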

As the test results show, we can successfully run machine learning applications in a vSphere 6 virtualized environment, with training times similar to those of machine learning applications running in a native (not virtualized) configuration using physical GPUs.

But what about a passthrough scenario? How does a machine learning application run in a vSphere 6 virtual machine using a passthrough to the physical GPU vs. using a virtualized GPU? We present our findings in the next section.

Comparison of DirectPath I/O and GRID vGPU

We evaluate the performance, scalability, and other benefits of DirectPath I/O and GRID vGPU. We also provide some recommendations on the best use cases for each virtual GPU solution.

Performance

To compare the performance of DirectPath I/O and GRID vGPU, we benchmarked them with RNN on PTB, and CNN on MNIST and CIFAR-10. CIFAR-10 [8] is an object classification application that categorizes RGB images of 32×32 pixels into 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. MNIST is a handwriting recognition application; both CIFAR-10 and MNIST use a convolutional neural network. The language model, which predicts the next word from a history of previous words, uses a recurrent neural network and the Penn Treebank (PTB) dataset.

Figure 4. Performance comparison of DirectPath I/O and GRID vGPU

The results in Figure 4 (above) show the comparative performance of the two virtualization solutions, in which DirectPath I/O achieves slightly better performance than GRID vGPU. This is because the passthrough mechanism of DirectPath I/O adds minimal overhead to GPU-based workloads running inside a VM. In Figure 4-a, DirectPath I/O is about 5% faster than GRID vGPU for MNIST, and the two have the same performance with PTB. For CIFAR-10, DirectPath I/O can process about 13% more images per second than GRID vGPU. We use images per second for CIFAR-10 because it is a frequently used metric for this dataset. The VM in this experiment had 12 vCPUs, 60GB of memory, and one GPU (either DirectPath I/O or GRID vGPU).
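For reference, the images/sec figure is simply the number of training examples processed divided by the elapsed wall-clock time. A minimal sketch is below; the batch size, step count, and training step are illustrative placeholders.

    # Sketch of the images/sec throughput metric used for CIFAR-10: total images
    # processed divided by elapsed wall-clock time. Batch size, step count, and
    # the training step body are hypothetical placeholders.
    import time

    BATCH_SIZE = 128
    NUM_STEPS = 1000

    def train_step(batch_size):
        # Placeholder for one CIFAR-10 training step on the GPU.
        pass

    start = time.perf_counter()
    for _ in range(NUM_STEPS):
        train_step(BATCH_SIZE)
    elapsed = time.perf_counter() - start

    print(f"Throughput: {BATCH_SIZE * NUM_STEPS / elapsed:.1f} images/sec")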

Scalability

We look at two types of scalability: user and GPU.

User Scalability

In a cloud environment, multiple users can share physical servers, which helps to better utilize resources and save cost. Our test server, with 4 GPUs, can support up to 4 users who each need a GPU; alternatively, a single user can have four VMs, each with a vGPU. The number of virtual machines run per server in a cloud environment is typically high in order to increase utilization and lower costs [9], but machine learning workloads are typically much more resource intensive, and limiting our 4-GPU test system to at most 4 such users reflects this.

Figure 5. Scaling the number of VMs with vGPU on CIFAR-10

Figure 5 (above) presents user scalability on CIFAR-10 as the number of users grows from 1 to 4, where each user has a VM with one GPU, and we normalize images per second to the DirectPath I/O, 1-VM case (Figure 5-a). Similar to the previous comparison, DirectPath I/O and GRID vGPU show comparable performance as the number of VMs with GPUs scales. Specifically, the performance difference between them is 6%–10% for images per second and 0%–1.5% for CPU utilization. This difference is not significant when weighed against the benefits that vGPU brings: because of its flexibility and elasticity, vGPU is a good option for machine learning workloads. The results also show that the two solutions scale linearly with the number of VMs, both in terms of execution time and CPU resource utilization. The VMs used in this experiment had 12 vCPUs, 16GB of memory, and 1 GPU (either DirectPath I/O or GRID vGPU).

GPU Scalability

For machine learning applications that need to build very large models, or whose datasets cannot fit into a single GPU, users can spread the workload across multiple GPUs to speed up the training task further. On vSphere, applications that require multiple GPUs can use DirectPath I/O passthrough to configure VMs with as many GPUs as required. This capability is limited for CUDA applications using GRID vGPU, because only 1 vGPU per VM is allowed for CUDA computations.
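With DirectPath I/O, a VM configured with several GPUs can split each training batch across them. The sketch below uses TensorFlow's MirroredStrategy for data-parallel training; it is a modern stand-in for the multi-GPU code used in the original CIFAR-10 benchmark, and the tiny model shown is only illustrative.

    # Sketch of data-parallel training across all GPUs visible in the VM.
    # MirroredStrategy replicates the model on each GPU and splits every batch
    # among the replicas; the model below is a toy stand-in for the CIFAR-10 CNN.
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()   # uses every GPU passed through to the VM
    print("Replicas in sync:", strategy.num_replicas_in_sync)

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(64, 3, activation="relu", input_shape=(32, 32, 3)),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(10),            # 10 CIFAR-10 classes
        ])
        model.compile(
            optimizer="sgd",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        )
    # model.fit(dataset, ...) would then distribute each batch across the GPUs.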

We demonstrate the efficiency of using multiple GPUs on vSphere by benchmarking the CIFAR-10 workload, using images per second (images/sec) as the metric to compare the performance of CIFAR-10 on a VM as the number of GPUs scales from 1 to 4.

From the results in Figure 6 (below), we found that the number of images processed per second improves almost linearly with the number of GPUs on the host (Figure 6-a). At the same time, CPU utilization also increases linearly (Figure 6-b). These results show that machine learning workloads scale well on the vSphere platform. For machine learning applications that require more GPUs than a single physical server can support, we can use a distributed computing model, with multiple GPU-backed processes running on a cluster of physical servers. With this approach, both DirectPath I/O and GRID vGPU can be used to scale out to a very large number of GPUs.

Figure 6. Scaling the number of GPUs per VM on CIFAR-10

How to Choose Between DirectPath I/O and GRID vGPU

For DirectPath I/O

From the above results, we can see that DirectPath I/O and GRID vGPU have similar performance and low overhead compared to the performance of native GPU, which makes both good choices for machine learning applications in virtualized cloud environments. For applications that require short training times and use multiple GPUs to speed up machine learning tasks, DirectPath I/O is a suitable option because this solution supports multiple GPUs per VM. In addition, DirectPath I/O supports a wider range of GPU devices, and so can provide a more flexible choice of GPU for users.

For GRID vGPU

When each user needs a single GPU, GRID vGPU can be a good choice. This configuration provides a higher consolidation of virtual machines and leverages the benefits of virtualization:

  • GRID vGPU allows flexible use of the device because vGPU supports both shared GPU (multiple users per physical GPU) and dedicated GPU (one user per physical GPU) configurations. Mixing and switching among machine learning, 3D graphics, and video encoding/decoding workloads that use GPUs is much easier and allows for more efficient use of the hardware. Using GRID solutions for machine learning and 3D graphics allows cloud-based services to multiplex the GPUs among more concurrent users than the number of physical GPUs in the system. This contrasts with DirectPath I/O, the dedicated GPU solution, where the number of concurrent users is limited to the number of physical GPUs.
  • GRID vGPU reduces administration cost because its deployment and maintenance do not require a server reboot, so no downtime is required for end users. For example, changing the vGPU profile of a virtual machine does not require a server reboot, while any change to a DirectPath I/O configuration does. GRID vGPU's ease of management reduces the time and the complexity of administering and maintaining the GPUs. This benefit is particularly important in a cloud environment, where the number of managed servers can be very large.

Conclusion

Our tests show that virtualized machine learning workloads on vSphere with vGPUs offer near bare-metal performance.

References

  1. Jaffe, D. Big Data Performance on vSphere 6. (August 2016). http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/bigdata-perf-vsphere6.pdf.
  2. Zaremba, W., Sutskever,I., Vinyals, O.: Recurrent Neural Network Regularization. In: arXiv:1409.2329 (2014).
  3. Taylor, A., Marcus, M., Santorini, B.: The Penn Treebank: An Overview. In: Abeille, A. (ed.). Treebanks: the state of the art in syntactically annotated corpora. Kluwer (2003).
  4. TensorFlow Homepage, https://www.tensorflow.org
  5. Vu, L., Sivaraman, H., Bidarkar, R.: GPU Virtualization for High Performance General Purpose Computing on the ESX hypervisor. In: Proc. of the 22nd High Performance Computing Symposium (2014).
  6. Walters, J.P., Younge, A.J., Kang, D.I., Yao, K.T., Kang, M., Crago, S.P., Fox, G.C.: GPU Passthrough Performance: A Comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL Applications. In: Proceedings of 2014 IEEE 7th International Conference on Cloud Computing (2014).
  7. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, 86(11):2278-2324 (November 1998).
  8. Multiple Layers of Features from Tiny Images, https://www.cs.toronto.edu/~kriz/cifar.html
  9. Pandey, A., Vu, L., Puthiyaveettil, V., Sivaraman, H., Kurkure, U., Bappanadu, A.: An Automation Framework for Benchmarking and Optimizing Performance of Remote Desktops in the Cloud. In: To appear in Proceedings of the 2017 International Conference on High Performance Computing & Simulation (2017).

Machine Learning on VMware vSphere 6 with NVIDIA GPUs

by Uday Kurkure, Lan Vu, and Hari Sivaraman

Machine learning is an exciting area of technology that allows computers to learn without being explicitly programmed, that is, much in the way a person might learn. This technology is increasingly applied in many areas like health science, finance, and intelligent systems, among others.

In recent years, the emergence of deep learning and the enhancement of accelerators like GPUs have driven the tremendous adoption of machine learning applications into broader and deeper aspects of our lives. Some application areas include facial recognition in images, medical diagnosis from MRIs, robotics, automobile safety, and text and speech recognition.

Machine learning workloads have also become a critical part of cloud computing. In cloud environments based on vSphere, you can even deploy a machine learning workload yourself, using GPUs via VMware DirectPath I/O or vGPU technology.

GPUs reduce the time it takes for a machine learning or deep learning algorithm to learn (known as the training time) from hours to minutes. In a series of blogs, we will present the performance results of running machine learning benchmarks on VMware vSphere using NVIDIA GPUs.

This is episode 1 of the series.

Episode 1: Performance Results of Machine Learning with DirectPath I/O and NVIDIA GPUs

In this episode, we present the performance results of running machine learning benchmarks on VMware vSphere with NVIDIA GPUs in DirectPath I/O mode and in GRID virtual GPU (vGPU) mode.
