Machine Learning on vSphere 6 with Nvidia GPUs - Episode 2

By Hari Sivaraman, Uday Kurkure, and Lan Vu

In a previous blog [1], we looked at how machine learning workloads (MNIST and CIFAR-10) using TensorFlow running in vSphere 6 VMs in an NVIDIA GRID configuration reduced the training time from hours to minutes when compared to the same system running no virtual GPUs.

Here, we extend our study to multiple workloads (3D CAD and machine learning) run at the same time versus run independently on the same vSphere server.

This is episode 2 of a series of blogs on machine learning with vSphere.

Performance Impact of Mixed Workloads

Many customers of the NVIDIA GRID vGPU solution on vSphere run 3D CAD workloads. The traditional approach is either to run the CAD workloads during the day and the machine learning workloads at night, or to maintain separate infrastructure for each type of workload. Both options are inflexible and can increase deployment cost. We show, in this section, that this kind of separation is often unnecessary. Both workloads can be run concurrently on the same server, and the performance impact on the 3D CAD workload as well as on the machine learning workload is small in three out of four vGPU profiles.

vSphere 6.5 supports four vGPU profiles, and the primary difference between them is the amount of VRAM available to each VM:

M60-1Q: 1GB VRAM
M60-2Q: 2GB VRAM
M60-4Q: 4GB VRAM
M60-8Q: 8GB VRAM

In this blog, we characterize the performance impact of running 3D CAD and machine learning workloads concurrently using two benchmarks. We chose the SPECapc for 3ds Max 2015 [3] benchmark as a representative 3D CAD workload. Because we used it only as a representative workload, we did not comply with the benchmark reporting rules, and we do not use or make comparisons to the official SPECapc metrics. We chose MNIST [2] as a representative machine learning workload. The performance metric in this comparison is simply the run time of each benchmark.

Our results show that, in the M60-2Q, M60-4Q, and M60-8Q profiles, the run time of the 3D CAD benchmark increases by less than 5% when it shares the server and GPUs with the machine learning workload, compared to running only the 3D CAD workload on the same hardware. Correspondingly, the run time of the machine learning workload increases by less than 15% in the same three profiles when it shares the hardware with the 3D CAD workload, compared to running by itself.

Experimental Configuration and Methodology

We installed the 3D CAD benchmark in a 64-bit Windows 7 (SP1) VM with 4 vCPUs and 16GB RAM. The benchmark uses Autodesk 3ds Max 2015 software. The NVIDIA vGPU driver (369.04) was used in the VM. We configured the vGPU profile as M60-1Q, M60-2Q, M60-4Q, or M60-8Q for different runs. We used this VM as the primary one from which we made linked clones so that we could run the 3D CAD benchmark at scale with 1, 2, 4, …, 24 VMs running 3D CAD simultaneously. The software configurations used for the 3D CAD workload are shown in Table 2, below.

ESXi	6.5 #4240417
Guest OS	CentOS Linux release 7.2.1511 (Core)
CUDA Driver & Runtime	7.5
TensorFlow	0.1

⇑ Table 1. Configuration of VM used for machine learning benchmarks

vGPU profile	Total # VMs running concurrently in the test
M60-8Q	3
M60-4Q	6
M60-2Q	12
M60-1Q	24

⇑ Table 2. Software configuration used to run the 3D CAD benchmark

Our experiments include three sets of runs. In the first set, we ran only the 3D CAD benchmark for each of the four configurations listed in Table 2, above, and measured the run time, the ESXi CPU utilization, and the GPU utilization. Once this set of runs was completed, we did a second set in which we ran the 3D CAD benchmark concurrently with the machine learning benchmark. To do this, we first installed the MNIST benchmark, CUDA, cuDNN, and TensorFlow in a CentOS VM with the configuration shown in Table 1, above. Since CUDA only works with an M60-8Q profile, we used that profile for the VM that runs the machine learning benchmark. For the runs in this second set, we used the configurations shown in Table 4 and measured the run time for the 3D CAD benchmark, the run time for MNIST, the total ESXi CPU utilization, and the total GPU utilization. The server configuration used in our experiments is shown in Table 3, below.

Model	Dell PowerEdge R730
CPU	Intel Xeon Processor E5-2680 v4 @ 2.40GHz
CPU cores	28 CPUs, each @ 2.40GHz
Processor sockets	2
Cores per socket	14
Logical processors	56
Hyperthreading	Active
Memory	768GB
Storage	Local SSD (1.5TB), storage arrays, local hard disks
GPUs	2x NVIDIA Tesla M60
ESXi	6.5 #4240417
NVIDIA vGPU driver in ESXi host	367.38
NVIDIA vGPU driver inside VM	369.04

⇑ Table 3. Server configuration

vGPU profile used for 3D CAD VMs only	#VMs running 3D CAD	# VMs running MNIST	Total # VMs running concurrently in test
M60-8Q	3	1	4
M60-4Q	6	1	7
M60-2Q	12	1	13
M60-1Q	24	1	25

⇑ Table 4. Software configuration used to run the mixed workloads (please note that the machine learning VM can be configured only using the M60-8Q profile; no other profiles are supported)
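The VM counts in Tables 2 and 4 are consistent with a simple frame-buffer calculation, sketched below. This is one reading of the tables, not a statement of how the counts were chosen. It assumes that each Tesla M60 card exposes two physical GPUs with 8GB of frame buffer each, and that the M60-8Q machine learning VM occupies one whole physical GPU. The names `PHYSICAL_GPUS`, `GB_PER_PHYSICAL_GPU`, and `cad_vm_capacity` are hypothetical.

```python
# Frame buffer per VM for each vGPU profile, as listed in this blog.
PROFILE_FRAME_BUFFER_GB = {"M60-1Q": 1, "M60-2Q": 2, "M60-4Q": 4, "M60-8Q": 8}

# Assumptions (not stated in the blog): 2 Tesla M60 cards x 2 physical GPUs
# per card, each physical GPU with 8GB of frame buffer.
PHYSICAL_GPUS = 4
GB_PER_PHYSICAL_GPU = 8


def cad_vm_capacity(profile: str, gpus_reserved_for_ml: int = 1) -> int:
    """Number of 3D CAD VMs that fit on the physical GPUs not reserved for ML."""
    vms_per_gpu = GB_PER_PHYSICAL_GPU // PROFILE_FRAME_BUFFER_GB[profile]
    return (PHYSICAL_GPUS - gpus_reserved_for_ml) * vms_per_gpu


for profile in ("M60-8Q", "M60-4Q", "M60-2Q", "M60-1Q"):
    print(profile, cad_vm_capacity(profile))
```

Under these assumptions the sketch yields 3, 6, 12, and 24 VMs running 3D CAD, which matches the counts in the tables.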

We did a third set of runs in which we ran only MNIST on the server. Since MNIST runs only in the M60-8Q profile, only one run was done in this third set.

Results

We compared the run times of the first set of runs (3D CAD benchmark only) with those of the second set (3D CAD and MNIST run concurrently) and computed the percentage increase in the run time for 3D CAD when it shares the server with MNIST compared to when it runs without MNIST. For example, for the M60-4Q profile, we computed the percentage increase in run time for 3D+ML compared to 3D only, both in the M60-4Q profile. We also compared the run time of MNIST running concurrently with 3D CAD against the run time of MNIST running by itself on the server and computed the percentage increase in run time for MNIST. We call this increase in run time the performance drop or change. The computed values are shown in Figure 1 below.

⇑ Figure 1. Percentage increase in run time for 3D graphics (3D) and machine learning (ML) workloads due to running concurrently compared to running in isolation
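The percentage increase plotted in Figure 1 can be sketched as follows. The function name `percent_increase` and the run times in the example are hypothetical and are not measured values from our runs.

```python
def percent_increase(shared_seconds: float, isolated_seconds: float) -> float:
    """Percentage increase in run time when sharing the server vs. running alone."""
    if isolated_seconds <= 0:
        raise ValueError("isolated run time must be positive")
    return 100.0 * (shared_seconds - isolated_seconds) / isolated_seconds


# Hypothetical run times in seconds, for illustration only (not measured data).
example_3d_only = 1000.0
example_3d_with_ml = 1040.0
print(f"{percent_increase(example_3d_with_ml, example_3d_only):.1f}%")
```

The same function applies to MNIST by passing its run time with and without the 3D CAD VMs.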

From the figure we can see that in the M60-8Q, M60-4Q, and M60-2Q profiles, the increase in run time for 3D CAD when it shares the server with machine learning, compared to when 3D CAD runs by itself, is less than 5%. For the MNIST machine learning workload, the performance penalty due to sharing is under 15% in the same three profiles. Only the M60-1Q profile, which supports up to 24 VMs running 3D CAD plus one VM running MNIST, shows a significant performance penalty due to sharing. If the workloads were instead run sequentially, the total time to complete the tasks would be the sum of the run times for the 3D CAD and machine learning workloads.

A comparison of the total run time for ML and 3D CAD workloads is shown in Figure 2. From the figure, we can see that in the M60-2Q, M60-4Q, and M60-8Q profiles, the total time to completion is lower when the workloads are run concurrently than when they are run sequentially.

⇑ Figure 2. It takes a longer time to sequentially run a 3D plus ML (machine learning) mixed workload when compared to the time to run a concurrent mixed workload. The original time was in seconds, but we have normalized the concurrent time to 1 so that the change in sequential time stands out.
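The normalization used in Figure 2 can be sketched as below. The sequential time is the sum of the two isolated run times, as described above. The concurrent time is taken here as the longer of the two shared run times, which is an assumption: it presumes both workloads start together. The function name and the numbers are hypothetical.

```python
def normalized_sequential_time(
    run_3d_alone: float,
    run_ml_alone: float,
    run_3d_shared: float,
    run_ml_shared: float,
) -> float:
    """Sequential completion time relative to concurrent completion time (= 1)."""
    sequential = run_3d_alone + run_ml_alone
    # Assumption: both workloads start at the same moment, so the concurrent
    # run finishes when the slower of the two finishes.
    concurrent = max(run_3d_shared, run_ml_shared)
    return sequential / concurrent


# Hypothetical run times in seconds, for illustration only (not measured data).
ratio = normalized_sequential_time(1000.0, 300.0, 1040.0, 340.0)
print(f"sequential / concurrent = {ratio:.2f}")
```

A value above 1 means the sequential schedule takes longer than the concurrent one.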

⇑ Figure 3. CPU utilization on server for mixed workload configuration (3D+ML) and for 3D graphics only (3D)

Further, running the workloads concurrently results in higher server utilization, which could result in higher revenues for a cloud service provider. The M60-1Q profile does show a higher time to complete when the workloads are run concurrently than when they are run sequentially, but it achieves very high consolidation (measured as number of VMs per core) and high server utilization. So, if the higher time to complete can be tolerated in the M60-1Q profile, running the workloads concurrently would still achieve higher revenues for a cloud service provider because of the higher server utilization. The CPU utilization on the server for the M60-8Q, M60-4Q, M60-2Q, and M60-1Q profiles with only 3D CAD (3D) and with 3D CAD plus machine learning (3D+ML) is shown in Figure 3, above.
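The consolidation ratio mentioned above can be worked out from Tables 3 and 4. The sketch below assumes the ratio is taken over the 28 physical cores rather than the 56 logical processors; the names are hypothetical.

```python
# Physical cores on the server (Table 3). Assumption: consolidation is counted
# per physical core, not per logical processor.
PHYSICAL_CORES = 28

# Total VMs running concurrently in the mixed-workload runs (Table 4).
TOTAL_VMS = {"M60-8Q": 4, "M60-4Q": 7, "M60-2Q": 13, "M60-1Q": 25}


def vms_per_core(total_vms: int, cores: int = PHYSICAL_CORES) -> float:
    """Consolidation ratio expressed as VMs per core."""
    return total_vms / cores


for profile, count in TOTAL_VMS.items():
    print(f"{profile}: {vms_per_core(count):.2f} VMs per core")
```

On this basis the M60-1Q configuration packs roughly six times as many VMs per core as the M60-8Q configuration.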

Conclusions

Simultaneously running 3D CAD and machine learning workloads reduces the total time to complete the runs with the M60-2Q, M60-4Q, and M60-8Q profiles compared to running the workloads sequentially. This is a radical departure from traditional approaches to scheduling machine learning and 3D CAD workloads.
Running 3D graphics and machine learning workloads concurrently increases server utilization, which could result in higher revenues for a cloud service provider.

References

[1] Machine Learning on VMware vSphere 6 with NVIDIA GPUs
https://blogs.vmware.com/performance/2016/10/machine-learning-vsphere-nvidia-gpus.html

[2] The MNIST Database of Handwritten Digits
http://yann.lecun.com/exdb/mnist/

[3] SPECapc for 3ds Max 2015
https://www.spec.org/gwpg/apc.static/max2015info.html

Related Articles

Episode 3: Performance Comparison of Native GPU to Virtualized GPU and Scalability of Virtualized GPUs for Machine Learning

Performance Comparison of Containerized Machine Learning Applications Running Natively with Nvidia vGPUs vs. in a VM – Episode 4

Sharing GPU for Machine Learning/Deep Learning on VMware vSphere with NVIDIA GRID: Why is it needed? And How to share GPU?