This article is authored by Uday Kurkure <email@example.com>, Lan Vu <firstname.lastname@example.org>, and Hari Sivaraman <email@example.com>.
While machine learning (ML) continues its rapid advance as a source of competitive advantage, from improved quality of products and services to better business decision-making, business and IT leaders are simultaneously challenged to build and maintain the infrastructure required without dramatically increasing costs. In part one of our two-part blog post, we discussed the many benefits GPU virtualization can bring to enterprises deploying ML or high-performance computing (HPC) workloads, leveraging the thousands of parallel processing cores available to achieve the highest levels of efficiency.
In this blog, we’ll build on that discussion with a closer look at the ways in which VMware vSphere and NVIDIA vGPUs can help administrators and users get the most from their HPC and ML workloads, with features that include live migration of vGPU-enabled VMs with no downtime, vGPU sharing, and streamlined scaling of vGPU VMs to optimize infrastructure efficiency and overhead costs.
Live migration of vGPU-enabled VMs with VMware vMotion
Improving data center utilization while also providing highly available, high quality of service infrastructure to enterprise data scientists, CAD users, and those working with compute-intensive ML models is a mission-critical objective for IT administrators. Functions like workload levelling, reinforcing infrastructure for greater resilience, and performing software patches or upgrades must be undertaken with the smallest possible impact to end user productivity.
Using VMware vMotion, administrators can live migrate vGPU-enabled VMs with little to no discernible impact on end users. Stun time (i.e., the period during which a VM is inaccessible during live migration) amounts to a matter of seconds, depending on the size of the memory allocated from the high-speed DRAM on the graphics card.
To demonstrate, VMware created a test bed consisting of a Dell PowerEdge R730 server with an Intel Broadwell-based CPU and an NVIDIA Tesla P40 GPU. The NVIDIA Tesla P40 is designed specifically to support ML and deep learning (DL) workloads, with maximum throughput for improved inference capability. To simulate the live migration effects of vGPU-enabled VMs running computationally demanding workloads, we deployed the SPECapc 3D Max 2015 benchmark across multiple P40 vGPU profiles. The benchmark features 48 tests for comprehensive measurement of modeling, interactive graphics, visual effects, and CPU/GPU performance.
In Figure 1 below, we can see the end-to-end latency impact of performing vMotion live migrations for different vGPU profiles. Figure 2 indicates the latency impact of performing six vMotion migrations concurrently for two different vGPU profiles. The data from our tests suggests that the performance impact on compute-intensive applications is marginal during live migration, even when vGPU-enabled VMs with different vGPU profiles are migrated simultaneously.
Figure 1. vMotion Migration of Different vGPUs Running SPECapc
Figure 2. Six Concurrent vMotion Migrations Running SPECapc
Scaling vGPUs for machine learning workloads
With non-disruptive live migration of vGPUs with VMware vMotion and vGPU sharing enabled by NVIDIA and VMware vSphere, administrators can effectively increase utilization of vGPUs while helping to manage their infrastructure costs downward. Most end users don’t access GPU resources around the clock, so sharing of vGPU profiles according to ML workload is critical to optimizing for utilization.
As a rule, it’s recommended to use lower vGPU profiles for small ML model sizes and use higher vGPU profiles for more demanding training jobs. Lower vGPU profiles are scalable to larger numbers of VMs (e.g., a P40-1q profile allows scaling up to 24 VMs), while the scalability of higher vGPU profiles is more limited (e.g., a P40-12q allows scaling of only 2 VMs). In a typical scenario involving vGPU sharing while running a handwriting recognition training model, training time is only marginally impacted during sharing (i.e., VM1 and VM2 in Figure 3 below) as compared to non-sharing (i.e., VM3 in Figure 3).
Figure 3. Scaling ML VMs with vGPUs in vSphere – typical cases of GPU sharing
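The scaling limits quoted above (24 VMs for a P40-1q profile, 2 VMs for a P40-12q) follow from simple division. Below is a minimal sketch of that arithmetic. It assumes the Tesla P40 offers 24 GB of frame buffer and that the number in a profile name is the per-VM frame buffer in GB. The function name max_vms_per_gpu is hypothetical and is not part of any VMware or NVIDIA tool.

```python
import re

# Assumption: a Tesla P40 has 24 GB of frame buffer to divide among vGPUs.
P40_FRAME_BUFFER_GB = 24


def max_vms_per_gpu(profile: str, frame_buffer_gb: int = P40_FRAME_BUFFER_GB) -> int:
    """Return how many VMs using `profile` fit on one physical GPU.

    Assumes the digits in a name such as "P40-12q" give the per-VM
    frame buffer in GB (hypothetical helper, for illustration only).
    """
    match = re.fullmatch(r"P40-(\d+)q", profile, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized profile name: {profile!r}")
    per_vm_gb = int(match.group(1))
    if per_vm_gb <= 0 or per_vm_gb > frame_buffer_gb:
        raise ValueError(f"profile {profile!r} does not fit in {frame_buffer_gb} GB")
    # Whole VMs only: any leftover frame buffer cannot host another vGPU.
    return frame_buffer_gb // per_vm_gb


if __name__ == "__main__":
    for name in ("P40-1q", "P40-12q"):
        print(f"{name}: up to {max_vms_per_gpu(name)} VMs per GPU")
```

The same division explains the trade-off in the text: a smaller profile leaves room for more VMs on the card, at the cost of less frame buffer per VM.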
Administrators can choose from one of three policies for scheduling vGPU utilization: best effort, equal share, and fixed share. As the name suggests, best effort scheduling shifts the GPU to the next VM whenever the current VM has no GPU tasks. Under the best effort policy, GPU resources are not guaranteed, nor are those resources equally shared across VMs.
Conversely, equal share and fixed share scheduling reserve resources for each VM, making equal sharing of resources possible. Equal share scheduling allocates time slots to powered-on VMs. In fixed share scheduling, the time slot is determined by the vGPU profile associated with the GPU, and it is allocated to the VM even if the VM is not powered on, ensuring fixed quality of service guarantees.
It follows that employing best effort scheduling reduces training times and increases GPU utilization. However, degraded quality of service and “noisy neighbor” effects (i.e., when one VM uses the majority of available resources, causing performance issues for other VMs) are possible under this policy. When quality of service is critical, administrators should rely on an equal or fixed share policy.
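To make the difference between the three policies concrete, here is a toy model of how GPU time might be divided under each one. It is a simplified sketch of the behavior described above, not the actual NVIDIA scheduler. The VM class, the gpu_time_shares function, and the even split among busy VMs under best effort are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    powered_on: bool = True
    has_gpu_work: bool = True


def gpu_time_shares(vms, policy, max_vms_for_profile=None):
    """Return {vm name: fraction of GPU time} under a simplified policy model."""
    if policy == "best_effort":
        # Only VMs with pending GPU tasks compete; idle VMs give up their turn.
        # Assumption: busy VMs split the GPU evenly, which real hardware need not do.
        busy = {vm.name for vm in vms if vm.powered_on and vm.has_gpu_work}
        return {vm.name: (1 / len(busy) if vm.name in busy else 0.0) for vm in vms}
    if policy == "equal_share":
        # Every powered-on VM gets the same slot, whether or not it has work.
        on = {vm.name for vm in vms if vm.powered_on}
        return {vm.name: (1 / len(on) if vm.name in on else 0.0) for vm in vms}
    if policy == "fixed_share":
        # The slot is set by the profile (max VMs per GPU) and is held
        # even for VMs that are powered off.
        if not max_vms_for_profile:
            raise ValueError("fixed_share needs max_vms_for_profile")
        return {vm.name: 1 / max_vms_for_profile for vm in vms}
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    vms = [VM("vm1"), VM("vm2"), VM("vm3", has_gpu_work=False),
           VM("vm4", powered_on=False)]
    for policy in ("best_effort", "equal_share", "fixed_share"):
        print(policy, gpu_time_shares(vms, policy, max_vms_for_profile=4))
```

In this model, best effort hands the whole GPU to the two busy VMs, which is why utilization rises and training time falls. Equal share and fixed share leave slots reserved for idle or powered-off VMs, trading utilization for predictable service.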
Figure 4 below shows the impact on GPU utilization for best effort and equal share policies, while Figure 5 indicates the impact to training time for both policies across a wide range of vGPU profiles.
Figure 4. Best Effort vs Equal Share Scheduling – GPU Utilization
Figure 5. Best Effort vs Equal Share Scheduling – Training Time
Scaling with diverse virtual machines
Offloading compute-intensive applications to GPU resources has a further benefit: CPU resources become significantly more available (see Figure 6 below).
Figure 6. CPU Utilization for Machine Learning Workloads with GPUs
In this case, again relying on the Dell 740 PowerEdge server test bed used in the tests above, we measured the impact on system performance while concurrently running an ML VM, a 3D-CAD VM, and a varying number of knowledge worker VMs. As the ML training workload is a batch job tolerant of higher latencies, we’re able to run up to 96 knowledge worker VMs with only marginal impact on our interactive CAD workload (see Figure 7 below). Similarly, when sharing a host with ML and CAD workloads, latencies for knowledge worker VMs increased by less than 0.5% as the number of VMs scaled from 32 to 96.
Meanwhile, Figure 8 demonstrates that CPU utilization rises from 3% to 68% when employing 96 VMs, leaving plenty of room to grow. The ability to scale diverse VMs running equally diverse workloads enables greater consolidation of VMs per physical host, delivering additional cost savings to data center operators.
Figure 7. Performance Impact of Diverse Workloads Run Concurrently (same physical host)
Figure 8. CPU Utilization Running Concurrent ML, CAD, and Knowledge Worker
Faster time to value, improved utilization, lower costs
VMware vSphere pools ML infrastructure from myriad hardware and software resources, including CPUs and GPUs, VM management and containerization applications, programming languages like CUDA and OpenCL, and training frameworks such as TensorFlow, Caffe2, Horovod and others. These resources are underpinned by a unique feature set exclusive to vSphere that provides for virtualization and management of GPUs (DirectPath I/O, Bitfusion), direct support for NVIDIA vGPUs, workload balancing via VMware DRS, and rules-based autoscaling of virtualized hardware resources.
Together with our customers, we’re helping to make ML a meaningful source of business advantage. The ability to effectively analyze massive amounts of unstructured data and provide actionable intelligence helps businesses respond faster to changing business conditions, expand the universe of automatable tasks, improve network security through real-time monitoring, and deliver innovative products or services, to name only a few. VMware helps customers realize these important benefits while also optimizing their infrastructure for performance, efficiency and cost. In short, VMware vSphere combines the power of GPUs with the data center management benefits of virtualization to speed time to value and improve bottom line results.