
Distributed Machine Learning on vSphere leveraging NVIDIA vGPU and Mellanox PVRDMA

Introduction

 

While virtualization technologies have proven themselves in the enterprise by delivering cost-effective, scalable, and reliable IT computing, High Performance Computing (HPC) has largely remained bound to dedicated physical resources in pursuit of predictable runtimes and maximum performance. VMware has developed technologies to effectively share accelerators for both compute and networking.

VMware, NVIDIA, and Mellanox have collaborated on NVIDIA vGPU integration with VMware vSphere, which enables sharing a GPU across multiple virtual machines while preserving critical vSphere features like vMotion. It is also possible to provision multiple GPUs to a single VM, enabling maximum GPU acceleration and utilization.

vSphere enables RDMA-based high-performance network sharing using Paravirtual RDMA (PVRDMA). PVRDMA also supports vSphere features such as HA and vMotion.

GPUs for Machine Learning

 

With the impending end of Moore's law, the spark fueling the current revolution in deep learning is having enough compute horsepower to train neural-network-based models in a reasonable amount of time.

The needed compute horsepower is derived largely from GPUs, which NVIDIA has been optimizing for deep learning since 2012. The latest GPU architecture from NVIDIA is Turing, available in the T4 as well as the RTX 6000 and RTX 8000 GPUs, all of which support virtualization.

Figure 1: The NVIDIA T4 GPU

The NVIDIA® T4 GPU accelerates diverse cloud workloads, including high-performance computing, deep learning training and inference, machine learning, data analytics, and graphics. Based on the new NVIDIA Turing architecture and packaged in an energy-efficient 70-watt, small PCIe form factor, T4 is optimized for mainstream computing environments and features multi-precision Turing Tensor Cores and new RT Cores. Combined with accelerated containerized software stacks from NGC, T4 delivers revolutionary performance at scale. (Source: NVIDIA)

Figure 2: Layered model showing NVIDIA vGPU components

NVIDIA vGPU technology enables GPU virtualization for any workload and is available through software licenses such as NVIDIA vComputeServer. vComputeServer software enables virtualization of NVIDIA GPUs such as the T4 to power more than 600 GPU-accelerated applications for AI, deep learning, and high-performance computing (HPC), as well as NGC containers. With GPU sharing, multiple VMs can be powered by a single GPU, maximizing utilization and affordability, or a single VM can be powered by multiple virtual GPUs, making even the most compute-intensive workloads possible. With vSphere integration, GPU compute clusters can be managed by vCenter, maximizing GPU utilization and ensuring security.
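As a quick illustration (this snippet is ours, not from the original solution, and it assumes a guest VM with the NVIDIA guest driver, CUDA, and TensorFlow installed), the vGPUs assigned to a VM appear to frameworks as ordinary CUDA devices:

```python
import tensorflow as tf

# Each vGPU presented to the guest by vComputeServer enumerates as a
# regular CUDA device, so frameworks need no vGPU-specific code.
gpus = tf.config.list_physical_devices('GPU')
print(f"Visible GPUs in this VM: {len(gpus)}")
for gpu in gpus:
    print(gpu.name)
```

A VM provisioned with multiple vGPUs simply lists multiple devices here, which is what makes the multi-GPU-per-VM configurations described above transparent to the ML software stack.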

 

High Speed Networking with PVRDMA & RoCE

 

Remote Direct Memory Access (RDMA) provides direct access to the memory of a remote host, bypassing the operating system and CPU. This can boost network and host performance, reducing latency and CPU load while providing higher bandwidth. RDMA compares favorably to TCP/IP, which adds latency and consumes significant CPU and memory resources. High Performance Computing (HPC) workloads have traditionally run on bare-metal, non-virtualized clusters, and virtualization was often seen as an additional layer that degrades performance. However, performance studies have shown that virtualization has minimal impact on HPC applications.

RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 is an internet layer protocol which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, the protocol can also be used on a traditional or non-converged Ethernet network. (Source: Wikipedia)

 

Mellanox ConnectX®-5

 

Intelligent ConnectX-5 EN adapter cards introduce new acceleration engines for maximizing High Performance, Web 2.0, Cloud, Data Analytics and Storage platforms. ConnectX-5 supports dual ports of 100Gb/s Ethernet connectivity, sub-700 nanosecond latency, and a very high message rate, plus PCIe switch and NVMe over Fabrics offloads, providing the highest performance and most flexible solution for the most demanding applications and markets. (Source: Mellanox)

 

Need for Distributed Machine Learning with Horovod

 

There is considerable pressure to reduce the time it takes to develop a new machine learning model even as datasets grow in size, so there is an increasing need for distributed machine learning to shorten training and model development time. Horovod is an open-source distributed training framework that supports popular machine learning frameworks such as TensorFlow, Keras, PyTorch, and MXNet. Horovod distributes deep learning training using a technique called ring-allreduce, while requiring minimal modification to user code, as the sketch below illustrates.
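As a minimal sketch (the model and dataset here are toy placeholders, not the workload from the proof of concept), the typical Horovod additions to a TensorFlow/Keras script are: initialize Horovod, pin each process to one local GPU, scale the learning rate, wrap the optimizer, and broadcast initial weights from rank 0:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin this process to one local (v)GPU.
hvd.init()
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Toy data and model, purely for illustration.
(x, y), _ = tf.keras.datasets.mnist.load_data()
dataset = (tf.data.Dataset.from_tensor_slices((x[..., None] / 255.0, y))
           .shuffle(10_000).batch(128))

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the worker count and wrap the optimizer so
# gradients are averaged across workers with ring-allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

# Broadcast rank 0's initial weights so every worker starts identically.
model.fit(dataset, epochs=1,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)
```

A script like this is then launched across the cluster with Horovod's runner, for example `horovodrun -np 8 -H vm1:4,vm2:4 python train.py` for two VMs with four GPUs each (host names and GPU counts are illustrative).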

Figure 3: Neural Networks can benefit from the use of GPUs

 

High Performance Virtual Infrastructure for Distributed ML

 

vSphere supports virtualization of the latest hardware from NVIDIA (the T4 GPU) and Mellanox (the ConnectX-5 RoCE adapter). There is potential to combine the benefits of vSphere with the capabilities of these high-performance hardware accelerators to build a compelling solution for Horovod-based machine learning. VMware, NVIDIA, and Mellanox teamed up to develop a proof of concept for a high-performance Horovod-based machine learning environment.
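As one hedged sketch of the wiring (not taken from the paper; the device and interface names are placeholders that depend on how PVRDMA is configured in the guest), Horovod's NCCL backend can be pointed at the RDMA-capable interface through standard NCCL environment variables set before the first allreduce:

```python
import os

# Illustrative values only; actual RDMA device and NIC names depend on the
# guest's PVRDMA/RoCE setup. Set these before Horovod triggers NCCL
# initialization, e.g. at the very top of the training script.
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")         # RDMA device for allreduce traffic
os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens192")  # NIC for NCCL bootstrap traffic
os.environ.setdefault("NCCL_DEBUG", "INFO")            # log which transport NCCL selects
```

Running with NCCL_DEBUG=INFO is a simple way to confirm at startup whether the RDMA path, rather than plain TCP sockets, was actually selected.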

The solution was successfully deployed and is documented in detail in this paper.