In this two-part series, we look at leveraging NVIDIA GPUs for machine learning in vSphere virtualized environments with Bitfusion. In this first part, we discuss the options for sharing NVIDIA GPUs, the setup of the infrastructure and the proposed test cases. In part two we will look at the results of the testing.

Introduction

GPUs are increasingly becoming an important part of compute infrastructure. Dedicating GPUs to bare-metal servers can lead to underutilization and inefficient use. The benefits of virtualization can be extended to sharing GPUs across virtual machines, with the goal of doing so without sacrificing performance.

There is an increasing need in machine learning to use GPUs to speed up large computations. Traditionally, GPUs have been dedicated to individual users, leading to a lack of flexibility and reduced utilization. Virtualization, which has a reputation for facilitating the sharing of hardware resources, can also be leveraged to share GPUs. In this study we look at the performance impact of sharing GPUs.

Sharing NVIDIA GPUs in VMware Virtualized Environments

NVIDIA GPUs can currently be shared in two different ways in VMware virtual environments:

  1. NVIDIA GRID vGPU

NVIDIA virtual GPU (vGPU) gives organizations the capability to virtualize GPU workloads in modern business applications. NVIDIA GRID software shares the power of GPUs across multiple virtual machines, so a GPU can be accessed simultaneously by multiple deep learning applications and users. Note that while GPU memory is partitioned to create portions of a GPU, each portion is given time-sliced access to all of the GPU’s compute resources. This allows for better utilization of GPU resources in virtualized environments.

  2. Bitfusion

Bitfusion on VMware vSphere provides the capability to share partial, full or multiple GPUs. It makes GPUs a first-class resource that can be abstracted, partitioned, automated and shared much like traditional compute resources. GPU accelerators can be partitioned into multiple virtual GPUs of any size and accessed remotely by VMs over the network. With Bitfusion, GPU accelerators become part of a common infrastructure resource pool, available for use by any VM in a vSphere-based environment.
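For the GRID vGPU option above, the split between partitioned memory and time-sliced compute can be illustrated with a toy Python model. This is a sketch, not NVIDIA's scheduler: it assumes equal-size memory portions and a plain round-robin time slice, and the names `VGpu`, `partition_memory` and `time_slice_schedule` are hypothetical.

```python
from dataclasses import dataclass
from itertools import cycle, islice


@dataclass
class VGpu:
    """Toy vGPU: owns a fixed slice of GPU memory but no fixed slice of compute."""
    name: str
    memory_gb: float


def partition_memory(total_gb, count):
    """Split the GPU's memory into `count` equal, dedicated portions."""
    size = total_gb / count
    return [VGpu(f"vgpu{i}", size) for i in range(count)]


def time_slice_schedule(vgpus, slices):
    """Round-robin: in each time slice one vGPU gets all of the GPU's compute."""
    return [v.name for v in islice(cycle(vgpus), slices)]


# Hypothetical 16 GB GPU split four ways: memory is divided, compute is shared in turn.
vgpus = partition_memory(16, 4)
print([v.memory_gb for v in vgpus])     # [4.0, 4.0, 4.0, 4.0]
print(time_slice_schedule(vgpus, 6))    # ['vgpu0', 'vgpu1', 'vgpu2', 'vgpu3', 'vgpu0', 'vgpu1']
```

Each vGPU keeps its own memory for the whole run, while the compute engine rotates among them; real GRID deployments offer several scheduling policies, which this model does not attempt to reproduce.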

In this study, we will leverage Bitfusion FlexDirect for sharing GPUs on vSphere to run machine learning applications.

Bitfusion VMware Reference Architecture

Bitfusion is used as the mechanism for sharing GPUs across virtual machines. Bitfusion uses a client-server architecture. The Bitfusion server component runs in a virtual machine located on the physical server containing the NVIDIA GPU cards. The Bitfusion client virtual machine can run on the same ESX host as the Bitfusion server, or on another host in the virtual infrastructure connected via a high-speed Ethernet network or RDMA-based interconnect.

The Bitfusion FlexDirect client runs as a user-space application within each VM instance, without the need for changes or special software in the ESXi hypervisor or the AI applications. On the GPU-accelerated server VM, FlexDirect also runs as a transparent software layer and exposes the individual physical GPUs as a pooled resource to be consumed by client VMs (the client VMs do not need GPUs of their own). Bitfusion FlexDirect can be used to allocate GPU resources and attach them dynamically over the network. When the application completes, the shared GPU resources are returned to the resource pool.

Figure 1: Bitfusion FlexDirect and VMware Reference Architecture
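The allocate-and-return lifecycle described above can be sketched as a small Python model. `GpuPool`, `allocate` and `release` are hypothetical names for illustration only; they are not the FlexDirect API, and the model ignores networking and partial GPUs.

```python
class GpuPool:
    """Hypothetical model of a server VM's GPUs pooled for client VMs.

    Illustrates allocate-on-demand and return-to-pool only; this is not
    the Bitfusion FlexDirect API.
    """

    def __init__(self, gpu_ids):
        self.free = list(gpu_ids)   # GPUs any client VM may use
        self.attached = {}          # client VM name -> GPUs it currently holds

    def allocate(self, client, count):
        """Attach `count` free GPUs to a client VM (which needs no local GPU)."""
        if count > len(self.free):
            raise RuntimeError(
                f"{client} requested {count} GPUs but only {len(self.free)} are free"
            )
        gpus, self.free = self.free[:count], self.free[count:]
        self.attached.setdefault(client, []).extend(gpus)
        return gpus

    def release(self, client):
        """Detach all of a client's GPUs when its application completes."""
        self.free.extend(self.attached.pop(client, []))


pool = GpuPool(["gpu0", "gpu1", "gpu2", "gpu3"])
print(pool.allocate("client-vm-a", 2))  # ['gpu0', 'gpu1']
pool.release("client-vm-a")
print(pool.free)                        # ['gpu2', 'gpu3', 'gpu0', 'gpu1']
```

Keeping the pool on the server side is what lets any client VM borrow GPUs without owning one; a request larger than the free pool is rejected rather than partially filled in this sketch.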

Use Cases for Bitfusion FlexDirect-Based Solutions on vSphere

Bitfusion use cases on vSphere can be broadly categorized into three types.

  1. Dynamic and Remote Attached GPUs

Bitfusion FlexDirect allows GPUs to be attached remotely and dynamically to client VMs, as shown in Figure 2. GPUs can also be dynamically detached after use.

Figure 2: Dynamic and Remote Attached GPUs

  2. Partial GPUs

Bitfusion FlexDirect can be used to slice GPUs into partial GPUs of unequal size. This suits machine learning well, because the demands of each user or workload type are unpredictable and call for different levels of performance and response time. The GPUs are sliced using GPU memory as a proxy. For instance, given a GPU with 16 GB of memory, one could use FlexDirect to create two 4 GB partial GPUs and four 2 GB partial GPUs. This allows the same GPU to be shared by multiple users in a multi-tenant environment, as shown in Figure 3.

Figure 3: Bitfusion FlexDirect Partial GPUs

  3. Dynamic and Remote Attached Partial GPUs

Bitfusion FlexDirect can also be leveraged to attach partial GPUs remotely and dynamically. A group of GPUs can be reconfigured into partial GPUs of any size and combination, which can then be attached remotely to client VMs, as shown in Figure 4.

Figure 4: Bitfusion FlexDirect Remote Partial GPUs
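The partial-GPU arithmetic from the 16 GB example, and the remote placement of partial GPUs across a group of GPUs, can be sketched in Python. Both functions are hypothetical illustrations: `slice_gpu` only checks that the memory slices fit, and `place_partial_gpus` uses a simple first-fit rule chosen here for clarity, not necessarily the placement policy FlexDirect uses.

```python
def slice_gpu(total_gb, slice_sizes_gb):
    """Carve one GPU into partial GPUs, using memory as the proxy for size.

    Returns the fraction of the GPU each slice represents; raises ValueError
    if the slices need more memory than the GPU has.
    """
    used = sum(slice_sizes_gb)
    if used > total_gb:
        raise ValueError(f"slices need {used} GB but the GPU has {total_gb} GB")
    return [size / total_gb for size in slice_sizes_gb]


def place_partial_gpus(free_gb_per_gpu, requests):
    """First-fit placement of partial-GPU requests across a group of GPUs.

    free_gb_per_gpu: {gpu_id: free memory in GB}
    requests:        list of (client_vm, size_gb); client names assumed unique
    Returns {client_vm: gpu_id}; raises RuntimeError if a request fits nowhere.
    """
    free = dict(free_gb_per_gpu)  # copy so the caller's dict is not modified
    placement = {}
    for client, size in requests:
        for gpu, avail in free.items():
            if size <= avail:
                free[gpu] = avail - size
                placement[client] = gpu
                break
        else:  # no GPU had enough free memory
            raise RuntimeError(f"no GPU has {size} GB free for {client}")
    return placement


# The text's example: a 16 GB GPU as two 4 GB and four 2 GB partial GPUs
# (2 * 4 + 4 * 2 = 16, so the GPU is fully allocated).
fractions = slice_gpu(16, [4, 4, 2, 2, 2, 2])  # 0.25 for each 4 GB slice, 0.125 for each 2 GB slice

# Hypothetical client VMs drawing partial GPUs from two 16 GB GPUs.
placement = place_partial_gpus(
    {"gpu0": 16, "gpu1": 16},
    [("vm-a", 8), ("vm-b", 6), ("vm-c", 4), ("vm-d", 12)],
)
```

With these made-up inputs, vm-a and vm-b land on gpu0 (8 + 6 = 14 GB), and vm-c and vm-d land on gpu1 (4 + 12 = 16 GB), because vm-c's 4 GB no longer fits in the 2 GB left on gpu0.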

Testing Methodology

In this study, we ran machine learning workloads remotely with Bitfusion, using both a full GPU and shared GPUs, and compared the results with the same workloads run locally without Bitfusion. We tested the following cases:

  • Baseline tests are run in a VM with a GPU configured directly in passthrough mode. This is the best-case scenario, with no Bitfusion remote networking overhead.
  • Local tests run the Bitfusion client and server in the same GPU VM. This is the next best case after the baseline, with direct backplane connectivity for the client/server communication.
  • Local partial tests run multiple Bitfusion clients and the server in the same GPU VM, with each client configured to use an equal share of the GPU.
  • In the remote test case, the client VM is on a different ESX host from the server VM and is connected via 100 Gb/s RoCE (RDMA over Converged Ethernet) or 10 Gb/s Ethernet (vmxnet3), with the GPU configured in passthrough mode. Performance was captured for both connection types.
  • Remote partial tests use multiple VMs, each running a client, to share a GPU remotely and equally.
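To put the two remote interconnects in perspective, the idealized time to move a payload at line rate can be computed directly. This is a back-of-the-envelope sketch, not a model of Bitfusion traffic: it ignores protocol overhead, latency and the RDMA-versus-TCP difference, and `transfer_seconds` is a hypothetical helper.

```python
def transfer_seconds(payload_gb, link_gbps):
    """Idealized lower bound on the time to move a payload over a link.

    payload_gb is in gigabytes (10**9 bytes) and link_gbps in gigabits per
    second; protocol overhead and latency are ignored, so real transfers
    take longer.
    """
    if link_gbps <= 0:
        raise ValueError("link speed must be positive")
    return payload_gb * 8 / link_gbps  # 8 bits per byte


# Moving 1 GB: about 0.08 s at 100 Gb/s (RoCE) vs 0.8 s at 10 Gb/s (vmxnet3).
for name, gbps in [("100 Gb/s RoCE", 100), ("10 Gb/s vmxnet3", 10)]:
    print(f"{name}: {transfer_seconds(1, gbps):.2f} s per GB")
```

How much this tenfold gap matters in practice depends on how much data a workload moves between client and server and on latency, which is what the remote test cases measure.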

The test cases are summarized in Table 1. For test cases 3 and 4, tests were also run over vmxnet3 with latency-sensitivity tuning. For the partial test cases 2 and 4, the performance of all clients sharing the GPU is aggregated for comparison with the baseline case.

Table 1: Bitfusion Test Cases
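The aggregation used for the partial test cases amounts to a simple ratio: the sum of the per-client results divided by the baseline result. The sketch below assumes a higher-is-better throughput metric (for example, images per second); the metric and the function name `aggregate_vs_baseline` are assumptions, and no measured values are shown here.

```python
def aggregate_vs_baseline(client_throughputs, baseline_throughput):
    """Combine the clients sharing one GPU and compare them with the baseline.

    client_throughputs: per-client results for one partial test run
    baseline_throughput: result of the passthrough baseline VM
    Returns the ratio; 1.0 means the clients together matched the baseline.
    """
    if baseline_throughput <= 0:
        raise ValueError("baseline throughput must be positive")
    return sum(client_throughputs) / baseline_throughput
```

A ratio below 1.0 would indicate aggregate overhead from sharing, while a ratio close to 1.0 would indicate the GPU is shared without significant loss in total throughput.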

Test Infrastructure

The test infrastructure consisted of a four-node Dell R730 high-performance cluster, with one NVIDIA P100 card per node configured in passthrough mode. The cluster nodes are attached via a Brocade 32 Gb/s Fibre Channel fabric to a Pure Storage M50 array, which is used as the primary VMFS storage for all virtual machines. The Ubuntu virtual machines used in the testing require high-performance NFS storage for file sharing and coordination; a Pure FlashBlade array with two 40 Gb/s uplinks was used as the NFS repository.


Table 2: Test Hardware Components


Table 3: Test Software Components

In this part 1, we have looked at the options for sharing NVIDIA GPUs and the deployment details of the machine learning components leveraging TensorFlow and NVIDIA GPUs in a virtual environment. In part 2 we will look at running the machine learning benchmark, the testing and the results.

Authors: Na Zhang, Mohan Potheri