
This is part 4 of a series of blog articles that give technical details of the different options available to you for setting up GPUs for compute workloads on vSphere.

Part 1 of this series presents an overview of the various options for using GPUs on vSphere

Part 2 describes the DirectPath I/O (Passthrough) mechanism for GPUs

Part 3 gives details on setting up the NVIDIA GRID technology for GPUs

In this fourth article in the series, we describe the technical setup process for the BitFusion FlexDirect product on vSphere. BitFusion is a partner company of VMware that produces and markets the FlexDirect product for optimizing the use of GPUs across multiple virtual machines.

Figure 1: Architecture example for a BitFusion FlexDirect Setup on Virtual Machines hosted on vSphere

 

BitFusion’s FlexDirect product increases the flexibility with which you can utilize your GPUs on vSphere. It does so by allowing your physical GPUs to be allocated, in part or as a whole, to applications running in virtual machines. Those consumer virtual machines may be hosted on servers that do not themselves have physical GPUs attached. FlexDirect achieves this by remoting CUDA instructions to other servers over the network. A more complete description of the technical features of the BitFusion product is given here.

BitFusion FlexDirect may be used to dedicate one or more full GPUs to a virtual machine, or to share a single physical GPU across multiple virtual machines. The VMs sharing a physical GPU need not take equal shares in it – their share sizes can differ. The share of the GPU is specified by the application invoker at startup time. BitFusion also allows a single consumer VM to use multiple physical GPUs at once.

The BitFusion FlexDirect Architecture

 

BitFusion FlexDirect uses a client-server architecture, as seen in figure 1, where the server-side VMs provide the GPU resources, while the client-side VMs provide the locations for end-user applications to run.  The server-side GPU-enabled VMs are referred to as the “GPU Cluster”. An individual node/VM may play both roles and have client and server-side execution capability locally, if required.

The client-side and server-side VMs may also be hosted on different physical servers as shown above. The VMs can be configured to communicate over a range of different types of network protocols, including TCP/IP and RDMA. RDMA may be implemented using Infiniband or RoCE (RDMA over Converged Ethernet).  These different forms of networking have been tested by VMware and BitFusion engineers working together on vSphere and the results of those tests are available here.

The BitFusion software can be used to remove the need for physical locality of the GPU device to the consumer – the GPU can be remotely accessed on the network. This approach allows for pooling of your GPUs on a set of servers as seen in figure 1. GPU-based applications can then run on any node/VM in the cluster, whether it has a physical GPU attached to it or not.

In the example architecture shown in figure 1, we show one virtual machine per ESXi host server for illustration purposes. Multiple virtual machines of these types can live on the same host server. On the server side, hosts with multiple GPU cards can also be accommodated.

FlexDirect Installation and Setup

 

For the server-side VMs shown in figure 1, i.e. those on the GPU-bearing hosts, the local VM accesses the GPU card using either DirectPath I/O (the passthrough method), fully described in the second article in this series, or the NVIDIA GRID vGPU method, described in the third article.

The details of these methods of GPU use on vSphere will not be repeated here. We will assume that one of these setups has already been done for any GPU that is participating in the BitFusion installation.

The product that is installed onto the guest operating system of client-side and server-side VMs for a BitFusion setup is called “FlexDirect”. This software operates in user mode within the guest operating system of the VMs and needs no special drivers.

The FlexDirect Installation Process

BitFusion has client-side and server-side FlexDirect processes and helpers that are installed as follows. These installation steps are also given here.

1. Ensure you have the appropriate BitFusion License Key

If you do not have a current license key, then contact BitFusion to acquire one.

 

2. Install the FlexDirect CLI Program in your Ubuntu or CentOS Linux VM (you will need internet access for this command):

wget -O - getflex.bitfusion.io | sudo bash

 

This command downloads the shell script to install the FlexDirect material and passes that script to be executed in a shell by the root user.

 

3. Check the FlexDirect Location

To find where the FlexDirect CLI program has been installed, use the command:

which flexdirect

 

Output

/usr/bin/flexdirect

 

4. Initialization

To initialize the FlexDirect system, issue the following command. You will be prompted for the license key you acquired earlier; enter it when prompted. If you do not yet have a license key, you may email support@bitfusion.io to request access.

sudo flexdirect init

 

Output

License has been initialized. Attempting to refresh…

Refresh successful. License is ready for use.

Flexdirect is licensed and is ready for use.

 

5. Launch the FlexDirect Resource Scheduler on the GPU-enabled Servers

 

To launch the FlexDirect resource scheduler daemon on the GPU servers in your cluster, use the following command on all the server-side VMs that have a GPU on their host:

nohup flexdirect resource_scheduler &> /dev/null &

 

Example Output


Running resource scheduler on 0.0.0.0:56001 using 1 GPUs (0) with 1024 MiB of memory each

The FlexDirect daemon provides the resource scheduling function for GPU users on your cluster and is also known as SRS.
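As an alternative to the nohup invocation above, the scheduler could be supervised by systemd so that it starts at boot and restarts on failure. This is a sketch only: the unit name is hypothetical, and the binary path is the one reported by "which flexdirect" in step 3; adjust both for your environment.

```ini
# /etc/systemd/system/flexdirect-srs.service (hypothetical unit name)
[Unit]
Description=BitFusion FlexDirect resource scheduler (SRS)
After=network-online.target
Wants=network-online.target

[Service]
# Path from "which flexdirect" in step 3; adjust if yours differs.
ExecStart=/usr/bin/flexdirect resource_scheduler
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

You would then enable it with "sudo systemctl enable --now flexdirect-srs".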

The syntax of the various parameters to the “flexdirect” command is given on the BitFusion usage site.

You may also run the “flexdirect” command on its own to see the various parameters and options available:

flexdirect

 

To see the GPUs that are available, either from a suitably configured client-side VM or server-side VM, type the command

flexdirect list_gpus

 

Example output for a single server VM with one GPU enabled is shown in figure 2 below.

Figure 2: Example output from the “flexdirect” command to show the available GPUs

 

6. Test an Application on the BitFusion FlexDirect Server-side

You may now execute the FlexDirect program on the GPU-enabled VM with a named application, as a client would, in order to test it. An example of such a test command, using the nvidia-smi utility, is:

flexdirect run -n 1 -m 2048 nvidia-smi | more

 

Figure 3: The “flexdirect run” command showing output from a health check run

You may replace the “nvidia-smi” string in the above command with your own application’s executable name in order to run it on the appropriate number of GPUs (the -n parameter) with a suitable GPU memory allocation (the -m parameter). You may also choose to use a fraction of a GPU by issuing a command such as

flexdirect local -p 0.5 -m 1024 nvidia-smi | more

 

where the parameter “-p 0.5” indicates a partial share of one half of the GPU power for this application.
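If you launch jobs this way often, the share and memory parameters can be wrapped in a small helper so that every launch uses the same conventions. This is a minimal sketch: the function name run_on_gpu is an assumption of ours, and it simply forwards its arguments to the “flexdirect run” invocation shown above.

```shell
# run_on_gpu: hypothetical convenience wrapper around "flexdirect run".
# Usage: run_on_gpu <num_gpus> <mem_mib> <command> [args...]
run_on_gpu() {
  local num_gpus=$1 mem_mib=$2
  shift 2
  flexdirect run -n "$num_gpus" -m "$mem_mib" "$@"
}

# Example (uncomment on a configured client VM):
# run_on_gpu 1 2048 nvidia-smi
```

A wrapper like this keeps the GPU count and memory allocation in one place, rather than scattered through job scripts.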

 

7. Execute Health Checks

You can also check the health of the server-side process using the command:

flexdirect localhealth

Figure 4: Output from the “flexdirect localhealth” command

The “flexdirect health” command may also be executed in the same way on a client-side VM that does not have a GPU attached, producing a more concise output. Note that the hostname of this VM includes “cpu”, indicating that, unlike the server-side VM named with “gpu”, it does not have a GPU attached to it.

Figure 5: Executing the “flexdirect health” command on the Client-side VM

 

8. Get Data on the GPU and Driver

To get a concise view of the health of the driver and the state of the GPU, use the command:

flexdirect smi

 

Figure 6: The “flexdirect smi” command output

 

9. Install the FlexDirect Daemon on the Client-side VMs

You install the client side of the BitFusion FlexDirect product using the same commands as for the server-side VMs:

wget -O - getflex.bitfusion.io | sudo bash

sudo flexdirect init

 

10. Configure the Client-side VM for Access to the Server-side GPU VMs

Once the FlexDirect software is installed, apply the license as before and then configure the /etc/bitfusionio/servers.conf file on the client side with the IP address or hostname of the server-side VM that this client VM will talk to. This configuration file has one IP address or hostname per line, and several server VMs may be listed in the file.

Figure 7: A basic example of the /etc/bitfusionio/servers.conf file on the client-side VM
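As a sketch, the file can be generated from your own list of server-side addresses; the IP address and hostname below are hypothetical placeholders. Writing to a local file first and then copying it into place keeps only the final step running as root.

```shell
# Hypothetical GPU-server addresses; one IP or hostname per line.
printf '%s\n' 192.168.100.11 gpu-vm-02.example.com > servers.conf
cat servers.conf

# Then install it (requires root):
# sudo mkdir -p /etc/bitfusionio && sudo cp servers.conf /etc/bitfusionio/
```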

 

11. Test the Connection between the Client and Server-side VMs

You can now run your own application using the flexdirect program, which executes the GPU-specific parts on the server side, or repeat the command we used previously on the server side to execute the nvidia-smi tool:

flexdirect run -n 1 -m 2048 nvidia-smi

to produce the same result.

Figure 8: Client-side VM health check output using the “flexdirect” and “nvidia-smi” commands

You can now use any portion of a GPU, or a set of GPUs, on the server VMs to execute your job. You can launch the application from your client-side or server-side VMs, as seen earlier. The client-side VMs can be moved from one vSphere host server to another using vSphere vMotion and DRS.

If your server-side VMs are hosted on host servers with vSphere 6.7 update 1 or later, and are configured using NVIDIA GRID, then they may also be moved across their vSphere hosts using vMotion.

 

Conclusion

This article demonstrates the setup and initial testing process for the BitFusion FlexDirect system for use of GPUs on virtual machines running on vSphere. Using FlexDirect, a set of non-GPU-enabled virtual machines (client VMs) may make use of sets of GPU-enabled virtual machines that may be remote from the clients across the network. The GPU devices no longer need to be local to their consumers.

The FlexDirect software allows applications to make use of different shares of a single GPU or multiple GPUs at once. These combinations of GPU usage have been tested in the engineering labs at VMware and their performance on that infrastructure has been thoroughly documented in the references given below. The FlexDirect method is an excellent choice among the various options for GPU use on vSphere, as documented in this series of articles.

References

BitFusion Documentation

Machine Learning leveraging NVIDIA GPUs with BitFusion on VMware vSphere – Part 1

Machine Learning leveraging NVIDIA GPUs with BitFusion on VMware vSphere – Part 2

Using GPUs with Virtual Machines on vSphere – Part 1: Overview

Using GPUs with Virtual Machines on vSphere – Part 2: DirectPath I/O

Using GPUs with Virtual Machines on vSphere – Part 3: NVIDIA GRID