
Network Attached AI with Bitfusion, VMware and Mellanox

This blog was originally written by Subbu Rama of Bitfusion on Nov 13, 2018.

With Bitfusion, VMware and Mellanox, GPU accelerators can now be part of a common infrastructure resource pool, attached over the network and available to any virtual machine in the data center in full or partial configurations. The solution works with any type of GPU server and any network transport, such as TCP, RoCE or InfiniBand. IT can now pool resources and offer elastic GPUs as a service, much like network-attached storage, dynamically assigning GPU resources based on an organization’s business needs and priorities.
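To make the idea of a shared pool with full and partial GPU assignments concrete, here is a minimal Python sketch of a first-fit allocator. It is only a toy model for illustration: the `GPU` and `GPUPool` classes, the server name and the first-fit policy are all assumptions made here, and none of it reflects how Bitfusion actually schedules or partitions GPUs.

```python
from dataclasses import dataclass, field
from fractions import Fraction


@dataclass
class GPU:
    """One physical GPU in the pool; `free` is the unallocated share (0 to 1)."""
    server: str
    index: int
    free: Fraction = Fraction(1)


@dataclass
class GPUPool:
    """Toy model of a shared pool of network-attached GPUs (hypothetical)."""
    gpus: list = field(default_factory=list)

    def allocate(self, share):
        """Reserve `share` of a single GPU (e.g. 1 or 1/2) using first fit."""
        share = Fraction(share)
        if not 0 < share <= 1:
            raise ValueError("share must be in (0, 1]")
        for gpu in self.gpus:
            if gpu.free >= share:
                gpu.free -= share
                return gpu, share
        raise RuntimeError("no GPU in the pool has enough free capacity")

    def release(self, gpu, share):
        """Return a previously allocated share to the pool."""
        gpu.free = min(Fraction(1), gpu.free + Fraction(share))


# Hypothetical pool: one server with two GPUs.
pool = GPUPool([GPU("gpu-server-1", 0), GPU("gpu-server-1", 1)])
a, _ = pool.allocate(Fraction(1, 2))  # first half of GPU 0
b, _ = pool.allocate(Fraction(1, 2))  # second half of GPU 0
c, _ = pool.allocate(1)               # all of GPU 1
```

Releasing a share with `pool.release(c, 1)` makes that GPU available again to the next requester, which is the "elastic" behavior the paragraph above describes.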

 

 

Mellanox and Bitfusion set up the infrastructure shown in Figure 1 to emulate a real-life elastic AI infrastructure. The test bed included a cluster of Dell R740 GPU servers and Dell R640 CPU servers (no GPUs), a Mellanox SN2700 100GbE switch and Mellanox ConnectX-5 cards. On the clients, VMware vSphere ESXi 6.5 was set up with Ubuntu 16.04 as the VM operating system, along with CUDA 9.1, cuDNN 7.3 and TensorFlow 1.9.

Bitfusion FlexDirect runs in user space and doesn’t require any changes to the OS, drivers, kernel modules or AI frameworks. FlexDirect can also support a heterogeneous cluster with mixed operating systems: for instance, a FlexDirect client running on Ubuntu can connect to a FlexDirect server running on CentOS, and vice versa.

 

 

Figure 2a compares the performance of GPUs attached remotely over 100Gbps RoCE against running the same workload locally on the GPU system. Figure 2b shows the same comparison over 10Gbps RoCE. Figure 3a compares the performance of multiple network-attached fractional (half) GPUs against the full physical GPU, over 100Gbps RoCE. Figure 3b shows the same comparison over 10Gbps RoCE.
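One simple way to express comparisons like these is as a ratio of remote to local throughput, and of the summed throughput of fractional GPUs to that of one full GPU. The Python sketch below shows that arithmetic. The function names are made up for this example, and the numbers are placeholders chosen only to exercise the code; they are not the measurements from the figures.

```python
def relative_performance(remote_throughput, local_throughput):
    """Remote throughput as a fraction of local throughput (1.0 means parity)."""
    if local_throughput <= 0:
        raise ValueError("local throughput must be positive")
    return remote_throughput / local_throughput


def fractional_aggregate(fraction_throughputs, full_gpu_throughput):
    """Summed throughput of fractional GPUs relative to one full GPU."""
    return relative_performance(sum(fraction_throughputs), full_gpu_throughput)


# Placeholder values in images/sec, for illustration only. NOT measured results.
local = 200.0          # workload run directly on the GPU server
remote = 190.0         # same workload on a network-attached GPU
halves = [98.0, 97.0]  # two jobs, each on half of one GPU

print(f"remote vs local: {relative_performance(remote, local):.1%}")
print(f"two half GPUs vs one full GPU: {fractional_aggregate(halves, local):.1%}")
```

A ratio close to 1.0 in either comparison corresponds to what the post calls near-native performance.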

 

Bitfusion FlexDirect with VMware and Mellanox demonstrates that network-attached full and fractional GPUs achieve near-native performance across the suite of benchmarks.

 

 
