This article points you to a recent webinar that VMware produced on executing distributed machine learning with TensorFlow and Horovod, running on a set of VMs across multiple vSphere host servers.
Many machine learning problems are tackled today on a single host server (with a collection of VMs on that host). However, when your ML model or training data grows too large for one host to handle, or your GPU capacity is spread across several physical host servers and VMs, distributed training is the mechanism used to tackle that scenario.
The VMware webinar first introduces machine learning concepts in general. It then gives a short description of Horovod for distributed training and explains the importance of low-latency networking between the nodes in the distributed setup, provided here by Mellanox RDMA over Converged Ethernet (RoCE) technology.
Source: “Model Parallelism in Deep Learning is NOT what You Think” by Saliya Ekanayake
One approach to distributing the ML training work is called “data parallelism”: each machine receives a subset of the overall training data and trains a common model on that part of the data, independently of the other machines in the distributed cluster. The results of model training (the weights, or gradients) must then be reconciled across all the machines, and network communication between the participating machines becomes an important performance concern at that point.
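To make the data-parallel idea concrete, here is a minimal, single-process sketch (not taken from the webinar): two model replicas stand in for two machines, each computes gradients on its own shard of placeholder data, and the gradients are then averaged, which is the reconciliation step that an allreduce performs over the network in a real cluster.

```python
# Conceptual, single-process sketch of data parallelism. All shapes and data
# are placeholders; in the lab each replica would live on its own VM/GPU.
import tensorflow as tf

NUM_WORKERS = 2  # assumed number of participating workers

def make_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1),
    ])

replicas = [make_model() for _ in range(NUM_WORKERS)]
for replica in replicas[1:]:                    # all replicas start from identical weights
    replica.set_weights(replicas[0].get_weights())

# Placeholder training data, split into one shard per worker.
x, y = tf.random.uniform([32, 8]), tf.random.uniform([32, 1])
shards = [(x[i::NUM_WORKERS], y[i::NUM_WORKERS]) for i in range(NUM_WORKERS)]

loss_fn = tf.keras.losses.MeanSquaredError()
per_worker_grads = []
for replica, (xs, ys) in zip(replicas, shards): # independent local gradient computation
    with tf.GradientTape() as tape:
        loss = loss_fn(ys, replica(xs, training=True))
    per_worker_grads.append(tape.gradient(loss, replica.trainable_variables))

# Reconciliation: average the gradients across workers (what an allreduce
# does over the network in a real distributed run).
avg_grads = [tf.reduce_mean(tf.stack(grads), axis=0)
             for grads in zip(*per_worker_grads)]
```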
Another approach is “model parallelism”, where parts of your model are split apart and placed on different machines for execution. This brings with it the issue of interdependencies between the parts of the model, which must be carefully considered.
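As a small illustration of the idea (on a single host across two devices, rather than across machines), the sketch below assumes two visible GPUs and pins different parts of a model to different devices, so the activations of the first part must cross a device boundary to reach the second part.

```python
# Minimal sketch of model parallelism on one host, assuming two visible GPUs;
# with TensorFlow's default soft device placement it will still run on fewer.
import tensorflow as tf

inputs = tf.keras.Input(shape=(224, 224, 3))

with tf.device("/GPU:0"):                        # first part of the model on one device
    x = tf.keras.layers.Conv2D(64, 3, activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling2D()(x)

with tf.device("/GPU:1"):                        # second part on another device; the
    x = tf.keras.layers.Flatten()(x)             # activations cross the device boundary
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
```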
We used the first approach (data parallelism) in this proof-of-concept work in the lab. The setup consists of a set of VMware vSphere host servers, each running virtual machines (with Kubernetes-managed containers inside them) and each fitted with a single T4 GPU. That GPU is presented in full to a single virtual machine and container as a virtual GPU, or vGPU, using NVIDIA’s vGPU software. There are mechanisms (including Bitfusion and NVIDIA vGPU) for sharing a physical GPU among different virtual machines on vSphere, but in this initial set of tests the GPUs were not shared.
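A quick way to confirm, from inside the guest VM or container, that the T4 vGPU presented by the NVIDIA vGPU software is visible to TensorFlow is a check along these lines:

```python
# Sanity check inside the guest: list the GPUs TensorFlow can see.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print(f"Visible GPUs: {len(gpus)}")
for gpu in gpus:
    print(gpu)   # e.g. PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
```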
One of the most popular frameworks for machine learning development is the TensorFlow platform developed by Google. Over the last few years there have been several efforts to add distribution to TensorFlow. One of the most promising of these, the Horovod framework (developed by Uber), is used in this work to handle communication between the distributed nodes in a Message Passing Interface (MPI) style. Horovod is used in particular to solve the problem of reconciling the gradients, or weights, computed by each training participant/node in the neural network. The ResNet-50 neural network was used for the tests described here.
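The snippet below is a sketch of the typical Horovod wiring for data-parallel ResNet-50 training with TensorFlow/Keras, following Horovod’s standard usage pattern rather than the exact scripts used in this work; the dataset here is a placeholder.

```python
# Hedged sketch of data-parallel ResNet-50 training with Horovod and Keras.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one MPI-style process per GPU-backed VM

# Pin this process to its local GPU (the single T4 vGPU in each VM).
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Horovod wraps the optimizer so gradients are allreduced (averaged) across
# all workers after every step -- the reconciliation described above.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="sparse_categorical_crossentropy")

# Placeholder dataset, sharded so each worker sees a different slice.
dataset = (tf.data.Dataset.from_tensor_slices(
               (tf.random.uniform([256, 224, 224, 3]),
                tf.random.uniform([256], maxval=1000, dtype=tf.int64)))
           .shard(hvd.size(), hvd.rank())
           .batch(32))

callbacks = [
    # Broadcast initial weights from rank 0 so all replicas start identical.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

model.fit(dataset, epochs=1, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

A script along these lines would normally be launched with horovodrun or mpirun, starting one process per GPU-backed VM; the host names and process counts passed to the launcher depend on the deployment.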
With the addition of Bitfusion technology on the client-side VMs and vGPU technology on the server-side VMs, many high-value vSphere features (such as vMotion) can be enabled in this architecture. This topic is introduced at the end of the webinar.
You can also read more about this work in a recent blog article and find further details on deploying GPUs on vSphere for ML.
For detailed step-by-step technical guidelines on this work, consult the Reference Design Guide by Mellanox.
Co-author: Mohan Potheri, VMware Technical Marketing team