In Part 1 of this blog series we introduced distributed machine learning and the components of the solution. In this Part 2 we look at the solution infrastructure, compare distributed training over PVRDMA and TCP/IP, and then conclude.
High Performance Virtual Infrastructure for Distributed ML
vSphere supports virtualization of the latest hardware from NVIDIA (the A100 GPU) and Mellanox (the ConnectX-6 RoCE adapter). Combining the benefits of vSphere with the capabilities of these high-performance hardware accelerators makes it possible to build a compelling solution for Horovod-based machine learning.
Infrastructure Components of the solution:
The infrastructure components are as shown. The cluster contains four Dell R740-XD vSphere hosts, each with one NVIDIA A100 GPU and a 100 Gbps Mellanox ConnectX-6 Ethernet adapter.
Figure 6: Logical schematic showing solution components
The hosts have been set up with the NVIDIA A100 GPUs in passthrough mode, and PVRDMA-based high-speed networking has been enabled.
Table 1: vSphere Infrastructure Components used in the solution
Ubuntu 20.04 based virtual machines were set up as the training nodes. Each worker node has a passthrough GPU representing an entire NVIDIA A100 processor, and the workers are spread across the four physical hosts. PVRDMA-based networking provides 100 Gbps capability to the virtual machines.
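Before installing the training stack, it is worth confirming that each VM actually sees its passthrough GPU and the PVRDMA adapter. Below is a minimal sanity-check sketch, assuming PyTorch is already installed in the guest and that the PVRDMA device registers under the standard Linux RDMA class path:

```python
# Illustrative sanity check run inside a worker VM; not part of the training code.
import os
import torch

# The passthrough GPU should appear to the guest as a full NVIDIA A100.
assert torch.cuda.is_available(), "No CUDA device visible in this VM"
print("GPU:", torch.cuda.get_device_name(0))

# The PVRDMA adapter shows up as an RDMA device in the kernel's InfiniBand class
# (e.g. via the vmw_pvrdma driver); list whatever is registered there.
ib_path = "/sys/class/infiniband"
print("RDMA devices:", os.listdir(ib_path) if os.path.isdir(ib_path) else "none found")
```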
Table 2: Software components used in the solution
The solution used the software components shown in Table 2 for the Horovod training exercise. PyTorch was the machine learning platform, and MPI was leveraged for the PVRDMA-based training.
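To make the Horovod, PyTorch, and MPI pieces concrete, the sketch below shows the standard Horovod initialization pattern for PyTorch. The model choice (ResNet-50 here) and hyperparameters are placeholders rather than the exact script from Appendix A, but the hvd.* calls are the stock Horovod API whose allreduce traffic rides on MPI, over either PVRDMA or TCP/IP:

```python
# Minimal Horovod/PyTorch setup sketch; model and hyperparameters are placeholders.
import torch
import horovod.torch as hvd
import torchvision.models as models

hvd.init()                                # one process per GPU/VM, launched via MPI
torch.cuda.set_device(hvd.local_rank())   # bind this rank to its local A100

model = models.resnet50(num_classes=10450).cuda()
# Placeholder base learning rate, scaled by worker count as is common with Horovod.
optimizer = torch.optim.SGD(model.parameters(), lr=0.0125 * hvd.size(), momentum=0.9)

# Wrap the optimizer so gradients are averaged across workers with allreduce;
# this is the traffic that runs over PVRDMA or TCP/IP in the comparison below.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start all workers from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

A typical launch on a cluster like this would start one process per VM, for example with horovodrun -np 4 -H vm1:1,vm2:1,vm3:1,vm4:1 python train.py (host names illustrative).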
Testing
Training Dataset
A large dataset with millions of images was used for the training exercise. Details of the ImageNet training for image recognition are shown in Table 3. All classes with fewer than 500 images were omitted, leaving 10,450 classes for the model. The dataset was split into close to 11 million images for training and 500,000 images for validation. The model was trained for a total of 48 epochs. Appendix A shows the Python code used for the training exercise.
Table 3: Dataset Training details
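For readers who want to see how the class filtering and train/validation split described above could be expressed, here is one possible sketch using torchvision; the dataset path, transform, and split size are placeholders, and the actual code used is in Appendix A:

```python
# Illustrative dataset preparation; the path and split sizes are placeholders.
import os
from collections import Counter

import torch
from torchvision import datasets, transforms

DATA_ROOT = "/data/imagenet21k"   # hypothetical location of the class folders

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
])
full = datasets.ImageFolder(DATA_ROOT, transform=transform)

# Keep only classes that have at least 500 images (about 10,450 classes here).
counts = Counter(target for _, target in full.samples)
keep = {t for t, n in counts.items() if n >= 500}
indices = [i for i, (_, t) in enumerate(full.samples) if t in keep]
filtered = torch.utils.data.Subset(full, indices)

# Carve out a validation set (roughly 11M training / 500K validation images).
val_size = 500_000
train_set, val_set = torch.utils.data.random_split(
    filtered, [len(filtered) - val_size, val_size])
```

In a full script the surviving class labels would also be remapped to a contiguous 0..10,449 range, and each worker would read its shard of the training set through a DistributedSampler.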
Results:
The infrastructure was used to run Horovod training of the models on the large image dataset. The training leveraged the ImageNet21K model as the baseline and was run with and without RDMA on 1-, 2-, and 4-node combinations. The images per second and the training time per epoch were measured during the training runs.
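Figure 8 shows the measured throughput. As a rough illustration of how these metrics can be collected, the sketch below times one epoch and aggregates images/sec across workers with a Horovod allreduce; the loop is a simplified stand-in for the code in Appendix A:

```python
# Illustrative per-epoch timing; model, loader, and optimizer come from the
# earlier sketches and are not the exact objects used in Appendix A.
import time
import torch
import torch.nn.functional as F
import horovod.torch as hvd

def train_epoch(model, loader, optimizer, epoch):
    model.train()
    start = time.time()
    local_images = 0
    for images, targets in loader:
        images = images.cuda(non_blocking=True)
        targets = targets.cuda(non_blocking=True)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), targets)
        loss.backward()
        optimizer.step()              # Horovod averages gradients across workers here
        local_images += images.size(0)
    elapsed = time.time() - start

    # Sum the image counts from all workers before reporting cluster throughput.
    total = hvd.allreduce(torch.tensor(float(local_images)), op=hvd.Sum).item()
    if hvd.rank() == 0:
        print(f"epoch {epoch}: {elapsed:.0f} s, {total / elapsed:.0f} images/sec")
```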
Figure 8: Image processing throughput with Horovod based ML
The results show that when the nodes communicate via PVRDMA there is linear scalability in performance, as seen in the images/sec processed. The overhead associated with TCP results in very poor scalability as the number of GPUs increases.
Figure 9: Training times per epoch for PVRDMA vs. TCP/IP
This figure presents the same data as training times. PVRDMA scales much better and delivers lower (faster) epoch times than regular TCP/IP.
Key Takeaways:
The solution clearly demonstrated the following:
- vSphere supports accelerators like NVIDIA GPUs with RDMA-based networking
- PVRDMA with the Mellanox ConnectX-6 provides shared, high-speed, low-latency networking
- Horovod can run effectively on vSphere, leveraging vGPU and PVRDMA capabilities
- Results show that PVRDMA consistently outperforms TCP/IP in distributed ML training
- Running distributed machine learning can greatly improve efficiency and reduce the need for multiple GPUs in a single machine