Introducing vHPC Toolkit for High Performance Computing and Machine Learning

Overview

High Performance Computing (HPC) is the use of parallel-processing techniques to solve complex computational problems. HPC systems deliver sustained performance through the concurrent use of distributed computing resources and are typically used to solve advanced scientific and engineering problems, such as computational fluid dynamics, bioinformatics, molecular dynamics, weather modeling, and deep learning with neural networks.

With continued efforts to improve performance to levels near that of bare-metal HPC environments, the trend toward virtualized HPC (vHPC) is rapidly growing. This is particularly true for enterprise-grade workloads like machine learning and deep learning.

Because of their extreme performance demands, HPC workloads often have far more intensive resource requirements than typical enterprise workloads. For example, HPC commonly uses hardware accelerators such as GPUs and FPGAs for compute, as well as RDMA interconnects for fast communication, all of which require special vSphere configurations.

This toolkit is intended to facilitate managing the lifecycle of these special configurations by leveraging vSphere APIs. It also includes features that help vSphere administrators perform some common vSphere tasks that are related to creating such high-performing environments, such as VM cloning, setting Latency Sensitivity, and sizing vCPUs, memory, etc.

Open Source/Flings

This toolkit is a VMware open source project released under the Apache 2 license. It is available on GitHub at vmware/vhpc-toolkit. It is also listed in the VMware OCTO Flings program as Virtualized High Performance Computing Toolkit.

How Does vHPC Toolkit Work?

This toolkit currently supports vSphere 6.5 and 6.7 and requires Python 3. Follow the instructions in the README to pip install all required packages and to set the vCenter IP/FQDN in a vCenter.conf file. Once vCenter.conf is set up correctly, you can run vhpc_toolkit from the bin folder to enter an interactive shell and perform all available operations.

There are two categories of functions in this toolkit: (1) configuration of vHPC environments; (2) vHPC cluster creation and destruction using a configuration file.

Configuration of vHPC Environments

Using this toolkit, we can easily apply the following operations to a single VM or a list of VMs:

  • Perform common vSphere tasks, such as cloning VMs, configuring vCPUs, memory, reservations, shares, Latency Sensitivity, Distributed Virtual Switch/Standard Virtual Switch, network adapters and network configurations
  • Configure PCIe devices in DirectPath I/O mode, such as GPU, FPGA and RDMA interconnects
  • Configure NVIDIA vGPU
  • Configure RDMA SR-IOV (Single Root I/O Virtualization)
  • Configure PVRDMA (Paravirtualized RDMA)

The following examples illustrate the usage of some commands.

Clone and Customize VM

Clone multiple VMs based on a template named “vhpc_clone” with specified CPU and memory customization:

where VM-file is the name of a file containing the list of cloned VM names, one per line.
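As an illustration of the VM-file format described above (plain text, one VM name per line), here is a minimal Python sketch that generates and reads back such a file. The helper names write_vm_file and read_vm_file, and the "vhpc-vm-" name prefix, are hypothetical and are not part of the toolkit.

```python
from pathlib import Path


def write_vm_file(path, prefix, count, start=1, width=2):
    """Write `count` VM names to `path`, one per line, and return them.

    Names are built as <prefix><zero-padded index>; this naming scheme is
    an assumption for illustration, not a toolkit requirement.
    """
    names = [f"{prefix}{i:0{width}d}" for i in range(start, start + count)]
    Path(path).write_text("\n".join(names) + "\n")
    return names


def read_vm_file(path):
    """Return the VM names listed in `path`, ignoring blank lines."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]
```

Calling write_vm_file("VM-file", "vhpc-vm-", 4) would produce a file listing vhpc-vm-01 through vhpc-vm-04, which could then be passed to the commands that operate on a list of VMs.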

Configure GPU DirectPath I/O (Passthrough)

Add GPU device 0000:84:00.0 in Passthrough mode into each above cloned VM:

where “0000:84:00.0” is the SBDF address (segment:bus:device.function) of the GPU device. This value can be found under “Host” -> “Configure” -> “Hardware” -> “PCI Devices” in vCenter.
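Because a mistyped address is an easy error to make, it can help to validate the SBDF string before using it. The following is a minimal sketch, assuming the common hexadecimal notation ssss:bb:dd.f; the names PciAddress and parse_sbdf are hypothetical and not part of the toolkit.

```python
import re
from typing import NamedTuple


class PciAddress(NamedTuple):
    segment: int
    bus: int
    device: int
    function: int


# Four hex digits, two hex digits, two hex digits, then one digit 0-7.
_SBDF_RE = re.compile(
    r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_sbdf(text):
    """Parse 'segment:bus:device.function' into its integer fields."""
    match = _SBDF_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid SBDF address: {text!r}")
    segment, bus, device, function = match.groups()
    device_number = int(device, 16)
    # A PCI device number is a 5-bit field, so it cannot exceed 0x1f.
    if device_number > 0x1F:
        raise ValueError(f"device number out of range in {text!r}")
    return PciAddress(int(segment, 16), int(bus, 16), device_number, int(function))
```

With this sketch, parse_sbdf("0000:84:00.0") yields segment 0, bus 0x84, device 0, function 0, while a malformed string such as "0000:84.00" raises ValueError.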

Configure NVIDIA vGPU

Or add NVIDIA vGPU with vGPU profile grid_p100-4q (NVIDIA P100) into each cloned VM:

where the profile identifies the vGPU type and “4q” indicates a vGPU memory size of 4 GB.
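To make the naming convention concrete, here is a minimal sketch that splits a profile name of the form grid_<gpu>-<size><letter> into its parts. It is based only on the single example in this post: reading the number as gigabytes is an assumption drawn from that example, profile names that do not fit the simple pattern are rejected, and parse_vgpu_profile is a hypothetical helper, not a toolkit or NVIDIA API.

```python
import re

# Matches names shaped like "grid_p100-4q": GPU model, size, series letter.
_PROFILE_RE = re.compile(r"^grid_([a-z0-9]+)-(\d+)([a-z])$")


def parse_vgpu_profile(profile):
    """Split a vGPU profile name into GPU model, memory size, and series.

    Assumption: the number before the series letter is the vGPU memory
    size in GB, as in the article's "4q" example. A size of 0 is rejected
    because it cannot be read as a whole number of gigabytes.
    """
    match = _PROFILE_RE.match(profile.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized vGPU profile name: {profile!r}")
    gpu, size, series = match.groups()
    memory_gb = int(size)
    if memory_gb == 0:
        raise ValueError(f"cannot interpret memory size in {profile!r}")
    return {"gpu": gpu, "memory_gb": memory_gb, "series": series}
```

For the profile used above, parse_vgpu_profile("grid_p100-4q") returns the GPU model "p100", a memory size of 4, and the series letter "q".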

Configure CPU/Memory Reservation and Latency Sensitivity

The above three commands reserve CPUs as well as memory and set “Latency Sensitivity” to “High” for each VM in the VM-file.

Execute Post Scripts in Guest OS

You will be prompted for the guest OS password before the installation script is executed. This function facilitates guest OS customization after VMs are provisioned.

For more examples, please refer to sample operations in the project docs.

vHPC Cluster Creation and Destruction using a Configuration File

This function can help vSphere administrators create/destroy virtual HPC clusters using a cluster configuration file as input.

For example, create a cluster based on the cluster configuration file “cluster.conf”:

Similarly, destroy the cluster:

The cluster configuration file lets you define an HPC/ML cluster whose VMs have a variety of special attributes. Here is a sample cluster configuration file:

You can define different virtual clusters with a variety of configurations, including VMs with GPU Passthrough, InfiniBand Passthrough/SR-IOV, and RoCE (RDMA over Converged Ethernet) Passthrough/SR-IOV/PVRDMA. For details on the syntax and more sample files that define different virtualized HPC/ML clusters, see the README and the sample cluster configuration files in the project.

Extensibility

The toolkit is also built with extensibility in mind. It is easy to add additional operations that are currently not supported.

Call for Participation

Feel free to try out the tool and, as always, we strongly encourage you to report bugs and suggest improvements. We also welcome contributions to the tool from the community!