Introduction
Modern data scientists need a diverse set of tools and infrastructure to deliver valuable insights to their organizations, yet IT infrastructure has struggled to keep up with their unique needs. As data science becomes increasingly important to the business, infrastructure must evolve rapidly to meet these complex requirements, bringing the agility and flexibility that make data scientists productive through modern developer platforms like Kubernetes. IT should also be able to provide access to specialized hardware such as GPUs and other accelerators.
Kubernetes evolving as the leading development platform
Enterprises are shifting their focus from infrastructure to application development. It is imperative that companies increase developer productivity, shorten the path to production, and accelerate the cadence of new features and services. Easy access to resources is crucial for developer success, but there are many additional reasons developers like Kubernetes, including its inherent resilience, repeatability, flexibility, and visibility. Because Kubernetes is flexible rather than prescriptive, it adapts to a wide range of developer needs. A growing subset of this developer community are the data scientists and data engineers who build machine learning workloads, as AI takes on a significant role in today's business.
GPUs for Machine Learning
With the impending end of Moore's law, the spark fueling the current revolution in deep learning is having enough compute horsepower to train neural-network-based models in a reasonable amount of time. That horsepower comes largely from GPUs, which NVIDIA has been optimizing for deep learning since 2012. Much of machine learning and AI work involves processing large blocks of data, which makes GPUs a good fit for ML tasks, and most machine learning frameworks have built-in GPU support. Data scientists therefore need capabilities such as GPU access from within their Kubernetes environments.
vSphere 7 brings together Kubernetes and GPU sharing capabilities with Bitfusion
The release of vSphere 7 introduced many new features, including integrated Kubernetes and Bitfusion. Kubernetes, the preferred platform for developers, is now fully integrated with vSphere as a first-class citizen alongside virtual machines. vSphere Bitfusion provides the ability to share NVIDIA GPUs over the network. Modern applications often need access to GPUs and their massive compute capabilities for timely and efficient processing, and Kubernetes-based developer environments increasingly need that access as well. Combining the VMware Kubernetes platform with vSphere Bitfusion for GPU sharing over the network can meet the infrastructure needs of modern data scientists. This solution showcases the integration of VMware Kubernetes platforms such as TKG and TKGI with vSphere Bitfusion.
Tanzu Kubernetes Grid Integrated (TKGI formerly VMware Enterprise PKS)
TKGI is a purpose-built container solution to operationalize Kubernetes for multi-cloud enterprises and service providers. It significantly simplifies the deployment and management of Kubernetes clusters with day 1 and day 2 operations support. With hardened production-grade capabilities, TKGI takes care of your container deployments from the application layer all the way to the infrastructure layer.
TKGI builds on Kubernetes, BOSH, VMware NSX-T, and Project Harbor to form a production-grade, highly available container runtime that operates on vSphere and public clouds. With built-in intelligence and integration, TKGI ties these open source and commercial modules together into a simple-to-use product, giving customers the most efficient Kubernetes deployment and management experience possible.
The Solution:
Modern application developers and data scientists need to leverage Kubernetes and access GPUs for training and other workloads, but it can be cost-prohibitive to provide each developer with dedicated GPUs. vSphere Bitfusion provides access over the network to GPUs that are aggregated into a dedicated vSphere cluster or resource pool. This solution shows the integration of Bitfusion with three different types of Tanzu Kubernetes clusters: TKG guest, TKG Supervisor, and TKGI.
The logical schematic below shows the components of the solution as validated.
Figure 1: Logical schematic of solution
All GPU resources are consolidated in a GPU cluster. The Bitfusion servers are sourced from this cluster, where the GPUs are attached to them directly in passthrough mode. The Kubernetes clusters, which include the different VMware Kubernetes variants (Supervisor, TKG guest, and TKGI) represented as VMware compute clusters, are connected to the Bitfusion server farm over high-speed datacenter networking. Egress network traffic from the Kubernetes clusters is routed by NSX-T to the Bitfusion GPU server farm to establish client-server connectivity to the remote GPUs. Both the GPU cluster and the compute clusters run vSphere 7 and are managed by vCenter 7.
Enabling Bitfusion access:
Bitfusion server components can be installed and enabled as discussed here. After Bitfusion is installed, integrated with vCenter, and licensed, it appears as a vCenter plugin, and virtual machines associated with that vCenter can be enabled for Bitfusion client access. For Kubernetes environments, the Bitfusion client needs to be embedded inside the container/pod, so an example Linux virtual machine is used to obtain the client files. The Linux machine is enabled for Bitfusion as shown in the process below; once enabled, the files are extracted from it and used to build a container image that acts as a Bitfusion client. The process of enabling a virtual machine for Bitfusion client access is shown in the figure below.
Figure 2: Enabling Bitfusion client for a virtual machine
Customization to enable Bitfusion in containers:
The Bitfusion client components are installed in the container itself so that it works seamlessly after the pod is deployed; no changes are needed to the Kubernetes worker nodes. Here are the steps to enable the Bitfusion client to work inside a container.
Once you have identified a client Linux machine for Bitfusion client deployment:
- Enable Bitfusion on the client Linux machine
- Install the Bitfusion client for that operating system (refer to the Bitfusion installation documentation)
Confirm that it works fine by running the command:
bitfusion list_gpus
This Bitfusion client machine now has the necessary authorization files for Bitfusion access.
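Create a staging directory on the client machine to collect the files that will be baked into the container image; the name bitfusion-files is just the example used here:
mkdir -p bitfusion-files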
Then copy the following files into the staging directory:
cp ~/.bitfusion/client.yaml bitfusion-files/
cp /etc/bitfusion/servers.conf bitfusion-files/
cp /etc/bitfusion/tls/ca.crt bitfusion-files/
Additionally, download and copy the appropriate installer .deb or .rpm file to the same staging directory:
For example, for Ubuntu, we use a .deb file as shown below:
cp bitfusion-client-ubuntu1604_2.0.0_amd64.deb bitfusion-files/
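At this point the staging directory should contain the two configuration files, the TLS certificate, and the installer package:
ls bitfusion-files/
# bitfusion-client-ubuntu1604_2.0.0_amd64.deb  ca.crt  client.yaml  servers.conf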
When you create the Docker image for the container, you need to install the Bitfusion client and copy the authorization files to the appropriate directories in the container image. Additionally, install open-vm-tools. In the example shown, the files are copied to the appropriate directories for the root user.
To do this, add the following to your Dockerfile:
# ---------------------------------------------------------------------
# Copy Bitfusion client configuration and authorization files
# ---------------------------------------------------------------------
RUN mkdir -p /root/.bitfusion
COPY ./bitfusion-files/client.yaml /root/.bitfusion/client.yaml
COPY ./bitfusion-files/servers.conf /etc/bitfusion/servers.conf
RUN mkdir -p /etc/bitfusion/tls
COPY ./bitfusion-files/ca.crt /etc/bitfusion/tls/ca.crt
# ---------------------------------------------------------------------
# Update the package list, then install the Bitfusion client
# (.deb file for Ubuntu 16.04) and open-vm-tools
# ---------------------------------------------------------------------
# Set the initial working directory
RUN mkdir -p /workspace/bitfusion
WORKDIR /workspace/bitfusion
# Copy the release version of the Bitfusion client installer
COPY ./bitfusion-files/bitfusion-client-ubuntu1604_2.0.0_amd64.deb .
RUN apt-get update \
    && apt-get install -y ./bitfusion-client-ubuntu1604_2.0.0_amd64.deb \
    && apt-get install -y open-vm-tools \
    && rm -rf /var/lib/apt/lists/*
# ---------------------------------------------------------------------
These steps ensure that the Bitfusion components are baked into the container image itself. Appendices A and C show the Dockerfiles used to create TensorFlow and PyTorch containers with the Bitfusion client embedded in them.
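As a quick illustration, here is a minimal sketch of how such an image could be built, pushed, and exercised from a Kubernetes cluster. The registry path, image tag, and training script below are placeholders for this example; the bitfusion run wrapper requests one remote GPU for the wrapped process.
# Build the image from the Dockerfile above and push it to a registry reachable by the cluster
docker build -t registry.example.com/demo/tensorflow-bitfusion:2.0 .
docker push registry.example.com/demo/tensorflow-bitfusion:2.0
# Launch a test pod that runs a training script through the Bitfusion client
kubectl run bitfusion-test --restart=Never \
  --image=registry.example.com/demo/tensorflow-bitfusion:2.0 \
  --command -- bitfusion run -n 1 -- python /workspace/train.py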
Kubernetes Setup:
Workload Management is a new capability in vSphere 7 that helps manage Kubernetes workloads. All Tanzu Kubernetes clusters and pods are visualized and managed in vCenter through Workload Management. The Tanzu Kubernetes clusters deployed for this solution are shown below.
Figure 3: Kubernetes namespaces in vCenter Workload management
The hpc2 namespace represents the Kubernetes namespace running in the vSphere cluster HPC2. All aspects of the Kubernetes namespace, such as control plane nodes, TKG guest clusters, and storage policies, can be seen. The figure below shows the representation of the namespace and its clusters, including control and worker nodes, in the “Hosts and Clusters” vCenter view.
Figure 4: Kubernetes components as seen in traditional “Hosts and Clusters” view
An example guest cluster, hpc2-dev-cluster5, is shown with its control and worker nodes. The PyTorch virtual machine is a podVM running in the hpc2 supervisor cluster, and the supervisor cluster control plane nodes appear at the bottom of the list. The Kubernetes command line is used to log in to a guest cluster and set the context, as shown below.
Figure 5: Kubernetes login and node listing for a TKG guest cluster
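The login and context commands typically look like the following sketch; the Supervisor control plane address and user name are placeholders, while the hpc2 namespace and the hpc2-dev-cluster5 guest cluster are the ones used in this environment.
# Log in to the guest cluster through the Supervisor control plane
kubectl vsphere login --server=<supervisor-control-plane-ip> \
  --vsphere-username administrator@vsphere.local \
  --tanzu-kubernetes-cluster-namespace hpc2 \
  --tanzu-kubernetes-cluster-name hpc2-dev-cluster5
# Switch to the guest cluster context and list its nodes
kubectl config use-context hpc2-dev-cluster5
kubectl get nodes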
Similarly, one can log in directly to the Supervisor cluster to run native pods that execute directly on the ESXi hypervisor. A TKGI Kubernetes cluster was also set up in the same vCenter and used for the validation.
Full details of the solution can be found in this whitepaper.
In part 2 of this series, we will look at the validation of the solution with Bitfusion access from the TKG guest, TKG Supervisor, and TKGI Kubernetes cluster variants.
Call to Action:
- Consider upgrading to vSphere 7 with Tanzu to meet developer needs
- Consolidate GPUs into a GPU server farm to optimize usage
- Provide easy access to GPUs for your data scientists and developers from Kubernetes by leveraging vSphere Bitfusion