
Machine Learning on Kubernetes with Caffe2 & PyTorch on VMware SDDC & PKS Enterprise (Part 1 of 2)


Kubernetes is a popular platform for deploying modern applications. In an earlier blog we looked at how TensorFlow can be leveraged on the vSphere platform for common ML use cases. In this series we look at running Caffe2 and PyTorch, which are popular open source ML frameworks. The solution focused on validating Caffe2 and PyTorch on the VMware Software Defined Data Center (SDDC) for machine learning with NVIDIA GPUs. It was deployed and tested on Kubernetes powered by VMware Enterprise PKS.

Caffe2 & PyTorch

Caffe2 is a lightweight, modular framework that comes production-ready. It is designed with expression, speed, and modularity in mind, allowing for a more flexible way to organize computation. It aims to provide an easy and straightforward way to experiment with deep learning by leveraging community contributions of new models and algorithms. Caffe2 comes with native Python and C++ APIs.

PyTorch is an open source deep learning platform that provides a seamless path from research prototyping to production deployment. The main focus of Caffe2 development has been performance and cross-platform deployment, whereas PyTorch has focused on flexibility for rapid prototyping and research. PyTorch is based on Python and takes a dynamic approach to graph computation. It aims for fast deep learning and higher developer productivity, and it is designed to be easy to learn and simple to code.
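To make the idea of a dynamic approach to graph computation concrete, here is a minimal sketch in plain Python. It is not PyTorch code and not how PyTorch is implemented; the `Value` class and `forward` function are hypothetical names invented for this illustration. The point it shows is that the computation graph is recorded while ordinary Python code runs, so loops and branches can change the graph on every call.

```python
class Value:
    """A scalar that remembers how it was computed (one node in a dynamic graph)."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents      # nodes this value was computed from
        self._grad_fns = grad_fns    # maps this node's gradient to each parent's share

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        # Topologically order the graph that was recorded during the forward pass.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, accumulating gradients.
        for node in reversed(order):
            for parent, grad_fn in zip(node._parents, node._grad_fns):
                parent.grad += grad_fn(node.grad)


def forward(x, steps):
    # Ordinary Python control flow decides the shape of the graph on each call.
    y = x
    for _ in range(steps):
        y = y * x if y.data < 10 else y + x
    return y


x = Value(2.0)
y = forward(x, 3)   # builds x * x * x * x for this particular input
y.backward()
print(y.data, x.grad)
```

Calling `forward` with a different number of steps, or a different input value, records a different graph without any separate graph-definition step.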

VMware Enterprise PKS

VMware Enterprise PKS is a purpose-built product that enables enterprises and service providers to simplify the deployment and operations of Kubernetes clusters (Source: VMware Enterprise PKS Overview). It provides a production-grade Kubernetes distribution with deep NSX-T integration for advanced networking, a built-in private registry with enterprise security features, and full lifecycle management of the clusters. VMware Enterprise PKS uses the latest stable open source distribution of Kubernetes with no proprietary extensions. It is built to support multi-cloud environments through BOSH, an open source project in the Cloud Foundry Foundation. VMware Enterprise PKS runs on vSphere, Google Cloud Platform and Amazon EC2.

Figure 1: VMware Enterprise PKS works with VMware SDDC.  Source: VMware Enterprise PKS Overview

The major benefits offered by the platform include:

  • Simplified Operations: Streamline both day-1 deployment and day-2 operations tasks with full lifecycle management of multiple clusters and enhanced isolation, security and performance.
  • Built for Production: VMware Enterprise PKS is built for running critical workloads in production, with enterprise features such as enhanced security, high availability, rolling upgrade, constant health monitoring and self-healing.
  • Comprehensive Solution: VMware Enterprise PKS addresses a broad range of Kubernetes challenges such as networking, security, storage, monitoring and logging. This is achieved by including NSX-T and Harbor and integrating with other VMware products.
  • A Multi-Cloud World: VMware Enterprise PKS runs seamlessly on vSphere, as well as on public clouds like Google Cloud Platform and Amazon EC2.

VMware Software Defined Data Center (SDDC)

The VMware Software Defined Data Center makes it possible to define infrastructure components as software. VMware SDDC lets you centrally manage all of the data center configuration information and provides a powerful, flexible, and secure foundation for business agility that accelerates your digital transformation to hybrid cloud and success in the digital economy. The PKS solution leverages the VMware SDDC to create an enterprise-class Kubernetes environment.

Figure 2: The VMware Software Defined Data Center

The VMware SDDC platform brings a number of benefits, including:

  • Higher efficiency and lower costs. Virtualized IT services and automated operations management drive new levels of resource utilization and staff productivity.
  • Application provisioning in minutes. Policy-based configuration lets you deliver workloads in minutes, with resources that adjust automatically to changing business demands.
  • The right availability and security for every application. Automated business continuity and virtualization-aware security provide exceptional uptime and control of resources.
  • Any workload delivered anywhere. Run both new and existing applications across multiple platforms and clouds, with instant delivery to any user on any desktop or mobile device.

HPE Deep Learning Benchmarking Suite

The testing was done with the HPE Deep Learning Cookbook. Its Deep Learning Benchmarking Suite (DLBS) is a collection of command line tools for running consistent and reproducible deep learning benchmark experiments on various hardware/software platforms. DLBS provides implementations of a number of neural networks and is used to perform apples-to-apples comparisons across all supported frameworks. Multiple models such as VGGs, ResNets, AlexNet and GoogLeNet are supported. The supported frameworks include various forks of Caffe (BVLC/NVIDIA/Intel), Caffe2, TensorFlow, MXNet and PyTorch. DLBS also supports NVIDIA's inference engine TensorRT, for which it provides a highly optimized benchmark backend.
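The idea of a consistent, apples-to-apples benchmark can be sketched as running every framework against the same grid of models and batch sizes and reporting the same metric for each run. The snippet below is only an illustration of that idea in Python. It does not use the DLBS command line tools or API; the function names and the framework, model and batch-size values are assumptions chosen for the example, not the configuration used in this solution.

```python
from itertools import product
from statistics import mean

# Illustrative values only; not the configuration used in this solution.
FRAMEWORKS = ["caffe2", "pytorch"]
MODELS = ["resnet50", "vgg16"]
BATCH_SIZES = [32, 64]


def build_experiments(frameworks, models, batch_sizes):
    """Return one experiment description per framework/model/batch-size combination."""
    return [
        {"framework": fw, "model": model, "batch_size": bs}
        for fw, model, bs in product(frameworks, models, batch_sizes)
    ]


def throughput(batch_size, batch_times_s):
    """Images per second, from the mean time in seconds to process one batch."""
    return batch_size / mean(batch_times_s)


experiments = build_experiments(FRAMEWORKS, MODELS, BATCH_SIZES)
for exp in experiments:
    print(exp)
```

Because every framework sees the same model and batch-size combinations, a metric such as `throughput` can be compared directly between runs.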

Virtual Infrastructure Components

The virtual infrastructure used to build the solution is shown below:

Table 1: HW components of the solution

The VMware SDDC and other SW components used in the solution are shown below:

Table 2: SW components of the solution

Logical Architecture of Solution Deployed

PKS provided the framework to create Kubernetes clusters that work seamlessly with the VMware SDDC components. A logical schematic of the Kubernetes cluster and the machine learning components is shown below:

Figure 3: Logical Schematic of Solution

Solution Deployment:

  • VMware Enterprise PKS was installed on a vSphere cluster with six Dell PowerEdge R740 servers.
  • Two of these nodes had one NVIDIA V100 GPU card each.
  • The Bitfusion FlexDirect server was deployed on two Linux virtual machines, each attached to an NVIDIA GPU.
  • A three-node Kubernetes cluster was created under PKS.
  • Docker images were created with the following components:
    1. Bitfusion FlexDirect client
    2. Caffe2 or PyTorch
  • The HPE Deep Learning Cookbook was customized to use Bitfusion FlexDirect and to support automation.
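As a rough illustration of how one of these container images could be scheduled on the Kubernetes cluster, the following Python sketch builds a minimal Pod manifest and prints it as JSON. The pod name, registry path and container command are hypothetical placeholders; the source does not give the actual image names or the benchmark invocation, so this is a sketch under those assumptions rather than the manifest used in the solution.

```python
import json


def build_pod_manifest(name, image, command):
    """Build a minimal Kubernetes Pod manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {"name": name, "image": image, "command": command}
            ],
        },
    }


manifest = build_pod_manifest(
    name="pytorch-benchmark",
    # Hypothetical image name, standing in for an image that bundles the
    # Bitfusion FlexDirect client with Caffe2 or PyTorch.
    image="registry.example.com/ml/pytorch-flexdirect:latest",
    # Placeholder command; the actual benchmark invocation is not given in the source.
    command=["python", "-c", "import torch; print(torch.__version__)"],
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON as well as YAML manifests.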

In part 2 we will look at the validation of the solution and the results.