Kubernetes Machine Learning

Scaling Distributed Machine Learning leveraging vSphere, Bitfusion and NVIDIA GPU (Part 2 of 2)

In part 1 of this blog, we introduced distributed machine learning and the virtualized infrastructure components used in the solution. In this part, we look at the testing methodology and the results, analyzing the scalability of worker nodes and of fractional GPUs provided through Bitfusion.

 

 

5 Testing

 

5.1        Running the Distributed TensorFlow benchmark in Kubeflow with Bitfusion

 

The following is a summary of the steps used to deploy and run the solution.

 

  • VMware Tanzu Kubernetes Grid (TKG) and Kubeflow were installed.
  • Ubuntu 16.04 Docker images were used in the solution.
  • A Docker image was used as the base, with Bitfusion Flexdirect installed inside it. Appendix B shows the Dockerfile that was used to create the image.
  • A distributed TensorFlow job was created using TensorFlow Training (TFJob). TFJob is a Kubernetes custom resource that you can use to run TensorFlow training jobs on Kubernetes.
  • Appendix A shows a YAML file that runs a distributed TensorFlow training job with 4 parameter servers (PS) and 12 workers, each worker using 0.25 of a GPU, with Bitfusion Flexdirect providing GPU access over the network.
  • Parameter servers (PS) provide a distributed data store for the model parameters.
  • Workers do the actual work of training the model.
  • The following table shows the combinations of TFJobs that were run with different GPU allocations (1.0, 0.5, and 0.25, to test partial GPUs as well) and different numbers of worker and parameter server nodes:

 

Table 3: Number of parameter servers and workers for different use cases

 

  • The job was run with TCP/IP as the transport mechanism over 100 Gbps network infrastructure.
  • The training was run for 100 steps, which takes a few minutes on a GPU cluster.
  • The logs were inspected to monitor the progress of the jobs.
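The combinations in Table 3 hold aggregate GPU capacity constant while scaling the worker count, which is what makes the throughput comparisons meaningful. A minimal sketch of that arithmetic (the worker counts and fractions below are illustrative, not necessarily the exact values from Table 3):

```python
# Illustrative worker-count / fractional-GPU combinations in the spirit of
# Table 3; each row consumes the same aggregate GPU capacity.
combinations = [
    (4, 1.0),    # 4 workers, one full GPU each
    (8, 0.5),    # 8 workers, half a GPU each
    (16, 0.25),  # 16 workers, a quarter of a GPU each
]

for workers, gpu_fraction in combinations:
    total_gpus = workers * gpu_fraction
    print(f"{workers:2d} workers x {gpu_fraction} GPU = {total_gpus} GPUs total")
```

Because the total stays fixed, any throughput difference between rows reflects the cost of distributing the work, not a change in available GPU resources.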

 

6       Results

 

The infrastructure was used to run the distributed TensorFlow machine learning benchmarks, varying the total number of worker nodes and GPUs used. The tests were run over TCP/IP and the results were compared across configurations.

Figure 5: Image processing throughput with Distributed TensorFlow

 

The bar chart compares image processing throughput for the three use cases. Images processed per second show less than 5% degradation even when training is distributed across four times the number of nodes, compared to the baseline of four nodes with one GPU each.
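The degradation figure can be computed from the throughput numbers in Table 4. A minimal sketch, using made-up throughput values rather than the measured ones:

```python
def degradation_pct(baseline_imgs_per_sec: float,
                    distributed_imgs_per_sec: float) -> float:
    """Percentage throughput loss relative to the baseline configuration."""
    return (baseline_imgs_per_sec - distributed_imgs_per_sec) \
        / baseline_imgs_per_sec * 100.0

# Hypothetical example: 1000 images/sec for the baseline (4 nodes, 1 GPU each)
# vs. 960 images/sec spread across 16 nodes with 0.25 GPU each.
print(degradation_pct(1000.0, 960.0))  # 4.0 -> under the 5% threshold
```

Substituting the measured values from Table 4 gives the actual per-configuration degradation.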

Table 4: Images processed per second with different worker & GPU combinations

Figure 6: Image processing throughput with Distributed TensorFlow

 

In figure 6 we can see near-linear performance scalability of the distributed worker nodes when using fractional GPUs. Depending on the average GPU needs of the user community, Bitfusion can be leveraged to provide full or partial GPUs in a flexible manner with minimal impact on performance.

 

7      Conclusion

Distributed machine learning across multiple nodes can be used effectively for training. The results showed the effectiveness of sharing GPUs across jobs with minimal loss of performance. VMware Bitfusion makes distributed training scalable across physical resources, effectively removing GPU capacity as a constraint. The solution showcases the benefits of combining best-in-class infrastructure: NVIDIA GPUs with vComputeServer software and VMware SDDC with Bitfusion, to run distributed machine learning on a scalable infrastructure.

 

 

Appendix A: YAML File used for the Solution

 

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: tfjob1
  namespace: kubeflow
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    PS:
      replicas: 4
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: tfjob1
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    # operator: Exists
                    operator: In
                    values:
                    - tfjob1
                topologyKey: "kubernetes.io/hostname"
          containers:
          - name: tensorflow
            image: pmohan77/tfjob:tf_cuda
            imagePullPolicy: Always
            command:
              - flexdirect
              - run
              - --num_gpus=1
              - --partial=0.25
              - python
              - /opt/tf-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              - --variable_update=parameter_server
              - --model=resnet50
              - --batch_size=32
              - --num_batches=100
              - --data_format=NCHW
              - --device=gpu
            volumeMounts:
              - name: bitfusionio
                mountPath: /etc/bitfusionio
                readOnly: true
          volumes:
            - name: bitfusionio
              hostPath:
                path: /etc/bitfusionio
    Worker:
      replicas: 12
      restartPolicy: OnFailure
      template:
        metadata:
          labels:
            app: tfjob1
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: app
                    # operator: Exists
                    operator: In
                    values:
                    - tfjob1
                topologyKey: "kubernetes.io/hostname"
          containers:
          - name: tensorflow
            image: pmohan77/tfjob:tf_cuda
            imagePullPolicy: Always
            command:
              - flexdirect
              - run
              - --num_gpus=1
              - --partial=0.25
              - python
              - /opt/tf-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              - --variable_update=parameter_server
              - --model=resnet50
              - --batch_size=32
              - --num_batches=100
              - --data_format=NCHW
              - --device=gpu
            volumeMounts:
              - name: bitfusionio
                mountPath: /etc/bitfusionio
                readOnly: true
          volumes:
            - name: bitfusionio
              hostPath:
                path: /etc/bitfusionio

 

Appendix B: Dockerfile used for the Solution

#
# To use this Dockerfile, use the docker build command.
# See https://docs.docker.com/engine/reference/builder/
# for more information.
#

FROM kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3

RUN apt-get update && apt-get install -y --no-install-recommends \
    vim wget git \
    && \
    rm -rf /var/lib/apt/lists/*

# Install flexdirect v1.11.7 first and then upgrade flexdirect to v1.11.9
RUN echo "91.189.88.173   archive.ubuntu.com" >> /etc/hosts \
  && \
  echo "91.189.88.174   security.ubuntu.com" >> /etc/hosts \
  && \
  echo "192.229.211.70  developer.download.nvidia.com" >> /etc/hosts \
  && \
  echo "13.224.29.66 getfd.bitfusion.io" >> /etc/hosts \
  && \
  echo "52.216.204.37 s3.amazonaws.com" >> /etc/hosts \
  && \
  cd /tmp && wget -O installfd getfd.bitfusion.io \
  && \
  chmod +x installfd \
  && \
  ./installfd -v fd-1.11.7 -- -s -m binaries \
  && \
  export BF_SILENT_INSTALL=1 \
  && \
  export USE_TTY="" \
  && \
  export BF_INSTALL_FDMODE_COMPATIBILITY="binaries" \
  && \
  flexdirect upgrade

ENTRYPOINT ["/opt/launcher.py"]