Future of AI/ML Compute is Multi-Cloud:
As enterprises evolve their compute strategies in the cloud era, they are connecting to multiple cloud providers based on their unique requirements. Enterprises are looking to leverage the distinct capabilities offered by different cloud providers and build a multi-cloud datacenter. Data processing, training, and inference for machine learning are proliferating across multiple clouds.
Figure 1: Multi-Cloud AI/ML
An AI-Ready Enterprise Class Virtualization Platform:
VMware provides a robust, AI-Ready Enterprise virtualization platform. It combines a modern, developer-ready Tanzu Kubernetes platform with support for advanced hardware such as GPUs and SmartNICs.
Figure 2: Developer and Data Scientist ready infrastructure with VMware vSphere with Tanzu
Data scientists and developers can interact directly with the platform through Tanzu Kubernetes services, while the VI admin can operationalize and manage the same infrastructure through the unified VMware platform. The VMware platform extends across cloud boundaries and runs on all popular public clouds. This makes VMware an ideal multi-cloud platform for AI/ML workloads.
Traditional Machine Learning:
Traditionally, machine learning is isolated, single-task learning: knowledge is not retained or accumulated across tasks. The learning process uses existing data to attain knowledge and applies it only to the task at hand.
Figure 3: Traditional machine learning. (Source: towardsdatascience.com)
In this traditional ML scenario, the two learning systems are independent of each other, with no sharing of commonalities.
Humans have an inherent ability to transfer knowledge across tasks: what we learn while mastering one task, we reuse to solve related tasks. The same concept can be applied to machine learning when the tasks are similar in nature. In many instances there is an abundance of general data, while domain-specific data for a similar category is sparse.
Figure 4: Knowledge is transferred between the learning systems. (Source: towardsdatascience.com)
Image data is a good example: a general-purpose model can be created from vast amounts of data, then repurposed and fine-tuned for a specific image recognition task where data is not abundant. Transfer learning is this concept: the general-purpose trained model is adapted and retrained with domain-specific data, then used through inference to classify images from that domain. The learning of new tasks thus relies on knowledge gained from previously learned tasks.
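The fine-tuning pattern described above can be sketched in Keras: freeze a backbone trained on a large general dataset and attach a fresh classification head for the domain-specific classes. This is a minimal, illustrative sketch; the MobileNetV2 backbone, input size, and hyperparameters are assumptions, not the specific model used in this solution.

```python
import tensorflow as tf

def build_transfer_model(num_classes, weights="imagenet"):
    """Adapt a general-purpose image model to a new, data-sparse domain."""
    # Backbone pretrained on a large general dataset (e.g. ImageNet).
    base = tf.keras.applications.MobileNetV2(
        include_top=False, weights=weights,
        input_shape=(224, 224, 3), pooling="avg",
    )
    base.trainable = False  # freeze the general-purpose feature extractor

    # Fresh classification head for the domain-specific classes.
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_transfer_model(num_classes=5)
# model.fit(domain_images, domain_labels, epochs=10)  # fine-tune on sparse data
```

Only the small new head is trained initially, which is why transfer learning works even when the domain-specific dataset is small.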
Proof of Concept:
The proof of concept for this solution is depicted in the figure below.
Figure 5: Steps leveraged in the proof of concept for the solution
A large, general dataset is used to create a generalized trained model on-premises, using consolidated, ML-optimized infrastructure such as GPUs and network accelerators. The training leveraged Horovod-based distributed machine learning on the ImageNet dataset.
Figure 6: Details about the general dataset and the training
The training dataset contained more than 10,000 classes in all. Classes with fewer than 500 images were removed from the training dataset; some of these removed classes later became the domain-specific dataset for transfer learning. The training was done across multiple iterations called epochs; 48 epochs were used, and training ran over several days per use case.
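The dataset split described above, removing small classes from training and holding them out as domain-specific data, can be sketched as a simple filter. The function name and the 500-image threshold parameter are illustrative, not taken from the PoC code.

```python
from collections import Counter

def split_by_class_size(samples, min_images=500):
    """Split (image_path, class_label) pairs into a general training set
    and a held-out set of small classes for later transfer learning."""
    counts = Counter(label for _, label in samples)
    train, held_out = [], []
    for path, label in samples:
        if counts[label] >= min_images:
            train.append((path, label))       # class is big enough to train on
        else:
            held_out.append((path, label))    # candidate domain-specific class
    return train, held_out
```

Holding out whole classes (rather than a random sample) ensures the transfer-learning target was never seen during general training.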
The most accurate model from the training is then used as the starting point for transfer learning. The model is extracted and moved to an AWS S3 bucket that is accessible to the Amazon SageMaker environment, where the transferred model is imported for use. An unused, domain-specific class is used for transfer learning to fine-tune the transferred model. The fine-tuned model is then packaged and deployed for inference on Amazon SageMaker endpoints.
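Staging the on-premises model for SageMaker can be sketched as packaging the artifacts and uploading them to S3. The helper names, bucket, and key below are hypothetical; the tarball convention and `upload_file` call follow standard SageMaker/boto3 usage.

```python
import tarfile

def package_model(model_dir, archive_path="model.tar.gz"):
    """SageMaker expects model artifacts as a gzipped tarball."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_dir, arcname=".")
    return archive_path

def s3_model_uri(bucket, key):
    """Build the s3:// URI that SageMaker consumes as ModelDataUrl."""
    return f"s3://{bucket}/{key}"

def upload_model(archive_path, bucket, key):
    import boto3  # imported lazily so the helpers work without boto3 installed
    boto3.client("s3").upload_file(archive_path, bucket, key)
    return s3_model_uri(bucket, key)

# uri = upload_model(package_model("trained_model/"),
#                    "my-ml-bucket", "models/model.tar.gz")
# The returned URI is then referenced when creating the SageMaker model,
# which is fine-tuned and deployed to an endpoint for inference.
```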
In part 2 of this blog series, we will look at leveraging Amazon SageMaker for transfer learning and edge deployments.