By Chien-Chia Chen, Tzu Yi Chung, Chengzhi Huang, Paul Huang, and Rushita Thakkar
VMware’s Performance Engineering team develops and operates many critical machine learning (ML) services across the VMware product portfolio. As these ML services scale, operational challenges multiply: multi-cloud operations, monitoring and alerting for the offline ML pipeline, and monitoring and alerting for online inference services, models, and data. This blog shares how we leverage VMware Tanzu and an open-source ML stack to streamline the operation of multi-cloud ML services.
ML plays an important role at VMware, in both our SaaS operations and the product development lifecycle. In an earlier post, we presented how we leverage VMware products to scale the performance analytics data infrastructure beyond 100 million time series (numerical readings of performance metrics measured periodically). Beyond the data infrastructure, further operational challenges arise as we scale the downstream ML services. Figure 1 shows our typical ML service lifecycle, which consists of offline components hosted in on-premises clusters and online inference services that may run on any cloud close to the target application. Each of these components presents different operational challenges at scale.
Figure 1. Multi-cloud ML operations (MLOps) lifecycle
The two primary offline components are feature preprocessing and model training. The preprocessing jobs query features from the feature store, which is a part of the data infrastructure, and stream the transformed features to the training jobs. Kubeflow orchestrates these preprocessing and training jobs.
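The preprocessing-to-training handoff can be sketched with Python generators, which let transformed features stream to the consumer without materializing the whole dataset. This is a simplified illustration, not our pipeline code: `query_feature_rows`, `preprocess`, and `train` are hypothetical stand-ins for the feature-store query, the transformation step, and the training job that Kubeflow orchestrates.

```python
from typing import Dict, Iterator, List

# Hypothetical sketch: these function names are placeholders for the
# feature-store query, preprocessing, and training stages.
def query_feature_rows(feature_names: List[str]) -> Iterator[Dict[str, float]]:
    """Stand-in for a feature-store query that yields rows lazily."""
    for i in range(3):  # pretend these three rows come from the feature store
        yield {name: float(i) for name in feature_names}

def preprocess(rows: Iterator[Dict[str, float]]) -> Iterator[List[float]]:
    """Transform each raw feature row into a training-ready vector."""
    for row in rows:
        # Toy normalization; real transforms are per-feature.
        yield [v / 10.0 for v in sorted(row.values())]

def train(batches: Iterator[List[float]]) -> int:
    """Consume transformed features as they stream in; returns examples seen."""
    return sum(1 for _ in batches)

# Nothing runs until train() pulls on the chained generators.
n = train(preprocess(query_feature_rows(["cpu_usage", "mem_usage"])))
```

Because each stage is an iterator, the training step starts consuming features as soon as the first row is transformed, mirroring the streaming handoff between the two jobs.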
The biggest operational challenge when running preprocessing and training jobs at scale is to monitor and handle failures properly so that jobs do not get stuck indefinitely. Never-ending jobs were once a common problem in our production cluster, whether caused by bugs in the code (such as a wrong API token) or by cluster misconfigurations (such as missing required persistent volumes).
We channel relevant Kubernetes pod metrics, such as resource utilization (CPU, memory, network) and time spent in each state (pending, failed, running, etc.), to Prometheus, a monitoring system and time-series database. A set of Grafana dashboards visualizes these stats; Figure 2 shows an example of the training job dashboard. We also configured a set of Prometheus rules to warn or alert when failures that require engineer intervention occur, such as an overloaded cluster or a job that remains in one state for an unreasonably long time (pending for hours or running for days).
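The stuck-job checks boil down to comparing a job’s time-in-state against per-state limits. The sketch below shows that logic in plain Python; the thresholds are hypothetical, and in production the equivalent checks are expressed as Prometheus rules over the pod-state metrics rather than application code.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; production values are tuned per cluster and
# encoded as Prometheus alerting rules, not Python.
MAX_PENDING = timedelta(hours=2)
MAX_RUNNING = timedelta(days=2)

def needs_intervention(state: str, since: datetime, now: datetime) -> bool:
    """Flag a job stuck in one state for an unreasonably long time."""
    age = now - since
    if state == "pending":
        return age > MAX_PENDING   # e.g., waiting hours on a missing volume
    if state == "running":
        return age > MAX_RUNNING   # e.g., a never-ending training loop
    return state == "failed"       # failed jobs always need a look

now = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)
stuck = needs_intervention("pending", now - timedelta(hours=5), now)  # True
ok = needs_intervention("running", now - timedelta(hours=1), now)     # False
```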
Figure 2. Dashboard for monitoring training jobs
Once the models are trained, the continuous integration and continuous deployment (CI/CD) pipeline pushes the model image to the model registry on Harbor and, when specified, reconfigures the downstream inference services to promote the models to production. The primary operational challenge in this phase is that the inference services must run close to the target product, which can be either on-premises in VMware’s data centers or in a public cloud such as VMware Cloud on AWS. To streamline this multi-cloud deployment and operation, we package all the required Kubernetes resources using the Carvel tools. The same Carvel package repository can then be deployed on any cloud through VMware Tanzu. Once deployed, the downstream inference services are managed by CI/CD jobs that load the required model images to serve inference requests.
The two most critical operational metrics of the online inference services are service uptime and inference response time. Our inference services are built on the open-source Seldon Core, so we take advantage of the seldon-core-analytics library to ingest these metrics into Prometheus. Several Prometheus rules alert service owners when, for example, a service outage occurs or sustained high inference latency is detected.
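In the Seldon Core Python wrapper, a model is a class with a `predict` method and an optional `metrics` method whose return values Seldon exposes to Prometheus. The sketch below shows that shape with a placeholder model; the metric names and the model logic are illustrative, not our production service.

```python
import time

class PerfModel:
    """Sketch of a Seldon Core Python-wrapper model that reports custom
    latency metrics alongside seldon-core-analytics' built-in request
    metrics. The inference logic is a placeholder."""

    def __init__(self):
        self._last_latency_ms = 0.0

    def predict(self, X, names=None, meta=None):
        start = time.perf_counter()
        result = [[sum(row)] for row in X]  # placeholder "inference"
        self._last_latency_ms = (time.perf_counter() - start) * 1000.0
        return result

    def metrics(self):
        # Seldon collects this list after each request and exposes the
        # entries as Prometheus metrics.
        return [
            {"type": "GAUGE", "key": "inference_latency_ms",
             "value": self._last_latency_ms},
            {"type": "COUNTER", "key": "inference_requests_total",
             "value": 1},
        ]

model = PerfModel()
preds = model.predict([[1.0, 2.0], [3.0, 4.0]])
```

The Prometheus rules then fire on these series, e.g. when the latency gauge stays high or the request counter flatlines during an outage.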
Figure 3. Dashboard for monitoring inference services
In addition to the above challenges in operating individual components of the MLOps pipeline, it is equally challenging to monitor and react to data distribution drift and model performance drift over time. Our paper presented at the Data-Centric AI Workshop ’21 describes these challenges and the approaches we take to tackle them in detail. In a nutshell, we monitor the KL-divergence of the monthly data distributions, as shown in Figure 4. When the data distribution shifts beyond a certain threshold, model retraining is triggered automatically. We then perform a series of offline evaluations before promoting a new model to production.
Figure 4. KL-divergence of monthly data distribution
As we see increasing demand for running applications on any cloud, we have shared in this blog how VMware Tanzu helps us streamline the multi-cloud operation of our MLOps pipeline. Figure 5 summarizes the end-to-end tech stack of our production performance analytics MLOps pipeline, which is packaged as a Carvel package repository that can be deployed and operated through VMware Tanzu. Beyond multi-cloud operation, it is important to monitor and alert on each component throughout the end-to-end MLOps lifecycle, from Kubernetes pods and inference services to data and model performance. Our multi-year experience shows that for an ML service or product to succeed long term, all of these operational challenges must be addressed seriously.
Figure 5. MLOps platform on VMware Tanzu
X. Huang, A. Banerjee, C.-C. Chen, C. Huang, T. Y. Chuang, A. Srivastava, R. Cheveresan, “Challenges and Solutions to build a Data Pipeline to Identify Anomalies in Enterprise System Performance,” in Data-Centric AI Workshop, 35th Conference on Neural Information Processing Systems (NeurIPS), December 2021, virtual.