The Architecture and Workflow of VMware Performance Analytics

Tags: Machine Learning, Analytics, Hybrid-Cloud, Performance, Virtualization

How VMware Diagnoses Performance Issues with Machine Learning in a Hybrid-Cloud Environment

By the VMware Performance Analytics Team (Amitabha Banerjee, Chien-Chia Chen, Chien-Chun Hung, Xiaobo Huang, Yifan Wang, Razvan Chevesaran, Rajesh Somasundaran)


As one of the top hybrid-cloud providers, VMware has the largest enterprise software-defined datacenter (SDDC) deployments, and making sure those deployments perform well is our top priority. However, performance diagnostics in hybrid-cloud enterprise software deployments is extremely difficult, mainly because of complex interactions among system components and the sheer number of system-wide performance metrics to track.

At VMware, we leverage machine learning (ML) techniques to develop a performance analytics solution for diagnosing hybrid-cloud enterprise software deployments. The figure below depicts the architecture of our solution.

[Figure: Architecture of the ML-based performance analytics solution]

As shown in the upper left of the figure in dark gray (CEIP/Data), when a customer opts into the VMware Customer Experience Improvement Program (CEIP), their enterprise deployment of the VMware SDDC stack (shown on the left in black) continuously generates performance-metric time series across all layers of the stack. These data are sent to the performance diagnostics service in the VMware Analytics Cloud (VAC) and stored in a data lake, as shown at the top center of the figure.
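
To make this data flow concrete, the sketch below shows what a single performance-metric sample might look like when serialized for upload; the schema, field names, and metric names are hypothetical illustrations, not the actual CEIP format.

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class MetricSample:
    """One sample of a performance-metric time series (hypothetical schema)."""
    deployment_id: str  # anonymized SDDC deployment identifier
    layer: str          # SDDC layer the metric comes from, e.g., "esxi" or "vsan"
    metric: str         # metric name, e.g., "cpu.ready.pct"
    timestamp: float    # collection time, UNIX epoch seconds
    value: float        # observed metric value

# A small batch as it might be serialized before upload to the data lake.
batch = [
    MetricSample("sddc-0042", "esxi", "cpu.ready.pct", time.time(), 3.7),
    MetricSample("sddc-0042", "vsan", "disk.latency.ms", time.time(), 1.2),
]
print(json.dumps([asdict(s) for s in batch], indent=2))
```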

Below the VMware Analytics Cloud, in blue text, you can see ML. In this stage of the workflow, ML scientists at VMware develop and experiment with various ML models. Under Ops (green text), the models are trained through a series of automated operations, such as feature selection, data curation, and model training, before they are published to the model store (upper right). The performance diagnostics service applies the stored ML models to the CEIP data for various diagnostics tasks and sends the results (gray arrow on the left) back to the users at the enterprise deployments. Users' feedback on the diagnostics results is recorded in a feedback database, along with other feedback from VMware Support Insight. This feedback is then used to continuously retrain our ML models and keep them up to date.
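
As a rough illustration of this Ops stage, the following sketch chains automated feature selection with model training and persists the result; the toy data, estimator choices, and model-store path are assumptions for the example, not our production pipeline.

```python
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

# Toy stand-in for curated training data: rows are deployments, columns are
# candidate performance metrics, and labels mark expert-confirmed anomalies.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
y = (X[:, 3] + 0.5 * X[:, 17] > 1.0).astype(int)

# Automated feature selection followed by model training, mirroring the
# feature-selection and model-training operations described above.
pipeline = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipeline.fit(X, y)

# Persist the trained pipeline; a local directory stands in for the model store.
Path("model_store").mkdir(exist_ok=True)
joblib.dump(pipeline, "model_store/perf_diagnostics_v1.joblib")
```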

We addressed several unique challenges in building this ML-based performance analytics solution:

  • Decoupling model deployment from software updates
  • Handling performance drifts resulting from software updates
  • Selecting model features based on both domain expertise and automated processes
  • Balancing false positives against false negatives in performance-anomaly alarms (a threshold-tuning sketch follows this list)
  • Root-causing the detected performance anomalies
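
To make the false-positive/false-negative trade-off concrete, here is a minimal threshold-tuning sketch on synthetic anomaly scores; the detector output, labels, and F-beta cost weighting are assumptions for the example, not our production alarm logic.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic anomaly scores from a detector, with ground-truth labels
# (1 = confirmed performance anomaly). Both arrays are made up.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=1000)
scores = np.where(labels == 1,
                  rng.normal(0.7, 0.2, 1000),
                  rng.normal(0.3, 0.2, 1000))

precision, recall, thresholds = precision_recall_curve(labels, scores)

# Choose the alarm threshold via an F-beta score; beta > 1 weights false
# negatives (missed anomalies) more heavily than false positives (noise).
beta = 2.0
f_beta = ((1 + beta**2) * precision * recall
          / np.maximum(beta**2 * precision + recall, 1e-12))
best = int(np.argmax(f_beta[:-1]))  # thresholds has one fewer entry
print(f"alarm threshold = {thresholds[best]:.3f}, "
      f"precision = {precision[best]:.2f}, recall = {recall[best]:.2f}")
```

Raising beta trades noisier alarms for fewer missed anomalies; lowering it does the opposite.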

Some highlights from our experience operating this ML-based performance analytics solution in an enterprise production environment include:

  • Integrating users' feedback to continuously retrain the models and keep them up to date
  • Understanding the importance of visualized monitoring and automation mechanisms
  • Creating an orchestrated environment for experimenting with and validating model behavior
  • Capturing a performance drift issue through our analytics solution (a simplified drift check is sketched below)
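
As a toy illustration of that last highlight, the sketch below flags a shift in a metric's distribution before and after a software update using a two-sample Kolmogorov-Smirnov test; the data, metric, and significance threshold are invented, and the full paper describes how the real drift issue was caught.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic example: the same performance metric collected before and after
# a hypothetical software update that shifts its distribution upward.
rng = np.random.default_rng(2)
latency_before = rng.normal(loc=10.0, scale=2.0, size=2000)  # e.g., I/O latency (ms)
latency_after = rng.normal(loc=11.5, scale=2.0, size=2000)   # drifted after update

# A two-sample Kolmogorov-Smirnov test flags the distribution shift; such a
# signal would prompt investigation and retraining of the affected models.
stat, p_value = ks_2samp(latency_before, latency_after)
if p_value < 0.01:
    print(f"performance drift detected (KS = {stat:.3f}, p = {p_value:.2g})")
else:
    print("no significant drift detected")
```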

To learn more, please see our full paper published at OpML '20, as well as the presentation slides and the video.

Acknowledgements

We thank the following:

  • VMware Analytics Cloud team
  • VMware Support Insight team
  • VMware Research group: Parikshit Gopalan, Udi Wieder

References
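
  • Amitabha Banerjee, Chien-Chia Chen, Chien-Chun Hung, Xiaobo Huang, Yifan Wang, and Razvan Chevesaran. "Challenges and Experiences with MLOps for Performance Diagnostics in Hybrid-Cloud Enterprise Software Deployments." In Proceedings of the 2020 USENIX Conference on Operational Machine Learning (OpML '20).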