VMware’s Performance team creates and maintains various tools to help troubleshoot customer issues. The newest of these uses artificial intelligence to quickly diagnose storage problems from vast amounts of log data. What used to take us days now takes seconds: PerfPsychic analyzes storage system performance and finds performance bottlenecks using deep learning algorithms.
Let’s examine the benefit the artificial intelligence (AI) models in PerfPsychic bring when we troubleshoot vSAN performance issues. Our trained AI module takes less than 1 second to analyze a vSAN log and pinpoint performance bottlenecks, with an accuracy rate of more than 91%. In contrast, when analyzed manually, a vSAN service request (SR) ticket takes a seasoned performance engineer about one week to deescalate, with durations ranging from 3 to 14 days. AI also beats traditional analysis algorithms, raising the accuracy rate from around 80% to more than 90%.
Architecture
The AI module has two operation modes: off-line training and real-time prediction. In training mode, sets of training data, labeled with their performance issues, are automatically fed to all candidate convolutional neural network (CNN) [1] structures, which we train repeatedly on GPU-enabled servers. We train thousands of models at a time and promote the one that achieves the best accuracy to the real-time system. In prediction mode, unlabeled user data are sent to the model chosen in the training stage, which predicts the root cause (the faulty component).
As shown in Figure 1, data in both training and prediction modes are first sent to a data preparation module (Queried Data Preparation), where they are formatted for later stages. The data path then diverges. Let’s first follow the dashed line, the path of labeled training data. These data are sent to the deep learning training module (DL Model Training) to train an ensemble of thousands of CNNs generated from our carefully designed structures. After thousands of passes over the training data, once the training accuracy rate has converged to a stable value, the trained CNNs compete with each other in the deep learning model selection module (DL Model Selection), where they must predict the root causes of testing data the models have never seen before. Their predictions are compared to the real root causes, labeled by human engineers, to calculate the testing accuracy rate. Finally, we promote the ensemble of models (Trained DL Model) that achieves the best testing accuracy to the real-time prediction system.
Figure 1: Deep Learning Module Workflow
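To make the train-then-select workflow concrete, here is a minimal sketch in PyTorch. The candidate structures, tensor shapes, and helper names are illustrative assumptions for this post, not PerfPsychic’s actual models or features.

```python
# Minimal sketch of the train-then-select loop: train every candidate CNN
# structure, then keep the one with the best accuracy on held-out data.
# Shapes and candidate structures are illustrative assumptions.
import torch
import torch.nn as nn

def make_candidate(num_filters, kernel_size, num_classes=4):
    """One candidate CNN over fixed-length metric sequences of shape (N, 1, L)."""
    return nn.Sequential(
        nn.Conv1d(1, num_filters, kernel_size, padding=kernel_size // 2),
        nn.ReLU(),
        nn.AdaptiveAvgPool1d(1),
        nn.Flatten(),
        nn.Linear(num_filters, num_classes),  # one class per suspect component
    )

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

def train_and_select(train_x, train_y, test_x, test_y, candidates, epochs=200):
    """DL Model Training plus DL Model Selection: the winner is the model
    that best predicts labels on testing data it has never seen."""
    best_model, best_acc = None, 0.0
    for num_filters, kernel_size in candidates:
        model = make_candidate(num_filters, kernel_size)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):  # iterate until training accuracy converges
            opt.zero_grad()
            loss_fn(model(train_x), train_y).backward()
            opt.step()
        acc = accuracy(model, test_x, test_y)
        if acc > best_acc:
            best_model, best_acc = model, acc
    return best_model, best_acc
```

In practice this loop runs over thousands of candidates in parallel on GPU servers; only the winning ensemble is promoted to the real-time system.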
As you might expect, this training process is both time consuming and resource hungry, so it is carried out off-line on servers equipped with powerful GPUs. Prediction mode, in contrast, is relatively lightweight and suits real-time applications.
Following the solid line in Figure 1 for prediction mode, the unlabeled, normalized user data are sent to our carefully picked models, which predict the root cause (Performance Exception) with only a small amount of computation. The prediction is returned to the upper layer, such as our interactive analytic web UI or our automatic and proactive analysis applications. The interactive analytic web UI also provides a means of manually validating predictions, and each validation automatically triggers the next round of model training. This completes the feedback loop and ensures our models continue to learn from human feedback.
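The feedback loop can be pictured roughly as follows. This is a minimal sketch with hypothetical function and queue names; it is not the PerfPsychic service code.

```python
# Sketch of the prediction path and human-feedback loop described above.
# All names here (normalize, handle_request, on_human_validation) are
# hypothetical stand-ins, not PerfPsychic APIs.
import queue

training_queue = queue.Queue()  # newly labeled samples for the next training round

def normalize(raw_metrics):
    """Stand-in for the data preparation step (Queried Data Preparation)."""
    peak = max(raw_metrics) or 1.0
    return [v / peak for v in raw_metrics]

def handle_request(model, raw_metrics):
    """Real-time path: prepare the data, then ask the chosen model for a
    root-cause prediction (the Performance Exception)."""
    features = normalize(raw_metrics)
    return model.predict([features])[0]

def on_human_validation(features, correct_label):
    """Web-UI callback fired when an engineer validates or corrects a
    prediction; the labeled sample feeds the next round of model training."""
    training_queue.put((features, correct_label))
```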
AI Wins Over Manual Debugging
Diagnosing performance problems in a software-defined datacenter (SDDC) is difficult due to the scale of both the systems and the data. The scale of the software and hardware systems results in complicated behaviors that are not only workload-dependent but also interfere with each other, so pinpointing a root cause requires a thorough examination of the entire datacenter. However, given the volume of data collected across a datacenter, this analysis requires significant human effort, takes an extremely long time, and is prone to errors. Take vSAN for example: dealing with performance-related escalations typically requires cross-departmental efforts examining vSAN stacks, ESXi stacks, and physical/virtual network stacks. In some cases, it has taken many engineers months to pinpoint problems outside of the VMware stack, such as physical network misconfigurations. On average, it takes one week and the combined effort of many experienced engineers to deescalate a client’s service request ticket.
PerfPsychic is designed to address these challenges and to make performance diagnostics more scalable. It builds upon a data infrastructure that is at least 10 times faster and 100 times more scalable than the existing one, and it provides an end-to-end interactive analytic UI that lets users perform the majority of the analysis in one place. Analysis results are immediately fed back to the deep learning pipeline in the backend, which produces diagnostic models that detect faulty components more accurately as more feedback is collected. These models typically take only a few hours to train, and they can detect faulty components in a given dataset in a few milliseconds, with accuracy comparable to rules that took us months to tune manually.
AI Wins Over Traditional Algorithms
To prove the effectiveness of our AI approach, we tested it against traditional machine learning algorithms.
First, we created two datasets: training data and testing data, as summarized in Table 1.
Table 1: Training and Testing Data Properties
Training data are generated from our simulated environment: a simple 4-node hybrid vSAN setup. We manually inject performance errors into this environment to collect training data with accurate labels. For example, to create a network issue, we simulate packet drops by having vmkernel drop one receive packet at the VMK TCP/IP layer for every N packets, which mimics packet drops in the physical network, and we vary N to produce enough data points for training. Although this does not perfectly reproduce what happens in a customer environment, it is still a sound practice, since it is the only cost-effective way to obtain a large volume of clean, accurately labeled data.
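To illustrate how varying N multiplies the labeled data, here is a toy sketch of the idea. The drop intervals and label name are illustrative; the real injection happens inside vmkernel’s TCP/IP stack, not in Python.

```python
# Toy sketch of the labeled-data generation described above: drop one packet
# in every N and vary N. Values and the label name are illustrative only.
def drop_every_nth(packets, n):
    """Keep each packet unless its 1-based index is a multiple of n."""
    return [p for i, p in enumerate(packets, start=1) if i % n != 0]

def generate_training_points(packets, drop_intervals=(10, 50, 100, 500)):
    """Each choice of N yields one accurately labeled 'network issue' sample."""
    points = []
    for n in drop_intervals:
        delivered = drop_every_nth(packets, n)
        loss_rate = 1.0 - len(delivered) / len(packets)
        points.append({"drop_interval": n,
                       "loss_rate": loss_rate,
                       "label": "network_issue"})
    return points
```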
The testing data, in contrast, come entirely from customer escalations, whose system configurations differ in many respects (number of hosts, types and numbers of disks, workloads, and so on). In our testing data, 78.1% of the data are labeled with performance issues. Note that a “performance issue” refers to a specific component in the system that is causing the performance problem in the dataset. We define “accuracy” as the percentage of components from the testing datasets to which the model assigns the correct label (“issue” or “no issue”).
With the same training data, we trained one CNN and four popular machine learning models: Support Vector Machine (SVM) [2], Logistic Classification (LOG) [3], Multi-layer Perceptron Neural Network (MLP) [4], and Multinomial Naïve Bayes (MNB) [5]. We then tested the five models against the testing dataset. To quantify model performance, we calculate accuracy as follows:

Accuracy = (number of correctly labeled components) / (total number of components in the testing datasets)
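For reference, the four traditional baselines map directly onto scikit-learn classifiers, as in the sketch below. The random feature matrices stand in for the real vSAN features, which are not shown here; the CNN is trained separately, as sketched earlier.

```python
# Sketch of the four traditional baselines using scikit-learn. The random
# data below is a placeholder for the real (undisclosed) vSAN features.
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 20)), rng.integers(0, 2, 200)
X_test, y_test = rng.random((50, 20)), rng.integers(0, 2, 50)

baselines = {
    "SVM": SVC(),
    "LOG": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=1000),
    "MNB": MultinomialNB(),  # requires non-negative features
}

for name, model in baselines.items():
    model.fit(X_train, y_train)
    print(f"{name}: {accuracy_score(y_test, model.predict(X_test)):.1%}")
```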
Finally, we compared the accuracy rate achieved by each model, as shown in Figure 2. The results reveal that AI is a clear winner, with 91% accuracy.
Figure 2: Analytic Algorithm Accuracy Comparison
Acknowledgments
We appreciate the assistance and feedback from Chien-Chia Chen, Amitabha Banerjee, and Xiaobo Huang. We are also grateful for the support of our manager, Rajesh Somasundaran. Lastly, we thank Julie Brodeur for her help in reviewing and making recommendations for this blog post.
References
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” CoRR, abs/1409.4842, 2014.
- A. J. Smola and B. Schölkopf, “A Tutorial on Support Vector Regression,” Statistics and Computing, Volume 14, Issue 3, August 2004, pp. 199-222.
- C. M. Bishop, “Pattern Recognition and Machine Learning,” Chapter 4.3.4.
- D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” http://www.iro.umontreal.ca/~pift6266/A06/refs/backprop_old.pdf.
- H. Zhang, “The optimality of Naive Bayes,” Proc. FLAIRS, 2004, http://www.cs.unb.ca/~hzhang/publications/FLAIRS04ZhangH.pdf.