Virtualization

VMware’s AI-based Performance Tool Can Improve Itself Automatically

PerfPsychic, our AI-based performance analysis tool, improves its accuracy from 21% to 91% when debugging vSAN performance issues as it receives more data and training. Better still, PerfPsychic continuously improves itself, and its tuning procedure is automated. Let’s examine how we achieve this in the following sections.

How to Improve AI Model Accuracy

Three elements have a major impact on the training results of deep learning models: the amount of high-quality training data, well-configured hyperparameters that control the training process, and sufficient yet acceptable training time. In the following examples, we use the same training and testing datasets that we presented in our previous blog post.

Amount of Training Data

The key to PerfPsychic is the effectiveness of its deep learning pipeline, so we start by gradually adding more labeled data to the training dataset. This demonstrates how our models learn from additional labeled data and improve their accuracy over time. Figure 1 shows the results: we start with only 20% of the training dataset and label 20% more at each step. There is a clear trend: as more properly labeled data is added, the model learns and improves its accuracy without any further human intervention. Accuracy rises from around 50% with only about 1,000 data points to 91% with the full set of 5,275 data points, which is as good as a programmatic analytic rule that took us three months to tune manually.

Figure 1. Accuracy improvement over larger training datasets
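As a rough illustration of this experiment, the sketch below trains the same small network on growing fractions of a labeled dataset and reports held-out accuracy at each step. The synthetic data and tiny model are stand-ins for our vSAN traces and CNNs; this is not the PerfPsychic code itself.

```python
# Minimal sketch of a data-scaling sweep (synthetic stand-ins, not PerfPsychic code).
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(5275, 20)                         # stand-in feature vectors
y = (X[:, :3].sum(dim=1) > 0).long()              # synthetic binary labels
X_tr, y_tr, X_te, y_te = X[:4200], y[:4200], X[4200:], y[4200:]

def train_and_score(n_labeled):
    # Train a small stand-in network on the first n_labeled points,
    # then report accuracy on the held-out split.
    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.SGD(model.parameters(), lr=5e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(75):                           # fixed iteration budget
        opt.zero_grad()
        loss_fn(model(X_tr[:n_labeled]), y_tr[:n_labeled]).backward()
        opt.step()
    with torch.no_grad():
        return (model(X_te).argmax(dim=1) == y_te).float().mean().item()

for fraction in (0.2, 0.4, 0.6, 0.8, 1.0):
    n = int(len(X_tr) * fraction)
    print(f"{int(fraction * 100):3d}% of labels ({n:4d} points) -> accuracy {train_and_score(n):.3f}")
```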

Training Hyperparameters

We next vary several other CNN hyperparameters to demonstrate how they were selected for our models, changing only one hyperparameter at a time and training 1,000 CNNs with each configuration. We first vary the number of training iterations, that is, how many times we pass over the training dataset. If the number of iterations is too small, the CNNs cannot be trained adequately; if it is too large, training takes much longer and may end up overfitting to the training data. As shown in Figure 2, 50 to 75 iterations is the best range, with 75 iterations achieving the best accuracy of 91%.

Figure 2. Number of training iterations vs. accuracy
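A minimal sketch of such an iteration sweep is shown below, using a small scikit-learn classifier on synthetic data as a stand-in for our CNNs. Comparing training and test accuracy across iteration budgets is one way to spot both under-training and overfitting.

```python
# Minimal sketch of an iteration sweep (synthetic stand-ins, not PerfPsychic code).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=40, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for n_iter in (10, 25, 50, 75, 150):
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=n_iter, random_state=0)
    clf.fit(X_tr, y_tr)          # may warn about non-convergence at small budgets
    print(f"{n_iter:4d} iterations: train {clf.score(X_tr, y_tr):.3f}, "
          f"test {clf.score(X_te, y_te):.3f}")
# A growing gap between training and test accuracy at large iteration counts
# is the overfitting signal described above.
```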

We next vary the step size, which is the granularity of our search for the best model. In practice, with a small step size the optimization is so slow that it cannot reach the optimal point within a limited time, while with a large step size we risk stepping over optimal points. Figure 3 shows that between 5e-3 and 7.5e-3 the model produces good accuracy, with 5e-3 predicting 91% of the labels correctly.

Figure 3. Step size vs. accuracy
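Assuming the step size plays the role of the optimizer's learning rate, a sweep over it might look like the sketch below; the grid of values, the data, and the model are all illustrative.

```python
# Minimal sketch of a step-size (learning-rate) sweep under a fixed iteration budget.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(2000, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()          # synthetic binary labels
X_tr, y_tr, X_te, y_te = X[:1500], y[:1500], X[1500:], y[1500:]

for lr in (1e-4, 1e-3, 5e-3, 7.5e-3, 5e-2):
    model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(75):                           # same budget for every step size
        opt.zero_grad()
        loss_fn(model(X_tr), y_tr).backward()
        opt.step()
    with torch.no_grad():
        acc = (model(X_te).argmax(dim=1) == y_te).float().mean().item()
    print(f"step size {lr:7.4f} -> test accuracy {acc:.3f}")
```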

We last evaluate the impact of the issue rate of the training data on accuracy. The issue rate is the percentage of the training data that represents performance issues. In an ideal training set, all labels should be equally represented to avoid overfitting; a biased dataset generally yields an overfitting model that can hardly achieve high accuracy. Figure 4 below shows that when the training data has an issue rate below 20% (that is, fewer than 20% of the components are faulty), the model essentially overfits to the "no issue" data points and predicts that every component is issue-free. Because 21.9% of the components in our testing data have no issues, accuracy stays at 21.9%. Conversely, with an issue rate above 80%, the model simply treats every component as faulty and thus reaches 78.1% accuracy. This is why it is important to ensure that every label is equally represented, and why we mix our issue/no-issue data at a ratio between 40% and 60%.

Figure 4. Impact of issue rate
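One simple way to hit a target issue rate is to subsample the over-represented label before training. The helper below is a hypothetical sketch, not the PerfPsychic implementation; the label convention (1 = issue, 0 = no issue) is an assumption.

```python
# Hypothetical helper: rebalance a labeled dataset to a target issue rate.
import numpy as np

def rebalance(X, y, issue_rate, rng):
    """Subsample so that roughly `issue_rate` of the returned labels are 1 ("issue")."""
    issue_idx = np.flatnonzero(y == 1)
    clean_idx = np.flatnonzero(y == 0)
    n_issue = len(issue_idx)
    n_clean = int(n_issue * (1 - issue_rate) / issue_rate)
    if n_clean > len(clean_idx):          # not enough "no issue" samples: shrink the issue side
        n_clean = len(clean_idx)
        n_issue = int(n_clean * issue_rate / (1 - issue_rate))
    keep = np.concatenate([rng.choice(issue_idx, n_issue, replace=False),
                           rng.choice(clean_idx, n_clean, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy data with an 80% issue rate, rebalanced to roughly 50/50.
rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.8).astype(int)
X = rng.normal(size=(5000, 8))
X_bal, y_bal = rebalance(X, y, issue_rate=0.5, rng=rng)
print(f"issue rate before: {y.mean():.2f}, after: {y_bal.mean():.2f}")
```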

Training Duration

Training time is also an important factor in a practical deep learning pipeline design. Because we train thousands of CNN models, spending one second longer on each model makes a whole training phase take 1,000 seconds longer. Figure 5 below shows training time versus data size and number of iterations. Both factors show a linear trend: with more data and more iterations, training takes linearly longer. Fortunately, we know from the study above that more than 75 iterations does not improve accuracy, so by limiting the number of iterations we can complete a whole training phase in less than 9 hours. Again, once the offline training is done, the model can perform real-time prediction in just a few milliseconds; the training time only affects how often and how quickly the models can pick up new feedback from product experts.

Figure 5. Effect of data size and iteration on training time
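For a rough sense of that linear scaling, the sketch below times a fixed full-batch training loop over a small grid of data sizes and iteration counts; the tiny network and random data are again stand-ins.

```python
# Minimal sketch: measure how training time grows with data size and iterations.
import time
import torch
from torch import nn

torch.manual_seed(0)
for n_samples in (1000, 2000, 4000):
    X = torch.randn(n_samples, 40)
    y = (X[:, 0] > 0).long()
    for n_iter in (25, 50, 75):
        model = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 2))
        opt = torch.optim.SGD(model.parameters(), lr=5e-3)
        loss_fn = nn.CrossEntropyLoss()
        start = time.perf_counter()
        for _ in range(n_iter):
            opt.zero_grad()
            loss_fn(model(X), y).backward()
            opt.step()
        print(f"{n_samples:5d} samples, {n_iter:3d} iterations: "
              f"{time.perf_counter() - start:.3f}s")
```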

Automation

The model selection procedure is fully automated. Thousands of models with different hyperparameter settings are trained in parallel on our GPU-enabled servers. The trained models then compete with one another by analyzing our prepared testing data and reporting their results. We pick the model with the highest accuracy, put it into PerfPsychic, and use it for online analysis. We also keep a record of the hyperparameters of the winning models and use them as the initial setup for future training rounds, so our models keep evolving.
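Conceptually, the selection loop resembles the sketch below: train one model per hyperparameter setting, score each against the shared testing data, keep the winner, and carry its settings forward as the seed for the next round. The grid values and helper names are illustrative, and the sequential loop stands in for the parallel GPU runs.

```python
# Hypothetical sketch of automated model selection over a hyperparameter grid.
from itertools import product
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=3000, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

def train_and_score(n_iter, step_size):
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=n_iter,
                        learning_rate_init=step_size, random_state=0)
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

grid = {"iterations": [50, 75], "step_size": [5e-3, 7.5e-3]}
results = [((n, s), train_and_score(n, s))
           for n, s in product(grid["iterations"], grid["step_size"])]
(best_iters, best_step), best_acc = max(results, key=lambda r: r[1])
print(f"winner: {best_iters} iterations, step size {best_step} "
      f"-> accuracy {best_acc:.3f}")
# The winning setting would be persisted and used to seed the next training
# round, which is how the search keeps improving over time.
```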

PerfPsychic in Application

PerfPsychic is not only a research product but also a widely used internal performance analysis tool. It now automatically analyzes vSAN performance bugs filed in Bugzilla.

PerfPsychic automatically detects new vSAN performance bugs submitted to Bugzilla and extracts the usable data logs from the bug attachments. It then analyzes the logs with the trained models. Finally, the analysis results, including performance enhancement suggestions, are emailed to the bug submitter and the vSAN developer group.
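The flow can be summarized with the hypothetical outline below. Every function is an illustrative stub; the actual Bugzilla and email integration is internal and not shown.

```python
# Hypothetical outline of the bug-analysis flow; all functions are stubs.
def fetch_new_vsan_perf_bugs():
    """Stand-in for polling Bugzilla for newly filed vSAN performance bugs."""
    return [{"bug_id": 0, "logs": ["<attached performance log>"]}]

def analyze_logs(logs):
    """Stand-in for running the trained CNN models over the extracted logs."""
    return ["<performance enhancement suggestion>"]

def notify(bug_id, suggestions):
    """Stand-in for emailing the bug submitter and the vSAN developer group."""
    print(f"bug {bug_id}: " + "; ".join(suggestions))

for bug in fetch_new_vsan_perf_bugs():
    notify(bug["bug_id"], analyze_logs(bug["logs"]))
```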

Below is part of an email received yesterday that gives performance tuning advice on a vSAN bug. Internal information is hidden.

Figure 6. Part of email generated by PerfPsychic to offer performance improvement suggestions