By Dave Jaffe, VMware Performance Engineering and Padma Apparao, Intel – VMware Center of
A new white paper shows the advantages of running deep learning image classification on the 2nd Generation Intel Xeon Scalable processor compared to previous Intel processors, as well as the performance benefits of running on the VMware vSphere hypervisor compared to bare metal.
The 2nd Generation Intel Xeon Scalable processor’s Deep Learning Boost technology includes new Vector Neural Network Instructions (VNNI), which are especially performant with input data expressed as an 8-bit integer (int8) rather than a 32-bit floating point number (fp32). Together with the large VNNI registers, these instructions provide a marked performance improvement in image classification over the previous generation of Intel Xeon Scalable processors.
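To make the int8 versus fp32 distinction concrete, here is a minimal sketch in plain Python. It assumes a simple symmetric linear quantization scheme, which is only one of several possible schemes and is not necessarily the one used in the white paper. The function names `quantize` and `int8_dot` and the sample values are hypothetical. The sketch mimics the general pattern of multiplying 8-bit integers and accumulating the products in a wider integer, then rescaling back to a floating point result.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization of floats to signed integers (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    # Round to the nearest integer and clamp to the representable range.
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(a, b):
    """Dot product computed on quantized integers, then rescaled to float."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer multiply-accumulate
    return acc * scale_a * scale_b


# Illustrative values only, not taken from the white paper.
weights = [0.12, -0.5, 0.33, 0.9]
inputs = [1.0, 0.25, -0.75, 0.5]

exact = sum(w * x for w, x in zip(weights, inputs))   # floating point reference
approx = int8_dot(weights, inputs)                    # quantized approximation
print(f"fp32-style result: {exact:.4f}, int8-style result: {approx:.4f}")
```

The two results differ slightly because quantization discards precision. Whether that loss is acceptable depends on the model, which is why int8 inference is usually validated against an fp32 baseline.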
The latest version of vSphere, 7.0, supports the VNNI instructions. The work reported in this paper demonstrates a very small virtualization overhead for single-image inferencing, but major performance advantages for properly configured virtualized servers compared to the same servers running as bare metal.
PerfPsychic, our AI-based performance analysis tool, improves its accuracy rate from 21% to 91% with more data and training when debugging vSAN performance issues. Better still, PerfPsychic can continuously improve itself, and the tuning procedure is automated. Let’s examine how we achieve this in the following sections.
How to Improve AI Model Accuracy
Three elements have a large impact on the training results for deep learning models: the amount of high-quality training data, reasonably configured hyperparameters that control the training process, and sufficient but acceptable training time. In the following examples, we use the same training and testing dataset as we presented in our previous blog.
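As a rough illustration of what "configuring hyperparameters" can look like in practice, the sketch below enumerates every combination in a small search space. The hyperparameter names, the candidate values, and the `candidate_configs` helper are all hypothetical and are not taken from PerfPsychic. An automated tuning procedure would train and evaluate a model for each configuration and keep the best one.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 64],
    "epochs": [10, 50],
}


def candidate_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for combo in product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


configs = list(candidate_configs(search_space))
print(f"{len(configs)} configurations to evaluate")
```

The number of combinations grows multiplicatively with each added hyperparameter, which is one reason training time has to be weighed against how thoroughly the space is searched.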
We in VMware’s Performance team create and maintain various tools to help troubleshoot customer issues. Among these is a new one that lets us identify storage problems in vast log data more quickly using artificial intelligence. What used to take us days now takes seconds. PerfPsychic analyzes storage system performance and finds performance bottlenecks using deep learning algorithms.
Let’s examine the benefit that the artificial intelligence (AI) models in PerfPsychic bring when we troubleshoot vSAN performance issues. It takes our trained AI module less than 1 second to analyze a vSAN log and pinpoint performance bottlenecks with an accuracy rate of more than 91%. In contrast, when analyzed manually, an SR ticket on vSAN takes a seasoned performance engineer about one week to de-escalate, with durations ranging from 3 to 14 days. Moreover, AI also outperforms traditional analysis algorithms, raising the accuracy rate from around 80% to more than 90%.