The IoT Analytics Benchmark released last year dealt with an important Internet of Things use case—monitoring factory sensor data for impending failure conditions. This year, we are tackling an equally important use case—image classification. Whether used in facial recognition, license plate readers, inspection systems, or autonomous vehicles, neural network–based deep learning is making image detection and classification a viable technology.
As in the classic machine learning used in the original IoT Analytics Benchmark code (which used the Spark Machine Learning Library), the new deep learning code first trains a model using pre-labeled images and then deploys that model to infer the classification of new images. For IoT this inference step is the most important. Thus, the new programs, designated as IoT Analytics Benchmark DL, use previously trained models (included in the kit) to demonstrate inferencing that can be performed at the edge (on small gateway systems) or in scaled-out Spark clusters.
Data scientists may use GPUs on vSphere that are dedicated to use by one virtual machine only for their modeling work, if they need to. Certain heavier machine learning workloads may well require that dedicated approach. However, there are also many ML workloads and user types that do not use a dedicated GPU continuously to its maximum capacity. This presents an opportunity for shared use of a physical GPU by more than one virtual machine/user. This article explores the performance of a shared-GPU setup like this, supported by the NVIDIA GRID product on vSphere, and presents performance test results that show that sharing is a feasible approach. The other technical reasons for sharing a GPU among multiple VMs are also described here. The article also gives best practices for determining how the sharing of a GPU may be done.
VMware vSphere supports NVIDIA GRID technology for multiple types of workloads. This technology virtualizes GPUs via a mediated passthrough mechanism. Initially, NVIDIA GRID supported GPU virtualization for graphics workloads only. But, since the introduction of Pascal GPU, NVIDIA GRID has supported GPU virtualization for both graphics and CUDA/machine learning workloads. With this support, multiple VMs running GPU-accelerated workloads like machine learning/deep learning (ML/DL) based on TensorFlow, Keras, Caffe, Theano, Torch, and others can share a single GPU by using a vGPU provided by GRID. This brings benefits in multiple use cases that we discuss on this post.
We in VMware’s Performance team create and maintain various tools to help troubleshoot customer issues—of these, there is a new one that allows us to more quickly determine storage problems from vast log data using artificial intelligence. What used to take us days, now takes seconds. PerfPsychic analyzes storage system performance and finds performance bottlenecks using deep learning algorithms.
Let’s examine the benefit artificial intelligence (AI) models in PerfPsychic bring when we troubleshoot vSAN performance issues. It takes our trained AI module less than 1 second to analyze a vSAN log and to pinpoint performance bottlenecks at an accuracy rate of more than 91%. In contrast, when analyzed manually, an SR ticket on vSAN takes a seasoned performance engineer about one week to deescalate, while the durations range from 3 days to 14 days. Moreover, AI also wins over traditional analyzing algorithms by enhancing the accuracy rate from around 80% to more than 90%.