By Dave Jaffe, VMware Performance Engineering and Padma Apparao, Intel – VMware Center of

A new white paper is available that shows the advantages of running deep learning image classification on the 2nd Generation Intel Xeon Scalable processor compared to previous Intel processors, as well as the performance benefits of running on the VMware vSphere hypervisor compared to bare metal.

The 2nd Generation Intel Xeon Scalable processor’s Deep Learning Boost technology includes new Vector Neural Network Instructions (VNNI), which are especially performant with input data expressed as an 8-bit integer (int8) rather than a 32-bit floating point number (fp32). Together with the large VNNI registers, these instructions provide a marked performance improvement in image classification over the previous generation of Intel Xeon Scalable processors.
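To make the int8 versus fp32 distinction concrete, here is a minimal sketch of symmetric linear quantization in plain Python. It is a generic textbook scheme, not the quantization flow used by Intel-optimized TensorFlow or in the paper, and the names `quantize_int8`, `dequantize`, and `activations` are illustrative assumptions.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate floating point values."""
    return [q * scale for q in quantized]


# Hypothetical fp32 activations, chosen only for illustration.
activations = [0.02, -1.27, 0.5, 1.0]
codes, scale = quantize_int8(activations)
restored = dequantize(codes, scale)
print(codes)
print(restored)
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most half of `scale` per value.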

The latest version of vSphere, 7.0, supports the VNNI instructions. The work reported in this paper demonstrates a very small virtualization overhead for single image inferencing but major performance advantages for properly configured virtualized servers compared to the same servers running as bare metal.

Four different image classification tests utilizing the Intel-optimized version of TensorFlow with pretrained models in fp32 and int8 precisions were performed:

In Single Image Classification Latency, the benchmark program sent one image at a time (batch_size = 1) through a pre-trained ResNet50 neural network model. The metric recorded was the latency, or average time to classify a single image, in milliseconds (msec). There was a small (1.5% or less) virtualization overhead in this test.
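The latency metric can be sketched as follows. This is not the benchmark program from the paper: `classify` is a hypothetical stand-in for a ResNet50 forward pass, and the warm-up step and the `overhead_percent` helper are assumptions added for illustration.

```python
import time


def classify(image):
    """Hypothetical stand-in for a ResNet50 forward pass on one image."""
    return sum(image) % 1000  # placeholder work, not a real model


def mean_latency_msec(images, warmup=2):
    """Average per-image latency in milliseconds with batch_size = 1."""
    # Warm-up calls are excluded so one-time setup cost does not skew the mean.
    for image in images[:warmup]:
        classify(image)
    start = time.perf_counter()
    for image in images:
        classify(image)  # one image per call
    elapsed = time.perf_counter() - start
    return elapsed / len(images) * 1000.0


def overhead_percent(bare_metal_msec, virtualized_msec):
    """Virtualization overhead as a percentage of the bare metal latency."""
    return (virtualized_msec - bare_metal_msec) / bare_metal_msec * 100.0


# Synthetic "images" (short lists of numbers) used only to exercise the loop.
images = [[i, i + 1, i + 2] for i in range(100)]
latency = mean_latency_msec(images)
print(f"mean latency: {latency:.4f} msec")
```

Running the same measurement on bare metal and in a VM, then passing both means to `overhead_percent`, gives the kind of overhead figure quoted above.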

In Large Batch Image Classification Throughput, the batch size was set to 1024 and the metric recorded was throughput in images per second. There was no constraint on the average image classification latency. As shown below, the Large Batch Image Classification Throughput of 2nd Generation Intel Xeon Scalable processors using int8 quantization was 3.49x that of 1st Generation Intel Xeon Scalable processors using fp32.
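As a rough sketch of how a throughput figure and a ratio such as 3.49x are derived, the snippet below times a run of fixed-size batches and divides images processed by elapsed time. `classify_batch` is a hypothetical placeholder for the model, and the synthetic batches are assumptions, not the paper's data set.

```python
import time

BATCH_SIZE = 1024  # batch size used in the large batch test


def classify_batch(batch):
    """Hypothetical stand-in for running one batch through the model."""
    return [sum(image) % 1000 for image in batch]


def measure_throughput(batches):
    """Return images per second over all batches; latency is not constrained."""
    start = time.perf_counter()
    total_images = 0
    for batch in batches:
        classify_batch(batch)
        total_images += len(batch)
    elapsed = time.perf_counter() - start
    return total_images / elapsed


def speedup(throughput, baseline_throughput):
    """Ratio of two throughputs, for example int8 on one CPU vs fp32 on another."""
    return throughput / baseline_throughput


# Four synthetic batches of BATCH_SIZE tiny "images" each.
batches = [[[i, j] for j in range(BATCH_SIZE)] for i in range(4)]
rate = measure_throughput(batches)
print(f"{rate:.0f} images/sec")
```

A reported speedup is simply `speedup(rate_a, rate_b)` for two configurations measured the same way.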

Single Image Classification Throughput Scaling measures the throughput of running single image classification in multiple instances as the number of instances is increased. For the bare metal case, the benchmark program was run in a single instance, and then run simultaneously in multiple instances, with the number of instances increasing from 2 to 8. For the virtualized case, the benchmark program was run simultaneously in separate VMs, with the number of VMs increasing from 1 to 8. The metric recorded was throughput, in images per second. As shown in the figure below, the single-instance bare metal result, which uses the entire server, is faster than the virtualized result from one small (12 vCPU, 90 GB) VM, but as the number of bare metal instances and VMs increases, the total throughput in the virtualized case overtakes that of the bare metal case and ends up being 2.04x faster due to the better resource utilization afforded by virtualization.
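The scaling measurement can be pictured as running N copies of the single-image loop at once and dividing the total image count by wall-clock time. The sketch below uses threads purely as a stand-in for separate benchmark processes or VMs. Because of Python's global interpreter lock it will not reproduce real scaling behavior, and `classify`, `run_instance`, and `total_throughput` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def classify(image):
    """Hypothetical stand-in for single-image inference."""
    return sum(image) % 1000


def run_instance(num_images):
    """One benchmark instance: classify images one at a time, return the count."""
    for i in range(num_images):
        classify([i, i + 1, i + 2])
    return num_images


def total_throughput(num_instances, images_per_instance=1000):
    """Images per second across all concurrently running instances."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_instances) as pool:
        counts = list(pool.map(run_instance, [images_per_instance] * num_instances))
    elapsed = time.perf_counter() - start
    return sum(counts) / elapsed


for n in (1, 2, 4, 8):
    print(n, f"{total_throughput(n):.0f} images/sec")
```

In the real test each instance would be an independent process pinned to its own share of the server, or a separate VM, which is where the resource utilization differences come from.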

In Multistream Image Classification Throughput, the program sent multiple images at once (batch_size > 1) through the ResNet50 inference engine. The test measured the maximum throughput that the bare metal server and the virtualized server could each achieve while meeting a specified latency constraint. The latency constraint used was 33.3 msec, equivalent to 30 video frames per second. The virtualized server with int8 quantization had 9% better throughput (images per second) than an identical bare metal server.
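One way to think about this test is as a search over batch sizes for the highest throughput whose latency stays within the limit. The sketch below assumes the latency constraint applies to the time to process one batch, which may differ from the paper's exact definition. The function name and the sample latencies are invented for illustration and are not results from the paper.

```python
LATENCY_LIMIT_MSEC = 1000.0 / 30.0  # 30 frames per second, about 33.3 msec


def best_batch_under_limit(latency_by_batch, limit_msec=LATENCY_LIMIT_MSEC):
    """Pick the batch size with the highest throughput that meets the limit.

    latency_by_batch maps batch size -> measured latency per batch in msec.
    Returns (batch_size, images_per_sec), or None if nothing qualifies.
    """
    best = None
    for batch_size, latency_msec in latency_by_batch.items():
        if latency_msec > limit_msec:
            continue  # violates the latency constraint
        images_per_sec = batch_size / (latency_msec / 1000.0)
        if best is None or images_per_sec > best[1]:
            best = (batch_size, images_per_sec)
    return best


# Made-up latencies for illustration only; not results from the paper.
example = {1: 5.0, 8: 16.0, 16: 30.0, 32: 58.0}
print(best_batch_under_limit(example))
```

Larger batches usually raise throughput until the latency limit is hit, so the winner tends to be the largest batch size that still fits under 33.3 msec.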

All details are in the paper.