In Part 1 we described the solution components and the test cases. In this part we will look at the results from our testing.

Benchmark: tf_cnn_benchmarks suite (TensorFlow)

 

The testing was done with tf_cnn_benchmarks, one of the TensorFlow benchmark suites. This suite is designed for performance, since its models use the strategies recommended in the TensorFlow Performance Guide. It contains implementations of several popular convolutional models, including Inception-v3 and ResNet-50.
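For reference, a single-GPU run of tf_cnn_benchmarks can be launched as in the sketch below. The --model, --batch_size, and --num_gpus flags come from the public benchmark script; the Python wrapper itself and the script path are illustrative assumptions rather than our actual test harness.

    # Minimal launcher sketch for a tf_cnn_benchmarks run (assumes the
    # tensorflow/benchmarks repository is checked out alongside this script).
    import subprocess

    def run_benchmark(model="resnet50", batch_size=64, num_gpus=1):
        """Launch one tf_cnn_benchmarks training run and return its console output."""
        cmd = [
            "python", "tf_cnn_benchmarks.py",   # from tensorflow/benchmarks
            f"--model={model}",                 # e.g. resnet50 or inception3
            f"--batch_size={batch_size}",       # per-GPU batch size
            f"--num_gpus={num_gpus}",
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout

    if __name__ == "__main__":
        print(run_benchmark())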

Results

 

The native use case is treated as the baseline for all other tests. The native results are normalized to 1, and all other results are reported relative to this baseline.
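As a concrete illustration of this normalization (a minimal sketch, not the scripts used in our testing), each configuration's measured images/sec is simply divided by the native baseline:

    def normalize_to_native(throughputs):
        """Divide each configuration's images/sec by the 'native' baseline so that
        native is 1.0 and every other result is a fraction of native."""
        baseline = throughputs["native"]
        return {name: images_per_sec / baseline
                for name, images_per_sec in throughputs.items()}

    # Expected input shape (values are the measured images/sec, omitted here):
    # normalize_to_native({"native": ..., "roce_100g": ..., "vmxnet3_10g": ...})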

 

Use Case 1: Local

Figure 5: Local Run with Bitfusion server and client on same GPU VM with 100 Gb/s RoCE configured in Passthrough mode and 10 Gb/s VMXNET3

 

As shown in Figure 5, there is no difference in performance between 10 Gb/s VMXNET3 and 100 Gb/s RoCE. Because the Bitfusion server and client are on the same VM, the traffic never traverses the physical network, so network speed does not affect the result.
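For the local use case, the benchmark is launched through the Bitfusion FlexDirect client on the same GPU VM that hosts the server. The sketch below shows one way this could be scripted; the flexdirect run -n 1 form follows typical FlexDirect usage, but the exact CLI flags can differ between releases, so treat the command line as an assumption.

    # Sketch of a local run: FlexDirect client and Bitfusion server on the same GPU VM,
    # so CUDA traffic never leaves the host. The flexdirect flags are assumed, not verified.
    import subprocess

    benchmark = ["python", "tf_cnn_benchmarks.py",
                 "--model=resnet50", "--batch_size=64", "--num_gpus=1"]

    # Request one full GPU from the local Bitfusion server and run the benchmark under it.
    subprocess.run(["flexdirect", "run", "-n", "1", *benchmark], check=True)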

 

Use Case 2: Local Partials

 

Figure 6: Local Run with Bitfusion server and two clients on same GPU VM with 100 Gb/s RoCE configured in Passthrough mode and 10 Gb/s VMXNET3

 

As shown in Figure 6, there is no difference in performance between 10 Gb/s VMXNET3 and 100 Gb/s RoCE for the partial use case either, for the same reason given above. There is a small throughput reduction when sharing the GPU. Note that a larger batch size (the number of training samples propagated through the network in one pass) requires more memory; batch size 128 does not fit into the partitioned GPU memory when sharing.
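As a rough illustration of the partial case, the sketch below starts two concurrent clients, each asking FlexDirect for half of the GPU's memory and using batch size 64 (since, as noted above, batch size 128 does not fit into a partitioned half). The -p fraction flag reflects typical FlexDirect usage and is an assumption here, not a command line taken from our tests.

    # Sketch of two half-GPU jobs sharing one physical GPU on the same VM.
    import subprocess

    def run_partial(fraction, batch_size):
        """Start one benchmark job on a fractional GPU (flags are assumed FlexDirect syntax)."""
        cmd = [
            "flexdirect", "run",
            "-n", "1",            # one (partial) GPU
            "-p", str(fraction),  # fraction of GPU memory to allocate
            "python", "tf_cnn_benchmarks.py",
            "--model=resnet50", f"--batch_size={batch_size}", "--num_gpus=1",
        ]
        return subprocess.Popen(cmd)

    # Launch two concurrent half-GPU jobs and wait for both to finish.
    jobs = [run_partial(0.5, batch_size=64) for _ in range(2)]
    for job in jobs:
        job.wait()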

 

Use Case 3: Remote

Figure 7: Remote Run with Bitfusion client on CPU VM to use remote GPU with 100 Gb/s RoCE configured in Passthrough mode and 10 Gb/s VMXNET3

 

As shown in Figure 7, the performance overhead is very small with 100 Gb/s RoCE but more significant with 10 Gb/s VMXNET3. The amount of overhead depends on the benchmark model and the training batch size.

Figure 8: Remote Run with Bitfusion client on CPU VM with 10 Gb/s VMXNET3 tuned and untuned

 

The original remote tests were run without the latency-sensitivity tunings that apply to VMXNET3 interfaces. The remote tests with VMXNET3 were then repeated with these tunings applied.

As shown in Figure 8, with the proper network tunings, the remote case with 10 Gb/s VMXNET3 improves greatly. The network tunings we applied are listed below; a configuration sketch follows the list:

  • Set Latency Sensitivity to High (the default is Normal).

How to set: use the vSphere Web Client: Settings -> VM Options -> Advanced -> Latency Sensitivity.

  • Reserve CPU and memory.

For best performance, High Latency Sensitivity requires exclusive access to physical resources: pCPUs dedicated to the vCPUs and a full memory reservation that eliminates ballooning or swapping, both of which can add latency.

Additional tunings may further improve performance. Please refer to “Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs”.
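For illustration only, the sketch below applies the same two tunings programmatically with pyVmomi (the vSphere Python SDK) instead of the Web Client. Obtaining the vim.VirtualMachine object (connection and inventory lookup) is omitted, and the CPU reservation value is a placeholder; this is not part of our test setup.

    # Sketch: set Latency Sensitivity to High and reserve CPU/memory for a VM via pyVmomi.
    from pyVmomi import vim

    def apply_latency_tunings(vm, cpu_reservation_mhz):
        """Reconfigure 'vm' (a vim.VirtualMachine) for latency-sensitive workloads."""
        spec = vim.vm.ConfigSpec()

        # Latency Sensitivity: Normal (default) -> High
        spec.latencySensitivity = vim.LatencySensitivity(
            level=vim.LatencySensitivity.SensitivityLevel.high)

        # Full CPU reservation (MHz) and memory reservation locked to the VM's
        # configured size, so ballooning/swapping cannot add latency.
        spec.cpuAllocation = vim.ResourceAllocationInfo(reservation=cpu_reservation_mhz)
        spec.memoryReservationLockedToMax = True

        return vm.ReconfigVM_Task(spec=spec)   # returns a vim.Task to monitor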

 

Use Case 4: Remote Partials

   

Figure 9: Remote Run with two Bitfusion clients on two CPU VMs with 100 Gb/s RoCE configured in Passthrough mode and 10 Gb/s VMXNET3

 

Figure 9 shows the remote partial results with 100 Gb/s RoCE Passthrough and 10 Gb/s VMXNET3. It may seem counterintuitive that performance is not sensitive to the networking option, i.e., that 100 Gb/s RoCE Passthrough performs close to 10 Gb/s Ethernet. This is because the TensorFlow models have no data dependencies between the two jobs: the two VMs can perform their calculations on remote partial GPUs without waiting for data transfers, and are therefore not sensitive to latency. Similarly, as shown in Figure 10, with tunings applied, the aggregate performance of remotely shared GPUs shows little difference between untuned and tuned VMXNET3.

Figure 10: Remote Run with two Bitfusion clients on two CPU VMs with 10 Gb/s VMXNET3 tuned and untuned

 

Conclusion

The results clearly show that it is possible to share GPUs across multiple users and applications with Bitfusion on vSphere. Using a high-speed interconnect such as 100 Gb/s RoCE brings performance very close to native. Traditional VMXNET3-based virtualized infrastructures can also be used effectively for GPU sharing when tuned for latency sensitivity. Combining vSphere capabilities with Bitfusion gives virtual machines across the environment maximum flexibility to access GPU resources remotely and to create fractional GPUs of any size at runtime, with minimal loss in performance. A shared GPU solution based on vSphere and Bitfusion FlexDirect helps reduce infrastructure costs while providing easy and pervasive GPU access for machine learning environments.