Introduced in vSphere 5.5, a Linux-based driver was added to support 40GbE Mellanox adapters on ESXi. Now vSphere 6.0 adds a native driver and Dynamic NetQueue for Mellanox, and these features significantly improve network performance. In addition to the device driver changes, vSphere 6.0 includes improvements to the vmxnet3 virtual NIC (vNIC) that allows a single vNIC to achieve line-rate performance with 40GbE physical NICs. Another performance feature introduced in 6.0 for high bandwidth NICs is NUMA Aware I/O which improves performance by collocating highly network-intensive workloads with the device NUMA node. In this blog, we highlight these features and the corresponding benefits achieved.
We used two identical Dell PowerEdge R720 servers with Intel E5-2667 @ 2.90GHz and 64GB of memory and Mellanox Technologies MT27500 Family [ConnectX-3] / Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network NICs for our tests.
In the single VM test, we used 1 RHEL 6 VM with 4 vCPUs on each ESXi host with 4 netperf TCP streams running. We then measured the cumulative throughput for the test.
For the multi-VM test, we configured multiple RHEL VMs with 1 vCPU each and used an identical number of VMs on the receiver side. Each VM used 4 sessions of netperf for driving traffic, and we measured the cumulative throughput across the VMs.
Single vNIC Performance Improvements
In order to achieve line-rate performance for vmxnet3, changes were made to the virtual NIC adapter for vSphere 6.0 so that multiple hardware queues could push data to vNICs simultaneously. This allows vmxnet3 to use multiple hardware queues from the physical NIC more effectively. This not only increases the throughput a single vNIC can achieve, but also helps in overall CPU utilization.
As we can see from figure 1 below, 1 VM with 1 vNIC on vSphere 6.0 can achieve more than 35Gbps of throughput as compared to 20Gbps achieved in vSphere 5.5 (indicated by the blue bar chart). The CPU used to receive 1Gbps of traffic, on the other hand, is reduced by 50% (indicated by the red line chart).
Figure 1. 1VM vmxnet3 Receive throughput
By default, a single vNIC can receive packets from a single hardware queue. To achieve higher throughput, the vNIC has to request more queues. This can be done by setting ethernetX.pnicFeatures = “4” in the .vmx file. This option also requires the physical NIC to have RSS mode turn on. For Mellanox adapters, the RSS feature can be turned on by reloading the driver with num_rings_per_rss_queue=4.
CPU Cost Improvements for Mellanox 40GbE NIC
In addition to scalability improvements for the vmxnet3 adapter, vSphere 6.0 features an improved version of the Mellanox 40GbE NIC driver. The updated driver uses vSphere 6.0 APIs and, as a result, performs better than the earlier Linux-based driver. Native APIs remove the extra CPU overheads of data structure conversion that were earlier present in the Linux-based driver. The driver also has new features like Dynamic NetQueue that improves CPU utilization even further. Dynamic netqueue in vSphere 6.0 intelligently chooses the optimal number of active hardware queues in use according to the network workload and per NUMA-node CPU utilization.
Figure 2: Multi VM CPU usage for 40G traffic
As seen in figure 2 above, the new driver can improve CPU efficiency by up to 22%. For all these test cases, the Mellanox NIC was achieving line-rate throughput for both vSphere 6.0 and vSphere 5.5. Please note that for the multi-VM tests, we were using a 1-vCPU VM and vmxnet3 was using a single queue. The RSS feature on the Mellanox Adapter was also turned off.
NUMA Aware I/O
In order to achieve the best performance out of 40GbE NICs, it is advisable to place the throughput-intensive workload on the same NUMA system to which the adapter is attached. vSphere 6.0 features a new configuration option that tries to do this automatically and is available through a system-wide option. The configuration will pack all kernel networking threads on the same NUMA node to which the device is connected. The scheduler will then try to place the VMs that use these networking threads the most on the same NUMA node. By default, the configuration is turned off because it may cause uneven workload distribution between multiple NUMA nodes, especially in the cases where all NICs are connected to the same NUMA node.
Figure 3: NUMA I/O benefit.
As seen in Figure 3 above, NUMA I/O can result in about 20% reduced CPU consumption and about 20% higher throughput with a 1-vCPU VM for 40GbE NICs. There is no throughput improvement for Intel NICs because we achieve line rate irrespective of where the workloads are placed. We do however see an increase in CPU efficiency of about 7%.
To enable this option, set the value of Net. NetNetqNumaIOCpuPinThreshold in the Advanced System Settings tab for the host. The value is configurable and can vary between 0 and 200. For example, if you set the value to 100, this results in using NUMA I/O as long as the networking load is less than 100% (that is, the networking threads do not use more than 1 core). Once the load increases to 100%, vSphere 6.0 will follow default scheduling behavior and will schedule VMs and networking threads across different NUMA nodes.
vSphere 6.0 includes some great new improvements in network performance. In this blog, we show:
- Vmxnet3 can now achieve near line-rate performance with a 40GbE NIC.
- Significant performance improvements were made to the Mellanox driver, which is now up to 25% more efficient.
- vSphere also features a new option to turn on NUMA I/O that could improve application performance by up to 15%.