
Author Archives: Rishi Mehta

Network Improvements in vSphere 6 Boost Performance for 40G NICs

vSphere 5.5 introduced a Linux-based driver to support 40GbE Mellanox adapters on ESXi. vSphere 6.0 adds a native driver and Dynamic NetQueue for Mellanox adapters, and these features significantly improve network performance. In addition to the device driver changes, vSphere 6.0 includes improvements to the vmxnet3 virtual NIC (vNIC) that allow a single vNIC to achieve line-rate performance with 40GbE physical NICs. Another performance feature introduced in 6.0 for high-bandwidth NICs is NUMA Aware I/O, which improves performance by collocating highly network-intensive workloads with the NUMA node of the device. In this blog, we highlight these features and the benefits they provide.

Test Configuration

We used two identical Dell PowerEdge R720 servers, each with Intel Xeon E5-2667 processors @ 2.90GHz and 64GB of memory, equipped with Mellanox Technologies MT27500 family (ConnectX-3) 40GbE and Intel 82599EB 10-Gigabit SFI/SFP+ NICs for our tests.

In the single-VM test, we used one RHEL 6 VM with 4 vCPUs on each ESXi host and ran 4 netperf TCP streams between them. We then measured the cumulative throughput for the test.

For the multi-VM test, we configured multiple RHEL VMs with 1 vCPU each and used an identical number of VMs on the receiver side. Each VM used 4 sessions of netperf for driving traffic, and we measured the cumulative throughput across the VMs.
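Each netperf stream was a standard TCP_STREAM test from a sender VM to a receiver VM. Below is a minimal sketch of how the 4 sessions per VM can be driven; the receiver address and test duration are illustrative placeholders, not the exact values from our runs.

    # On each receiver VM: start the netperf server daemon (listens on port 12865 by default)
    netserver

    # On each sender VM: launch 4 parallel TCP_STREAM sessions toward the receiver VM
    # (192.168.0.10 and the 60-second duration are placeholders)
    for i in 1 2 3 4; do
        netperf -H 192.168.0.10 -t TCP_STREAM -l 60 &
    done
    wait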

Single vNIC Performance Improvements

To achieve line-rate performance with vmxnet3, the virtual NIC adapter in vSphere 6.0 was changed so that multiple hardware queues can push data to a vNIC simultaneously. This allows vmxnet3 to use the physical NIC's multiple hardware queues more effectively, which not only increases the throughput a single vNIC can achieve, but also improves overall CPU utilization.

As Figure 1 below shows, 1 VM with 1 vNIC on vSphere 6.0 can achieve more than 35Gbps of throughput, compared to 20Gbps on vSphere 5.5 (blue bars). The CPU used to receive 1Gbps of traffic, meanwhile, is reduced by 50% (red line).


Figure 1. Single-VM vmxnet3 receive throughput

By default, a single vNIC receives packets from a single hardware queue. To achieve higher throughput, the vNIC has to request more queues. This can be done by setting ethernetX.pnicFeatures = “4” in the .vmx file. This option also requires RSS mode to be turned on for the physical NIC. For Mellanox adapters, the RSS feature can be turned on by reloading the driver with num_rings_per_rss_queue=4.
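For illustration, here is how the two settings above can be applied. The .vmx line is taken directly from the text; the esxcli command is a sketch that assumes the native Mellanox ConnectX-3 driver module is named nmlx4_en on the host (verify with esxcli system module list), and the driver must be reloaded (or the host rebooted) for the parameter to take effect.

    # In the VM's .vmx file (ethernet0 is the vNIC to modify):
    ethernet0.pnicFeatures = "4"

    # On the ESXi host: pass the RSS parameter to the Mellanox driver module.
    # The module name nmlx4_en is an assumption; check it on your system.
    esxcli system module parameters set -m nmlx4_en -p "num_rings_per_rss_queue=4"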

CPU Cost Improvements for Mellanox 40GbE NIC

In addition to the scalability improvements for the vmxnet3 adapter, vSphere 6.0 features an improved version of the Mellanox 40GbE NIC driver. The updated driver uses native vSphere 6.0 APIs and, as a result, performs better than the earlier Linux-based driver: the native APIs remove the extra CPU overhead of data structure conversion present in the Linux-based driver. The driver also gains new features such as Dynamic NetQueue, which improves CPU utilization even further by intelligently choosing the optimal number of active hardware queues based on the network workload and per-NUMA-node CPU utilization.


Figure 2. Multi-VM CPU usage for 40G traffic

As seen in Figure 2 above, the new driver improves CPU efficiency by up to 22%. In all of these test cases, the Mellanox NIC achieved line-rate throughput on both vSphere 6.0 and vSphere 5.5. Note that for the multi-VM tests we used 1-vCPU VMs, vmxnet3 used a single queue, and the RSS feature on the Mellanox adapter was turned off.

NUMA Aware I/O

To get the best performance out of 40GbE NICs, it is advisable to place throughput-intensive workloads on the same NUMA node to which the adapter is attached. vSphere 6.0 features a new system-wide configuration option that tries to do this automatically. When enabled, it packs all kernel networking threads on the NUMA node to which the device is connected, and the scheduler then tries to place the VMs that use these networking threads the most on the same NUMA node. The option is turned off by default because it may cause uneven workload distribution across NUMA nodes, especially when all NICs are connected to the same NUMA node.


Figure 3. NUMA I/O benefit

As seen in Figure 3 above, NUMA I/O reduces CPU consumption by about 20% and increases throughput by about 20% with a 1-vCPU VM on 40GbE NICs. There is no throughput improvement for the Intel 10GbE NICs because we achieve line rate regardless of where the workloads are placed; we do, however, see a CPU efficiency gain of about 7%.

To enable this option, set the value of Net.NetNetqNumaIOCpuPinThreshold in the Advanced System Settings tab for the host. The value can vary between 0 and 200. For example, setting it to 100 means NUMA I/O is used as long as the networking load is below 100% (that is, the networking threads do not use more than 1 core). Once the load exceeds 100%, vSphere 6.0 falls back to its default scheduling behavior and schedules VMs and networking threads across different NUMA nodes.
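The same setting can also be applied from the command line; the sketch below uses the standard esxcli advanced-settings syntax with the example value of 100.

    # Set the NUMA I/O CPU pin threshold to 100 (networking load of up to 1 core)
    esxcli system settings advanced set -o /Net/NetNetqNumaIOCpuPinThreshold -i 100

    # Verify the current value
    esxcli system settings advanced list -o /Net/NetNetqNumaIOCpuPinThreshold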

Conclusion

vSphere 6.0 includes some great new improvements in network performance. In this blog, we show:

  • vmxnet3 can now achieve near line-rate performance with a 40GbE NIC.
  • Significant performance improvements were made to the Mellanox driver, which is now up to 25% more efficient.
  • vSphere also features a new option to turn on NUMA I/O that could improve application performance by up to 15%.

 

Line-Rate Performance with 80GbE and vSphere 5.5

With the increasing number of physical cores in a system, the networking bandwidth requirement per server has also increased. Many networking-intensive applications are now consolidated on a single server, so a single vSphere host often requires more than one 10 Gigabit Ethernet (10GbE) adapter. Additional network interface cards (NICs) are also deployed to separate management traffic from virtual machine traffic. It is important for these servers to service all connected NICs well and to drive line rate on all the physical adapters simultaneously.

vSphere 5.5 supports eight 10GbE NICs on a single host, and we demonstrate that a host running with vSphere 5.5 can not only drive line rate on all the physical NICs connected to the system, but can do it with a modest increase in overall CPU cost as we add more NICs.

We configured a single host with four dual-port Intel 10GbE adapters for the experiment and connected them back-to-back with an IXIA Application Network Processor Server with eight 10GbE ports to generate traffic. We then measured the send/receive throughput and the corresponding CPU usage of the vSphere host as we increased the number of NICs under test on the system.

Environment Configuration

  • System Under Test: Dell PowerEdge R820
  • CPUs: 4 x Intel Xeon E5-4650 processors @ 2.70GHz
  • Memory: 128GB
  • NICs: 8 x Intel 82599EB 10GbE SFP+ Network Connection
  • Client: Ixia Xcellon-Ultra XT80-V2, 2U Application Network Processor Server

Challenges in Getting 80Gbps Throughput

To drive nearly 80 gigabits of data per second from a single vSphere host, we used a server that has not only the required CPU and memory resources, but also enough PCIe bandwidth to sustain the necessary I/O. We chose a Dell PowerEdge server with Intel E5-4650 processors because they belong to the first generation of Intel processors that supports PCIe Gen 3.0, which doubles the bandwidth of PCIe Gen 2.0. Each dual-port Intel 10GbE adapter needs at least a PCIe Gen 2.0 x8 slot to reach line rate. The processor also has Intel Data Direct I/O Technology, which places packets directly in the processor cache rather than in memory; this reduces memory bandwidth consumption and also helps reduce latency.
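As a rough sanity check on the slot requirement (approximate figures):

    PCIe Gen 2.0 lane: 5 GT/s with 8b/10b encoding   ~ 4 Gbps usable
    x8 slot:           8 lanes x 4 Gbps               ~ 32 Gbps per direction
    A dual-port 10GbE adapter at line rate needs ~20 Gbps plus overhead,
    so a Gen 2.0 x8 slot is sufficient, and Gen 3.0 doubles that headroom.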

Experiment Overview

Each 10GbE port of the vSphere 5.5 server was configured with a separate vSwitch, and each vSwitch had two Red Hat 6.0 Linux virtual machines running an instance of the Apache web server. The web server virtual machines were configured with 1 vCPU and 2GB of memory with VMXNET3 as the virtual NIC adapter. The 10GbE ports were then connected to the Ixia application server ports. Since the server had two x16 slots and five x8 slots, we used the x8 slots for the four 10GbE NICs so that each physical NIC had identical resources. For each physical connection, we then configured 200 web/HTTP connections on the Ixia server, 100 for each web server, to request or post the file. We used a high number of connections so that there was enough networking traffic to keep each physical NIC at 100% utilization.

Figure 1. System design of NICs, switches, and VMs

The Ixia Xcellon application server used an HTTP GET request to generate a send workload for the vSphere host. Each connection requested a 1MB file from the HTTP web server.

Figure 2 shows that we could consistently reach the available[1] line rate on each physical NIC as we added more NICs to the test. Each physical NIC was transmitting 120K packets per second, with an average TSO packet size close to 10K, and was receiving about 400K acknowledgement packets per second. The total number of packets processed was close to 500K per second for each physical connection.

Figure 2. vSphere 5.5 drives throughput at available line rates. TSO on the NIC resulted in lower packets per second for send.
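These per-port send numbers are roughly self-consistent:

    Send:  ~120K TSO packets/s x ~10 KB average size  ~ 1.2 GB/s ~ 9.6 Gbps (about line rate)
    Acks:  ~400K packets/s received per port
    Total: ~500K packets/s processed per 10GbE port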

Similar to the send case, we configured the application server to post a 1MB file using an HTTP POST request to generate receive traffic for the vSphere host. We used the same number of connections and observed similar throughput behavior. Since the NIC does not support hardware LRO, each NIC was receiving 800K packets per second; with eight 10GbE NICs, the packet rate reached close to 6.4 million packets per second. vSphere performs software LRO for Linux guests, so the guests see large packets and a guest packet rate of around 240K packets per second. There was also significant TCP acknowledgement traffic: the host was transmitting close to 120K acknowledgement packets per second for each physical NIC, bringing the total packets processed close to 7.5 million per second for eight 10GbE ports.
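The aggregate packet rates quoted above break down roughly as:

    Receive: ~800K packets/s per NIC x 8 NICs              ~ 6.4M packets/s
    Acks:    ~120K packets/s transmitted per NIC x 8 NICs  ~ 1.0M packets/s
    Total:   ~7.4M packets/s, i.e. close to 7.5 million packets per second for eight ports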

Figure 3. Average vSphere 5.5 host CPU utilization for send and receive

We also measured the average CPU usage for each of the tests. Figure 3 shows that the vSphere host's CPU usage increased linearly as we added more physical NICs, for both send and receive. This indicates that the CPU cost scales at an expected and acceptable rate.

Test results show that vSphere 5.5 is an excellent platform on which to deploy networking-intensive workloads. vSphere 5.5 makes use of all the available physical bandwidth and does so with only a modest, linear increase in CPU cost.

 


[1] A 10GbE NIC can achieve only about 9.4 Gbps of TCP throughput with the standard MTU. For a 1500-byte packet, 40 bytes are consumed by the TCP/IP headers and another 38 bytes by the Ethernet frame format.
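The arithmetic behind that figure, approximately:

    Payload per packet:   1500 - 40 (TCP/IP headers)                      = 1460 bytes
    Bytes on the wire:    1500 + 38 (Ethernet header/FCS, preamble, gap)  = 1538 bytes
    TCP goodput on 10GbE: 10 Gbps x 1460/1538                             ~ 9.4-9.5 Gbps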

VXLAN Performance on vSphere 5.1

The VMware vSphere I/O performance team has published a paper that shows VXLAN performance on vSphere 5.1. Virtual extensible LAN (VXLAN) is a network encapsulation mechanism that enables virtual machines to be deployed on any physical host, regardless of the host’s network configuration.

The paper shows how VXLAN is competitive in terms of performance when compared to a configuration without VXLAN enabled. The paper describes the test results for three experiments: throughput for large and small message sizes, CPU utilization for large and small message sizes, and throughput and CPU utilization for 16 virtual machines with various message sizes. Results show that a virtual machine configured with VXLAN achieved similar networking performance to a virtual machine without VXLAN configured, both in terms of throughput and CPU cost. Additionally, vSphere 5.1 scales well as more virtual machines are added to the VXLAN network. Read the full paper here.

Multicast Performance on vSphere 5.0

Multicast is an efficient way of disseminating information and communicating over the network. A single sender can connect to multiple receivers and exchange information while conserving network bandwidth. Financial stock exchanges, multimedia content delivery networks, and commercial enterprises often use multicast as a communication mechanism. VMware virtualization takes multicast efficiency to the next level by enabling multiple receivers on a single ESXi host. Since the receivers are on the same host, the physical network does not have to transfer multiple copies of the same packet. Packet replication is carried out in the hypervisor instead.

In releases of vSphere prior to 5.0, packet replication for multicast was done in a single context. With a high VM density per host and high packet rates, that replication context could become a bottleneck and cause packet loss. VMware added a new feature in ESXi 5.0 that splits the cost of replication across multiple physical CPUs, making vSphere 5.0 a highly scalable and efficient platform for multicast receivers. This feature is called splitRxMode, and it can be enabled on a VMXNET3 virtual NIC. Because fanning out processing to multiple contexts causes a slight increase in CPU consumption and is generally not needed, the feature is disabled by default. VMware recommends enabling splitRxMode in situations where multiple VMs share a single physical NIC and receive a lot of multicast/broadcast packets.

To enable splitRxMode for the Ethernet device (an equivalent .vmx entry is sketched after these steps):

  1. Select the virtual machine you wish to change, then click Edit virtual machine settings.
  2. Under the Options tab, select General, then click Configuration Parameters.
  3. Look for ethernetX.emuRxMode. If it's not present, click Add Row and enter the new variable.
  4. Click on the value to be changed and configure it as “1”. For default behavior the value is “0”.
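Equivalently, with the VM powered off, the same parameter can be added directly to the VM's .vmx file (ethernet0 here stands for whichever vNIC you want to change):

    # Enable splitRxMode for this VMXNET3 vNIC; set to "0" to restore the default behavior
    ethernet0.emuRxMode = "1"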

Environment Configuration

  • Systems Under Test: 2 Dell PowerEdge R810
  • CPUs: 2 x 8-core Intel Xeon E7-8837 @ 2.67GHz (no hyperthreading)
  • Memory: 64GB
  • NICs: Broadcom NetXtreme II 57711 10Gb Ethernet

Experiment Overview

We tested splitRxMode by scaling the number of VMs on a single ESXi host from 1 to 36, with each VM receiving up to 40K packets per second. With 32 VMs powered on, the consolidation ratio was 2 VMs per physical core. The sender was a 2-vCPU RHEL VM on a separate physical machine transmitting 800-byte multicast packets at a fixed interval. The receivers were 1-vCPU RHEL VMs running on the same ESXi host. Each receiver used 10-15% of its CPU to process 10K packets per second, and the usage increased linearly with the packet rate; no noticeable difference in CPU usage was observed when splitRxMode was enabled. We then measured the total packets received by each client and calculated the average packet loss for the setup.
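To put those rates in perspective (approximate figures):

    On the wire:       40K packets/s x 800 bytes x 8 bits  ~ 256 Mbps (one copy per multicast packet)
    In the hypervisor: 36 receiver VMs x 40K packets/s     ~ 1.44M packet deliveries per second on one host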

Experiment Results

With the default ESXi configuration, we could run up to 20 VMs, each receiving 40K packets per second, with less than 0.01% packet loss. As we powered on more VMs, the networking context became the bottleneck and we started observing packet loss across all the receivers. The loss rate increased substantially as we powered on more VMs, and a similar trend was observed at lower packet rates (30K packets per second).

 

We repeated the experiment after enabling splitRxMode on all VMs. As seen in the graph below, the new feature greatly increases the scalability of the vSphere platform in handling multicast packets. We were now able to power on 40% more VMs (28 VMs) than before, each receiving 40K packets per second with less than 0.01% packet loss. At lower packet rates the improvement is even more pronounced: we could not induce packet loss even with 36 VMs powered on.