Large Receive Offload (LRO) Support for VMXNET3 Adapters with Windows VMs on vSphere 6

Large Receive Offload (LRO) is a technique to reduce the CPU time for processing TCP packets that arrive from the network at a high rate. LRO reassembles incoming packets into larger ones (but fewer packets) to deliver them to the network stack of the system. LRO processes fewer packets, which reduces its CPU time for networking. Throughput can be improved accordingly since more CPU is available to deliver additional traffic. On Windows, LRO is also referred to as Receive Segment Coalescing (RSC).

LRO has been supported for Linux VMs with kernel 2.6.24 and later since vSphere 4. With the introduction of Windows Server 2012 and Windows 8 supporting LRO, vSphere 6 now adds support for LRO on a VMXNET3 adapter on Windows VMs. LRO is especially beneficial in the virtualized environment in which resources are shared by multiple VMs. This blog shows the performance benefits of using LRO for Windows VMs running on vSphere 6.

Test-Bed Setup

The test bed consists of a vSphere 6.0 host running VMs and a client machine that drives workload generation. Both machines have dual-socket, 6-core 2.9GHz Intel Xeon E5-2667 (Sandy Bridge) processors. The client machine is configured with native Red Hat Enterprise Linux 6 that generates TCP flows. The VMs in the vSphere host run Windows 2012 Server and are configured with 4 vCPUs and 2GB RAM. Both machines have an Intel 82599EB 10Gbps adapter installed, which are connected using a 10 Gigabit Ethernet (GbE) switch.

Performance Results

1. Native to VM Traffic

Figures 1 and 2 show a CPU efficiency and throughput comparison with and without LRO when TCP streams are generated from the client machine to a VM running on the vSphere host. Netperf is used to generate traffic. Three different message sizes are used: 256 bytes, 16KB, 64KB. The socket size is set to 8K, 64K, and 256K respectively. The message size is used to determine the number of bytes that Netperf delivers to the TCP stack in the client machine, which then determines the actual packet sizes. The NIC in the client machine splits packets with a size larger than the MTU (1500 is used for this blog) into smaller MTU-sized ones before sending them out. Once packets are received in the vSphere host, LRO aggregates packets and delivers larger packets (but smaller in number) to the receiving VM running in the host. This process is done in either hardware (that is, the physical NIC) or software (that is, the vSphere networking stack) depending on the NIC type and configuration. In this blog, LRO is performed in the vSphere networking stack before packets are delivered to the VM.

CPU efficiency is calculated by dividing the throughput by the number of CPU cores used for both the hypervisor and the VM, representing how much throughput in gigabits per second (Gbps) a single CPU core can receive. For example, a CPU efficiency of 5 means the system can handle 5Gbps with one core. Therefore, a higher CPU efficiency is desirable.

As shown in Figures 1 and 2, using LRO considerably improves both CPU efficiency and throughput with all three message sizes. Figure 1 shows that CPU efficiency improves by 86%, 25%, and 33% with 256 byte, 16KB, and 64KB messages respectively, when compared to the case without LRO. Figure 2 shows throughput improves 54% for 256 byte messages (0.6Gbps to 0.9Gbps) and 5% for 16KB messages (9.0Gbps to 9.4Gbps). There is not much difference for 64KB packets (9.5Gbps to 9.4Gbps, this is within a rage of normal variance). With 16KB and 64KB messages, throughput with LRO is already at line rate (that is, 10Gbps), which is why the improvement is not as significant as CPU efficiency.

Figure 3 compares the packet rate and the average packet size between the messages with LRO and without LRO. They are measured right before packets are delivered to the VM (but after LRO is performed for the LRO configuration). It is clearly shown that fewer but larger packets are delivered to the VM with LRO. For example, with 64KB messages, the packet rate delivered to the VM decreases from 815K packets per second (pps) to 113Kpps with LRO, while the packet size increases from 1.5KB to 10.5KB. The number of interrupts generated for the guest also becomes smaller accordingly, helping to improve overall CPU efficiency and throughput.

When the client machine generates packets, those with a size larger than the MTU are split into smaller MTU-sized packets before being sent out. With a larger message size, more MTU-sized packets are produced and the packet rate received by the vSphere host increases accordingly. This is why the packet rate becomes higher for 16KB and 64KB messages than 256 byte messages without LRO in Figure 3. LRO aggregates the received packets before they hit the VM so the packet rate remains low regardless of the message size in the figure.

Figure 1. CPU efficiency comparison with and without LRO with Native-VM traffic

Figure 2. Throughput comparison with and without LRO with Native-VM traffic

Figure 3. Packet rate and size comparison with and without LRO with Native-VM traffic, when packets are delivered to the VM

2. VM to VM Local Traffic

LRO is also beneficial in VM-VM local traffic where VMs are located in the same host, communicating with each other through a virtual switch. Figures 4 and 5 depict CPU efficiency and throughput comparisons with and without LRO with two VMs on the same host sending and receiving TCP flows. The same message and socket sizes as the Native-VM tests above are used.

From Figure 4, LRO improves CPU efficiency by 15%, 92%, and 90% with 256 bytes, 16KB, and 64KB byte messages respectively, when compared to the case without LRO. Figure 5 shows throughput also improves by 20% with 256 byte messages (0.8Gbps to 1.0Gbps), by 103% with 16KB messages (9.0Gbps to 18.4Gbps), and by 142% with 64KB (11.7Gbps to 28.4Gbps).

Without LRO, big packets with a size larger than the MTU need to be split before delivered to the receiving VM, similar to Native-VM traffic. This is because the receiver cannot handle those big packets. LRO saves the time spent in both splitting packets and receiving smaller packets since packet splitting also happens on the vSphere host with VM-VM Local traffic. This explains why the improvement with 16KB and 64KB messages is higher than that of Native-VM traffic. The absolute CPU efficiency in VM-VM local traffic can become lower than that in Native-VM traffic since the CPU time of both sending and receiving VMs are included for this calculation.

As expected, the packet rate decreases while the average packet size increases with LRO as shown in Figure 6. For example, with 64KB messages, the packet rate delivered to the VM becomes reduced from 1009Kpps to 240Kpps, while the packet size gets increased from 1.6KB to 14.9KB.

The packet rate becomes higher for 256 byte messages with LRO, most likely because round-trip time (RTT) gets reduced due to the use of LRO. With the average packet size being similar to each other between LRO and without LRO, this effectively helps to improve throughput and correspondingly CPU efficiency, as seen in Figure 4 and 5.

Figure 4. CPU efficiency with and without LRO with VM-VM Local traffic

Figure 5. Throughput comparison with and without LRO with VM-VM Local traffic

Figure 6. Packet rate and size comparison with and without LRO with VM-VM Local traffic, when packets are delivered to the VM

Enable or Disable LRO on a VMXNET3 Adapter on a Windows VM

LRO is enabled by default for VMXNET3 adapters on vSphere 6.0 hosts, but you must set RSC to be enabled globally for Windows 8 VMs. For more information about configuring this, see the documentation.

Conclusion

This blog shows that enabling LRO for Windows Server 2012 and Windows 8 VMs on a vSphere host using VMXNET3 considerably enhances the CPU efficiency and correspondingly improves throughput for TCP traffic.