100 Gbps Performance with NSX Data Center

NSX Data Center has shown for some time now (see the VMworld 2016 NSX Performance session, NET8030) that it can drive upwards of 100 Gbps of throughput per node for typical data center workloads. In that VMworld session, we ran a live demo showing that throughput was limited by the actual physical ports on the host, which were 2 x 40 Gbps, and not by NSX Data Center.

Typically, in physical networking, performance is measured in raw packets per second, to ensure that a variety of traffic at variable packet sizes can be forwarded between multiple physical ports. In a virtualized data center this is not the case, as hypervisor hosts only have to serve a few uplinks, typically no more than four physical links. In addition, most virtualized workloads use the TCP protocol. In that case, the ESXi hypervisor forwards TCP data segments in a highly optimized way, so performance is not always governed by the number of packets transferred but by the amount of segment data forwarded in software. In typical data center workloads, TCP optimizations such as TSO, LRO, and RSS or Rx/Tx Filters help drive substantial throughput at hardly any CPU cost. TSO and LRO move large amounts of data with minimal overhead and hence are essential to achieving high throughput with a lower packet rate, similar to using a jumbo MTU. These optimizations are not new to NSX; they are tested and proven TCP stack optimizations that have existed for quite some time in the non-overlay world. What is new is leveraging these existing, tried and tested methods for overlays such as Geneve. If you are not familiar with these optimizations, I highly recommend checking out the following VMworld sessions focused on NSX performance.

  1. VMworld 2016 – NSX Performance Deep Dive (NET8030)
  2. VMworld 2017 – NSX Performance Deep Dive (NET1343BU)

 

Data Plane Development Kit (DPDK) in NSX Data Center

The Data Plane Development Kit (DPDK) is a set of data plane libraries and network interface controller drivers for fast packet processing. DPDK uses a set of optimizations around CPU usage and memory management to improve packet processing speed. Compared to the standard way of packet processing, DPDK decreases the CPU cost while increasing the number of packets processed per second. The DPDK libraries can be used for a variety of use cases, are used by many software vendors, and can be tuned to match the desired performance for generalized or specific use cases.
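To give a sense of what a DPDK application looks like, here is a minimal, illustrative initialization sketch. This is generic DPDK code, not NSX code, and details such as the EAL arguments, pool sizing, and port setup vary by DPDK version and NIC.

```c
/* Minimal DPDK initialization sketch (illustrative only). */
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_debug.h>

int main(int argc, char **argv)
{
    /* The Environment Abstraction Layer (EAL) sets up hugepages and lcores
     * and binds the poll-mode drivers before any packet processing starts. */
    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "Cannot init EAL\n");

    /* Packet buffer pools, port/queue configuration, and the poll loop
     * would follow here; see the sketches later in this post. */
    return 0;
}
```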

Now, with the release of NSX-T Data Center 2.2, we are crossing an important milestone in software-based performance capabilities. In the NSX platform, DPDK is used in two ways: first, in a dedicated general-purpose network appliance called the Bare Metal Edge Node, and second, at the host level in Enhanced Datapath mode, which optimizes host throughput for specific use cases.

In this blog, we will cover both: the DPDK-based bare metal NSX Edge gateway and the Enhanced Datapath mode in the NSX-T Data Center distributed switch.

 

Bare Metal Edge

NSX-T Data Center introduced a DPDK-based Bare Metal Edge to address two specific needs: first, applications requiring higher packet rates with variable packet sizes, and second, sufficient north/south throughput for servers with high speed NICs at 25/50/100 Gbps. Typically, the north/south flows going through the NSX Edges, even though they constitute less than 20% of the total traffic flows in the data center, have a wide variety of packet sizes and higher packet processing requirements. The Bare Metal Edge also supports a variety of services, such as NAT and the Edge firewall, while sustaining high throughput across variable packet sizes.

The following image shows how a wide variety of packet sizes with diverse requirements flows north/south through the NSX Edges.

NSX-T Data Center introduced Data Plane Development Kit (DPDK) based Edges to tackle these heavy packet-processing requirements on the north/south side of the data center. Refer to the NSX-T Data Center Installation Guide for the Bare Metal Edge hardware compatibility list (HCL).

 

Layer 2 Bridge Support with DPDK Based Edge

With the 2.2 release, NSX-T Data Center now supports layer 2 bridging between Geneve encapsulated networks and VLAN based networks on the Edge node. This allows the NSX Data Center admin to leverage the high-performance, DPDK-based capabilities provided by NSX Edge nodes. In addition, an NSX Data Center admin can also implement a firewall at this layer 2 boundary.

 

Expanding NSX-T Data Center Use Cases with an Enhanced Datapath

With NSX-T Data Center 2.2, we have introduced a new mode for the N-VDS (the NSX managed virtual distributed switch in NSX-T Data Center). The Enhanced Datapath mode brings the advantages of DPDK-style packet processing performance to the east/west flows within the data center. This switch mode is designed to support NFV-type applications; it is not intended for generic data center applications or for deployments where a traditional VM-based or bare metal Edge node must be used.

With NFV, the focus shifts from raw throughput to packet processing speed, given the impact this has on VNF density. In these workloads, applications rarely send a small number of large packets; far more often they send many small packets, frequently as small as 128 bytes. TCP optimizations do not help with these workloads. Hence, Enhanced Datapath leverages DPDK to deliver performance for these packet-processing-focused workloads.

Before we look at what this new mode for the N-VDS is and how it helps, let's take a step back to first quickly address the misconceptions surrounding performance in software, and then look at the plumbing that went into creating one of the most powerful packet processing switches in software.

 

Demystifying Factors Influencing Performance in Software

Folks coming from a hardware-based networking background often incorrectly assume that the performance of NSX cannot match that of hardware. However, once they start looking at the fundamental differences in how hardware and software components operate, it becomes easier to understand how NSX delivers performance on par with hardware-based solutions.

For example, a jumbo MTU is a recommended approach for higher throughput in the physical networking world. However, a jumbo MTU today is generally only ~9000 bytes. NSX, on the other hand, leveraging existing TCP optimizations for workloads designed for throughput, actually handles 32K – 64K segments, which is several times larger than a jumbo MTU and is ideal for achieving high throughput. For each 64K segment that goes through the NSX stack, the physical switches may have to forward ~44 packets (64K/1500) to carry the same payload. In other words, NSX is able to send the same amount of payload with a single operation (one packet processed) that would take ~44 operations (~44 packets processed) on the physical network layer with the MTU set to 1500. Hence, the packets-per-second metric does not matter for software performance where the applications are tuned for throughput and leverage TCP optimizations. The VMware IO Compatibility Guide provides a list of drivers for specific NICs that support TCP optimization features such as Geneve offload.
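As a back-of-the-envelope check of the arithmetic above, this small snippet (illustrative only) counts how many wire packets are needed to carry one 64K TCP segment at standard and jumbo MTUs, ignoring header overhead:

```c
#include <stdio.h>

int main(void)
{
    const double segment = 64.0 * 1024;      /* one 64K TSO/LRO segment */
    const double mtus[]  = { 1500, 9000 };   /* standard vs jumbo MTU   */

    for (int i = 0; i < 2; i++)
        printf("MTU %5.0f -> ~%.0f packets per 64K segment\n",
               mtus[i], segment / mtus[i]);
    /* Prints ~44 packets at MTU 1500 and ~7 at MTU 9000, versus a single
     * 64K segment handled in one operation inside the hypervisor. */
    return 0;
}
```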

Another example: hardware networking gear often has 24+ ports to serve in a 1 RU or 2 RU form factor. NSX, on the other hand, is a distributed system spread across racks of servers, with each server generally handling 2 – 4 physical NICs.

For a closer look at this subject, please check out the VMworld 2015 NSX Performance session.

Need for DPDK Based Forwarding

The traditional method of packet processing is interrupt driven. In this model, when packets arrive, the pNIC (physical NIC) sends an interrupt to the CPU, and one of the cores gets assigned to handle the received packets. In this process, the assigned core generally has to do a context switch from its current operation to packet processing. Once the set of packets that were received is processed, the core is reassigned to a different job. This context switching, while efficient for typical virtual data center workloads, becomes expensive with applications that have high packet processing requirements.

As explained in the previous paragraph, there is no core affinity by default for packet processing. Again, while this is not an issue for typical data center workloads, for heavy packet processing applications it may result in an inconsistent latency profile.

While this model is excellent for applications designed to drive high throughput, the limitations above make it a poor fit for NFV-style applications where the focus is on raw packet processing.

 

Elements of Data Plane Development Kit (DPDK)

As mentioned earlier, DPDK is a set of data plane libraries and network interface controller drivers for fast packet processing. The key elements that contribute to its performance are described below.

 

Poll Mode Driver

One of the key changes with DPDK is the Poll Mode Driver (PMD). With the Poll Mode Driver, instead of the NIC sending an interrupt to the CPU once a packet arrives, a core is assigned to poll the NIC to check for any packets. This eliminates the CPU context switching, which is unavoidable in the traditional interrupt mode of packet processing.
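A poll-mode receive loop looks roughly like the following sketch, using DPDK's rte_eth_rx_burst API; it is illustrative only, with error handling and forwarding logic omitted:

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* A dedicated core spins on the NIC receive queue instead of waiting
 * for interrupts. */
static void poll_port(uint16_t port_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* Poll rx queue 0; returns immediately with 0..BURST_SIZE packets. */
        const uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* Process or forward the packet here, then release its buffer. */
            rte_pktmbuf_free(bufs[i]);
        }
    }
}
```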

 

CPU Affinity and Optimization

With DPDK, dedicated cores are assigned to process packets. This ensures consistent latency in packet processing. Also, instruction sets such as SSE, which help with floating point calculations, are enabled and available where needed.
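In DPDK terms, the dedicated cores are worker lcores launched from the main core, as in this sketch (lcore IDs come from the EAL command line; the macro name shown matches DPDK releases of this era):

```c
#include <stdio.h>
#include <rte_lcore.h>
#include <rte_launch.h>

/* Function pinned to each worker lcore, e.g. a poll loop like the one above. */
static int worker_main(void *arg)
{
    (void)arg;
    printf("worker running on lcore %u\n", rte_lcore_id());
    return 0;
}

/* Called from main() after rte_eal_init(): start worker_main on every
 * worker lcore so each packet-processing loop stays on its own core. */
static void launch_workers(void)
{
    unsigned lcore_id;

    RTE_LCORE_FOREACH_SLAVE(lcore_id)
        rte_eal_remote_launch(worker_main, NULL, lcore_id);
}
```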

 

Buffer Management

Buffer management is optimized to represent the packets being processed in a simpler fashion with a low footprint. This helps with faster memory allocation and processing. Buffer allocation is also non-uniform memory access (NUMA) aware.
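For example, DPDK's packet buffer pools take a NUMA socket ID at creation time, so buffers can be placed on the memory local to the NIC. A sketch (the sizes are arbitrary):

```c
#include <rte_mbuf.h>

/* Create a packet buffer pool whose memory lives on the given NUMA socket,
 * typically the socket the NIC is attached to. */
struct rte_mempool *create_pool_on_socket(int socket_id)
{
    return rte_pktmbuf_pool_create(
        "rx_pool",                  /* pool name                       */
        8192,                       /* number of mbufs                 */
        256,                        /* per-lcore cache size            */
        0,                          /* private data size               */
        RTE_MBUF_DEFAULT_BUF_SIZE,  /* data room per mbuf (~2 KB)      */
        socket_id);                 /* NUMA node for the pool's memory */
}
```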

 

DPDK in Action

DPDK, as mentioned earlier, is leveraged by NSX-T Data Center in two ways: first for north/south traffic in the NSX Edge, and second for east/west traffic with the Enhanced Datapath.

 

NSX Edge Deployed via Bare Metal or Virtual Machine

NSX-T Data Center's usage of DPDK was first introduced with the bare metal version of the NSX Edge. This was showcased in the VMworld 2017 NSX Performance session and is a topic for another blog. I would encourage taking a look at that session for more details.

In my testing, the NSX-T Edge, when deployed on bare metal, was able to drive 256-byte packets at close to line rate on a single 10 Gbps port. In this test, traffic was flowing from VMs within the overlay to VMs outside of the overlay domain. Tests were run with iperf, and packets were forced to conform to a specific size by manipulating the MTU on the VMs. The following image shows the topology.

The following graph shows the throughput achieved at various packet sizes with a single 10G link.

In the real world, the MTU is generally around 1500 bytes, with very rare instances of 576 bytes. Packet sizes in those rare 576-byte MTU cases align closely with the 512-byte data point above. As the graph shows, the NSX Edge on bare metal forwards traffic at line rate even for small packet sizes.

 

Enhanced Datapath

The Enhanced Datapath mode is introduced in NSX-T Data Center 2.2, as shown in the screenshot below.

Enhanced Datapath mode is based on the same underlying N-VDS and supports the base switch features such as vMotion. It brings the packet processing performance of DPDK to the compute clusters and east/west traffic. The following are some of the features of Enhanced Datapath mode that drive performance:

 

CPU and Memory Optimization

Enhanced Datapath mode uses DPDK's Poll Mode Driver, which dedicates a core to packet processing and thus reduces context switch overhead, increasing packet processing performance. It is also NUMA aware and balances the alignment of VMs with the system threads and the underlying physical uplinks, which results in lower and more consistent latency. Packet processing runs on lcores, the logical execution units of the processor, also known as hardware threads.

 

Packet Descriptors

Instead of regular packet handlers, Enhanced Datapath uses mbuf, a library that allocates and frees buffers holding packet-related information with low overhead. Traditional packet handlers are heavy to initialize; with mbuf, packet descriptors are simplified, which decreases the CPU overhead of packet initialization. VMXNET3 has also been enhanced to support mbuf-based packets.
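The mbuf API itself is simple: descriptors are popped from and pushed back to a pre-allocated pool rather than being built from scratch per packet. A small illustrative sketch:

```c
#include <rte_mbuf.h>

void mbuf_roundtrip(struct rte_mempool *pool)
{
    /* Allocation is cheap: the descriptor is popped from the pool. */
    struct rte_mbuf *m = rte_pktmbuf_alloc(pool);
    if (m == NULL)
        return;

    /* Reserve 128 bytes of data room and treat it as packet payload. */
    char *payload = rte_pktmbuf_append(m, 128);
    if (payload != NULL)
        payload[0] = 0x42;          /* ...fill in packet data... */

    /* Freeing just pushes the descriptor back to the pool. */
    rte_pktmbuf_free(m);
}
```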

Apart from the above DPDK optimizations, the ESX TCP stack has also been optimized with features such as Flow Cache.

 

Flow Cache

Flow Cache is an optimization that helps reduce the CPU cycles spent on known flows. Flow Cache tables are populated when a new flow starts. Decisions for the rest of the packets within a flow may be skipped if the flow already exists in the flow table.

Flow Cache uses two mechanisms to figure out the fast path decisions for packets in a flow (a conceptual sketch follows below):

  1. If packets from the same flow arrive consecutively, the fast path decision for that flow is stored in memory and applied directly to the rest of the packets in that burst of packets.
  2. If the packets are from different flows, the decision per flow is saved to a hash table and used to decide the next hop for each of the packets of those flows.
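To make this concrete, here is a purely illustrative flow-cache sketch. It is not the NSX implementation; the structures, hash, and collision handling are simplified for clarity:

```c
#include <stdbool.h>
#include <stdint.h>

struct flow_key   { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; uint8_t proto; };
struct flow_entry { struct flow_key key; int action; bool valid; };

#define FLOW_TABLE_SIZE 1024
static struct flow_entry flow_table[FLOW_TABLE_SIZE];

/* Toy hash over the 5-tuple; collisions simply overwrite the entry. */
static uint32_t hash_key(const struct flow_key *k)
{
    return (k->src_ip ^ k->dst_ip ^ ((uint32_t)k->src_port << 16 | k->dst_port)
            ^ k->proto) % FLOW_TABLE_SIZE;
}

static bool key_equal(const struct flow_key *a, const struct flow_key *b)
{
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}

/* Return the cached forwarding decision for a known flow, or run the full
 * (slow path) lookup once and cache its result for later packets. */
int lookup_or_compute(const struct flow_key *k,
                      int (*slow_path)(const struct flow_key *))
{
    struct flow_entry *e = &flow_table[hash_key(k)];

    if (e->valid && key_equal(&e->key, k))
        return e->action;           /* fast path: decision already cached */

    e->key    = *k;                 /* slow path: full lookup, then cache */
    e->action = slow_path(k);
    e->valid  = true;
    return e->action;
}
```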

 

Enhanced Datapath Mode Drivers

Enhanced Datapath mode is supported on Intel 82599 series and Intel 710 series cards. DPDK-enabled drivers are required to enable this functionality on the ESXi hosts; the drivers are available on the my.vmware.com portal.

 

Conclusion

Over the years, we have shown how NSX Data Center has always been ahead of the game from a performance perspective. With the introduction of DPDK-enabled NSX-T Edges, we brought unparalleled packet performance to the off-ramp, i.e., north/south flows. Now, with the NSX-T Data Center 2.2 release, we introduce the DPDK-enabled Enhanced Datapath mode, which enables high performance packet processing for NFV-style workloads on the east/west front.

 

References

  1. Announcing General Availability of VMware NSX-T Datacenter 2.2
  2. NSX-T Data Center 2.2 Release Notes
  3. NSX-T Data Center Installation Guide
  4. VMware IO Compatibility Guide
  5. DPDK
  6. mbuf
  7. lcore
  8. Geneve IETF Draft
  9. VMworld 2015 – NSX Performance (NET5212)
  10. VMworld 2016 – NSX Performance Deep Dive (NET8030)
  11. VMworld 2017 – NSX Performance Deep Dive (NET1343BU)