This post was co-authored by Justin Pettit, Staff Engineer, Networking & Security Business Unit at VMware, and Ravi Shekhar, Distinguished Engineer, S3BU at Juniper Networks.
As discussed in other blog posts and presentations, long-lived, high-bandwidth flows (elephants) can negatively affect short-lived flows (mice). Elephant flows send more data, which can lead to queuing delays for latency-sensitive mice.
VMware demonstrated the ability to use a central controller to manage all the forwarding elements in the underlay when elephant flows are detected. In environments that do not have an SDN-controlled fabric, an alternate approach is needed. Ideally, the edge can identify elephants in such a way that the fabric can use existing mechanisms to treat mice and elephants differently.
Differentiated services (diffserv) were introduced to bring scalable service discrimination to IP traffic. This is done using Differentiated Services Code Point (DSCP) bits in the IP header to signal different classes of service (CoS). There is wide support in network fabrics to treat traffic differently based on the DSCP value.
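To make the marking concrete, here is a minimal sketch of how a DSCP value sits in the IP header. The DSCP occupies the upper six bits of the IPv4 TOS byte (the IPv6 Traffic Class byte), and the lower two bits carry ECN. The `ELEPHANT_DSCP` value and the helper names are hypothetical and chosen only for illustration; the post does not say which code point is used.

```python
# Hypothetical code point for elephants (10 is AF11); the post does not name one.
ELEPHANT_DSCP = 10


def dscp_to_tos(dscp: int, ecn: int = 0) -> int:
    """Pack a 6-bit DSCP and a 2-bit ECN field into the 8-bit TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits (0-63)")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must fit in 2 bits (0-3)")
    return (dscp << 2) | ecn


def tos_to_dscp(tos: int) -> int:
    """Extract the DSCP (upper six bits) from a TOS byte."""
    return (tos >> 2) & 0x3F


if __name__ == "__main__":
    tos = dscp_to_tos(ELEPHANT_DSCP)
    print(f"DSCP {ELEPHANT_DSCP} -> TOS byte 0x{tos:02x}")  # prints 0x28
```

A fabric device classifying on DSCP does the reverse operation (`tos_to_dscp`) and maps the result to a forwarding class.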
A modified version of Open vSwitch allows us to identify elephant flows and mark the DSCP value of the outer IP header. The fabric is then configured to handle packets with the “elephant” DSCP value differently from the mice.
Detecting and Marking Elephants with Open vSwitch
Open vSwitch’s location at the edge of the network gives it visibility into every packet in and out of each guest. As such, the vSwitch is in the ideal location to make per-flow decisions such as elephant flow detection. Because environments are different, our approach provides multiple detection mechanisms and actions so that they can be used and evolve independently.
An obvious approach to detection is to just keep track of how many bytes each flow has generated. By this definition, if a flow has sent a large amount of data, it is an elephant. In Open vSwitch, the number of bytes and an optional duration can be configured. By using a duration, we can ensure that we don’t classify very short-lived flows as elephants. We can also avoid identifying low-bandwidth but long-lived flows as elephants.
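The byte-count mechanism can be sketched roughly as follows. This is not the Open vSwitch implementation. The class and parameter names are hypothetical, and the sketch assumes one particular reading of the duration: a flow is an elephant once it has sent at least `byte_threshold` bytes and has existed for at least `min_duration` seconds. Other readings (for example, counting bytes within a sliding window) are equally plausible.

```python
from dataclasses import dataclass
from typing import Dict, Hashable


@dataclass
class FlowStats:
    first_seen: float          # timestamp of the first packet in the flow
    byte_count: int = 0        # total bytes observed so far
    is_elephant: bool = False  # sticky once set


class ByteCountDetector:
    """Hypothetical detector: byte threshold plus an optional minimum age."""

    def __init__(self, byte_threshold: int, min_duration: float = 0.0) -> None:
        self.byte_threshold = byte_threshold
        self.min_duration = min_duration
        self.flows: Dict[Hashable, FlowStats] = {}

    def observe(self, flow_key: Hashable, packet_len: int, now: float) -> bool:
        """Account for one packet; return True if the flow is an elephant."""
        stats = self.flows.setdefault(flow_key, FlowStats(first_seen=now))
        stats.byte_count += packet_len
        if not stats.is_elephant:
            old_enough = (now - stats.first_seen) >= self.min_duration
            big_enough = stats.byte_count >= self.byte_threshold
            stats.is_elephant = old_enough and big_enough
        return stats.is_elephant
```

In this sketch the flow key would typically be the 5-tuple, and the caller supplies the timestamp so the logic is easy to test without a real clock.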
An alternate approach looks at the size of the packet that is being given to the NIC. Most NICs today support TCP Segmentation Offload (TSO), which allows the transmitter (e.g., the guest) to give the NIC TCP segments up to 64KB, which the NIC chops into MSS-sized packets to be placed on the wire.
Because of TCP’s slow start, the transmitter does not immediately begin sending maximum-sized segments to the NIC. Because the vSwitch sits between the guest and the NIC, it can watch the TCP window open and tag elephants earlier and more definitively. This is not possible at the top-of-rack (TOR) switch or anywhere else in the fabric, since those devices see only the segmented version of the traffic.
Open vSwitch may be configured to track all flows with packets of a specified size. For example, by looking for only packets larger than 32KB (which is much larger than jumbo frames), we know the transmitter is out of slow-start and making use of TSO. There is also an optional count, which will trigger when the configured number of packets with the specified size is seen.
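A rough sketch of the size-based mechanism follows. It is illustrative only and does not reflect the actual Open vSwitch code. The names `SegmentSizeDetector`, `min_segment_bytes`, and `trigger_count` are hypothetical, and whether the size comparison is strict or inclusive is an assumption (inclusive here).

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Set


class SegmentSizeDetector:
    """Hypothetical detector: flag a flow after N pre-TSO segments of a given size."""

    def __init__(self, min_segment_bytes: int, trigger_count: int = 1) -> None:
        self.min_segment_bytes = min_segment_bytes
        self.trigger_count = trigger_count
        self.large_segments: DefaultDict[Hashable, int] = defaultdict(int)
        self.elephants: Set[Hashable] = set()

    def observe(self, flow_key: Hashable, segment_bytes: int) -> bool:
        """Account for one segment handed to the NIC; True if flow is an elephant."""
        if flow_key in self.elephants:
            return True
        if segment_bytes >= self.min_segment_bytes:
            self.large_segments[flow_key] += 1
            if self.large_segments[flow_key] >= self.trigger_count:
                self.elephants.add(flow_key)
                return True
        return False


if __name__ == "__main__":
    detector = SegmentSizeDetector(min_segment_bytes=32 * 1024, trigger_count=3)
    # Segment sizes grow as the TCP window opens during slow start.
    for size in (1448, 4344, 33000, 40000, 65000):
        print(size, detector.observe("flow-a", size))
```

With a threshold of 32 KB and a count of three, the example flow is flagged only on its third large segment, after it has clearly left slow start.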
Some newer networking hardware mitigates elephant flows by giving higher priority to small flows. It does this by tracking all flows and placing new flows in a special high-priority queue. Once the number of packets in a flow crosses a threshold, that flow’s subsequent packets are placed in the standard-priority queue.
This same effect can be achieved using the modified Open vSwitch and a standard fabric. For example, by choosing a packet size of zero and threshold of ten packets, each flow will be tracked in a hash table in the kernel and tagged with the configured DSCP value when that flow has generated ten packets. Whether mice are given a high priority or elephants are given a low priority, the same effect is achieved without the need to replace the entire fabric.
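As a sketch of the marking side of this example, the snippet below counts packets per flow and returns the DSCP to apply to each one. The names and DSCP values are hypothetical. Whether marking starts on the tenth packet or the one after it is an assumption here (the tenth packet is the first one marked).

```python
from collections import defaultdict
from typing import DefaultDict, Hashable

DEFAULT_DSCP = 0    # best effort; packets from young flows keep this value
ELEPHANT_DSCP = 10  # hypothetical "elephant" code point, not from the post


class PacketCountMarker:
    """Hypothetical marker: tag a flow once it has generated `threshold` packets."""

    def __init__(self, threshold: int = 10, elephant_dscp: int = ELEPHANT_DSCP) -> None:
        self.threshold = threshold
        self.elephant_dscp = elephant_dscp
        # Stand-in for the per-flow hash table kept in the kernel.
        self.packet_counts: DefaultDict[Hashable, int] = defaultdict(int)

    def dscp_for_packet(self, flow_key: Hashable) -> int:
        """Count this packet and return the DSCP it should carry."""
        self.packet_counts[flow_key] += 1
        if self.packet_counts[flow_key] >= self.threshold:
            return self.elephant_dscp
        return DEFAULT_DSCP


if __name__ == "__main__":
    marker = PacketCountMarker(threshold=10)
    print([marker.dscp_for_packet("flow-a") for _ in range(12)])
```

The fabric can then map `DEFAULT_DSCP` to a high-priority queue and the elephant code point to the standard queue, or the reverse, depending on which class the operator prefers to treat as the exception.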
Handling Elephants with Juniper Devices
Juniper TOR devices (such as the QFX5100) and aggregation devices (such as the MX and EX9200) provide a rich diffserv CoS model to achieve these goals in the underlay. Its capabilities include:
- Elaborate controls for packet admittance with dedicated and shared limits. Dedicated limits provide a minimum service guarantee, and shared limits allow statistical sharing of buffers across different ports and priorities.
- A large number of flexibly assigned queues: up to 2960 unicast queues at the TOR and 512K at the aggregation device.
- Enhanced and varied scheduling methods to drain these queues: strict and round-robin scheduling with up to four levels of hierarchical schedulers.
- Shaping and metering to control the rate of injection of traffic from different queues of a TOR in the underlay network. By doing this, bursty traffic at the edge of the physical network can be leveled out before it reaches the more centrally shared aggregation devices.
- Sophisticated controls to detect and notify congestion, and set drop thresholds. These mechanisms detect possible congestion in the network sooner and notify the source to slow down (e.g. using ECN).
With this level of flexibility, it is possible to configure these devices to:
- Enforce minimum bandwidth allocation for mice flows and/or maximum bandwidth allocation for elephant flows on a shared link.
- When experiencing congestion, drop (or ECN mark) packets of elephant flows more aggressively than those of mice flows. This causes the TCP connections of elephant flows to back off sooner, which alleviates congestion in the network.
- Take a different forwarding path for elephant flows from that of mice flows. For example, a TOR can forward elephant flows towards aggregation switches with big buffers and spread mice flows towards multiple aggregation switches that support low-latency forwarding.
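The second option above, dropping or marking elephants more aggressively under congestion, can be illustrated with a generic WRED-style drop profile. This is a conceptual sketch only. It is not Juniper configuration syntax, and the profile numbers and names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DropProfile:
    start_fill: float  # queue fill (0.0-1.0) at which dropping/marking begins
    end_fill: float    # queue fill at which every packet is dropped/marked
    max_prob: float    # drop probability reached just before end_fill


def drop_probability(queue_fill: float, profile: DropProfile) -> float:
    """Linear ramp from 0 to max_prob between start_fill and end_fill."""
    if queue_fill <= profile.start_fill:
        return 0.0
    if queue_fill >= profile.end_fill:
        return 1.0
    span = profile.end_fill - profile.start_fill
    return profile.max_prob * (queue_fill - profile.start_fill) / span


# Hypothetical profiles: elephants start being dropped earlier and harder.
ELEPHANT_PROFILE = DropProfile(start_fill=0.3, end_fill=0.7, max_prob=0.5)
MICE_PROFILE = DropProfile(start_fill=0.7, end_fill=0.95, max_prob=0.1)

if __name__ == "__main__":
    for fill in (0.2, 0.5, 0.8):
        print(fill,
              drop_probability(fill, ELEPHANT_PROFILE),
              drop_probability(fill, MICE_PROFILE))
```

With these example numbers, a half-full queue already drops (or ECN marks) a share of elephant packets while leaving mice untouched, so the elephants' TCP senders back off first.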
By inserting some intelligence at the edge and using diffserv, network operators can use their existing fabric to differentiate between elephant flows and mice. Most networking gear provides some capabilities, and Juniper, in particular, provides a rich set of operations that can be used based on the DSCP. Thus, it is possible to reduce the impact of heavy hitters without the need to replace hardware. Decoupling detection from mitigation allows each to evolve independently without requiring wholesale hardware upgrades.