Adaptive Network Traffic Shaping with the vSAN Express Storage Architecture

The Express Storage Architecture (ESA) in vSAN 8 can deliver performance and efficiency in ways that were unachievable in the Original Storage Architecture (OSA) of vSAN. vSAN can process and store data at a rate higher than ever before, which inevitably shifts contention to other parts of the stack when resources are under high load. To accommodate this, VMware has introduced new intelligence that automatically provides the optimal balance of vSAN traffic during times of network contention.

Let’s learn more about adaptive network traffic shaping in the vSAN ESA, and why it is so important.

Managing a Shifting Bottleneck

Storage devices have often been the primary bottleneck in a storage system. Whether it was the rotational latency of spinning disks in hybrid systems, or older SAS and SATA interfaces that limited the performance capabilities of NAND flash, the storage devices set the limit. The network was rarely the primary bottleneck unless something was configured incorrectly.

Managing I/O that is contending for resources is important, as it can help prioritize the different types of storage traffic optimally. Resynchronizations are important to vSAN’s ability to ensure data is compliant with the prescribed storage policy, and to balance data across the cluster. vSAN’s Adaptive Resync has been extremely effective in regulating the flow of resynchronization I/O versus VM I/O when this type of storage device contention in a host occurs.

The data path efficiency of the vSAN ESA, paired with the extreme performance capabilities of NVMe-based TLC flash devices, means that I/O can be processed at an extremely high rate. If the workloads demand it, the ESA can push throughput to near device-level rates. This higher rate of processing I/Os in a server means higher rates of I/O traversing the network, which can potentially make the network the bottleneck when running highly demanding workloads during times of resynchronization.

To accommodate this, the ESA in vSAN 8 includes an adaptive traffic shaping capability for vSAN I/O traversing a network. It helps ensure that when network contention occurs, vSAN will properly prioritize VM I/O over resynchronization activity. This can help deliver more consistent performance for demanding workloads that may otherwise saturate a network link.

Recommendation. Size your network according to the vSAN ReadyNodes you have selected for use with the ESA. The adaptive network traffic shaping capability provides a way to prioritize different types of vSAN traffic contending for the same network resources, especially when resynchronizations are occurring. Under non-resynchronization conditions, fully saturated network links are an indication of an undersized network, and are an ideal candidate for faster, higher-bandwidth, lower-latency networking.

The networking requirements for vSAN clusters using the ESA are higher than what you’d see with the OSA. This can easily be misinterpreted to mean that the vSAN ESA is more demanding on the network, which is not true. While ReadyNode profiles will define 25Gb or 100Gb as the minimum requirement, this requirement reflects the ESA’s ability to deliver near the device-level performance of the high-performing NVMe-based storage devices approved for use. Ensuring sufficient network resources allows vSAN to exploit the full performance capabilities of the devices under maximum load. If you are migrating production workloads from a vSAN OSA cluster to a vSAN ESA cluster, on average you will see fewer CPU and network resources used for those same workloads. This is because the vSAN ESA uses fewer CPU cycles and fewer network resources to process and store I/O when compared to the vSAN OSA.

How it Works

The adaptive network traffic shaping found in the vSAN 8 ESA takes a similar approach to the logic found in Adaptive Resync, but instead of managing vSAN I/O in the storage stack of a vSAN host, it manages vSAN I/O on the network. When there is no contention, the different types of vSAN traffic can use as much network bandwidth as is available. As shown in Figure 1, if vSAN traffic is saturating the link, vSAN will automatically detect this and throttle the traffic so that no more than 20% of the bandwidth will be used for resynchronization traffic. This helps ensure that VM I/O is always delivered with a higher priority than resynchronization I/O. Much like Adaptive Resync, it is completely automated.

Figure 1. Adaptive network traffic shaping for vSAN traffic in the Express Storage Architecture.
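To make the behavior in Figure 1 concrete, here is a minimal Python sketch of the bandwidth split. It is an illustration only, not vSAN's implementation: the function name, the parameters, and the idea of expressing demand in Gb/s are all assumptions made for this example. Only the 20% cap under saturation comes from the description above.

```python
def allocate_bandwidth(link_gbps, vm_demand_gbps, resync_demand_gbps,
                       resync_cap=0.20):
    """Return (vm_gbps, resync_gbps) for one link.

    Hypothetical model: with no contention, both traffic types get
    what they ask for. When combined demand exceeds the link,
    resynchronization traffic is held to at most `resync_cap` of the
    link and VM I/O gets the remainder.
    """
    total_demand = vm_demand_gbps + resync_demand_gbps
    if total_demand <= link_gbps:
        # No contention: nothing is throttled.
        return vm_demand_gbps, resync_demand_gbps

    # Contention: cap resync traffic first, then give VM I/O the rest.
    resync = min(resync_demand_gbps, resync_cap * link_gbps)
    vm = min(vm_demand_gbps, link_gbps - resync)
    return vm, resync


if __name__ == "__main__":
    # A 25Gb link with heavy VM I/O and a large resynchronization.
    print(allocate_bandwidth(25, vm_demand_gbps=30, resync_demand_gbps=10))
    # The same link with light load on both traffic types.
    print(allocate_bandwidth(25, vm_demand_gbps=8, resync_demand_gbps=10))
```

In the first call the link is saturated, so resynchronization traffic is held to a fifth of the link. In the second call there is headroom, so neither traffic type is limited.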

The mechanism works on a per-host basis, as this avoids some of the challenges that can occur with aggregate-style measurements in a distributed storage system. The distributed object manager of each host monitors the round-trip time (RTT) of vSAN I/Os sent from one host to another and determines a course of action based on the dynamic range the latency falls within. The higher the latency, the less resynchronization traffic is allowed, and the lower the latency, the more resynchronization traffic is allowed.
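The inverse relationship between latency and allowed resynchronization traffic can be sketched as a simple mapping. The thresholds below (1 ms and 5 ms) and the linear interpolation between them are purely hypothetical, chosen for illustration; the actual ranges and the shape of the response inside vSAN are not described here. The 20% floor mirrors the figure mentioned earlier.

```python
def resync_share(rtt_ms, low_ms=1.0, high_ms=5.0,
                 min_share=0.20, max_share=1.0):
    """Map an observed RTT to the fraction of bandwidth resync may use.

    Hypothetical model: at or below `low_ms` resync is unrestricted,
    at or above `high_ms` it is held to `min_share`, and in between
    the share falls linearly as latency rises.
    """
    if rtt_ms <= low_ms:
        return max_share
    if rtt_ms >= high_ms:
        return min_share
    # Position of the RTT within the range, from 0.0 (low) to 1.0 (high).
    fraction = (rtt_ms - low_ms) / (high_ms - low_ms)
    return max_share - fraction * (max_share - min_share)


if __name__ == "__main__":
    for rtt in (0.5, 2.0, 3.0, 4.0, 8.0):
        print(f"RTT {rtt:.1f} ms -> resync share {resync_share(rtt):.2f}")
```

Because each host would evaluate this against its own measurements, a busy host can throttle its resynchronization traffic while a quiet host elsewhere in the cluster does not.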

The measurements are collected and analyzed using a weighted rolling average that gives more recent signals more weight than older signals. This signaling is what helps vSAN determine the course of action: whether it should throttle, where, and by how much. There are other mechanisms in place that help vSAN step the throttle up and down in a controlled way, and prevent any particular type of I/O from being constrained too much. For vSAN stretched clusters, this feature does not attempt to manage the inter-site link (ISL), but will manage the vSAN network I/O traffic within a given site.
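One common way to weight recent samples more heavily is an exponentially weighted moving average, and a controlled step up or step down can be modeled by limiting how far the throttle moves per interval. The sketch below assumes both techniques for illustration; the class name, function name, smoothing factor, and step size are hypothetical, and vSAN's actual weighting and stepping logic may differ.

```python
class RttTracker:
    """Exponentially weighted moving average of RTT samples.

    `alpha` (an assumed value) is the weight given to the newest
    sample; older samples decay geometrically.
    """

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.average = None

    def update(self, sample_ms):
        if self.average is None:
            # First sample seeds the average.
            self.average = sample_ms
        else:
            self.average = (self.alpha * sample_ms
                            + (1 - self.alpha) * self.average)
        return self.average


def step_toward(current_share, target_share, max_step=0.10):
    """Move the resync share toward the target by at most `max_step`."""
    delta = target_share - current_share
    if abs(delta) <= max_step:
        return target_share
    return current_share + max_step if delta > 0 else current_share - max_step


if __name__ == "__main__":
    tracker = RttTracker()
    for sample in (1.0, 1.2, 6.0, 6.5, 6.2):
        print(f"sample {sample} ms -> smoothed {tracker.update(sample):.2f} ms")
```

Smoothing keeps a single slow I/O from triggering a throttle, while the step limit avoids abrupt swings in how much bandwidth resynchronization traffic receives.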

The vSAN adaptive network traffic shaping does not replace Network I/O Control (NIOC). It works within what it is granted by NIOC or any other logical or physical limitations imposed on the network. It is also quite possible that with the vSAN ESA, you may see few or no congestion metrics. The congestion metric served as a feedback mechanism to measure contention on the storage devices and take appropriate action. With the ESA, the devices will rarely be the point of contention.

Summary

vSAN’s new adaptive network traffic shaping for the ESA does for network contention what Adaptive Resync does for storage device contention. It is an impressive new capability that manages vSAN traffic over the network automatically so that you don’t have to.

@vmpete