VMware vSAN has had a profound impact on the design, operation, and optimization of the modern data center. The shift to hyper-converged infrastructure (HCI) not only changes how data centers are designed, it also shifts the relative importance of the discrete elements that make up the architecture. By eliminating the shared storage array found in traditional three-tier architectures, HCI converges those responsibilities onto the hosts and the fabric that connects them. This means that storage I/O traffic that once moved across an isolated fabric to a storage array now runs over IP networks between hosts.
Conventional wisdom holds that reliable connectivity is important. Yet lost in this truism is just how important a fast and reliable network really is. After all, TCP/IP was built to accommodate the potential for unreliable delivery of data. While it does a good job, there is often little understanding of how much both transient and persistent network conditions can impede performance. An occasional dropped packet may go unnoticed while browsing the internet, but with HCI the impact can be significant, because HCI relies on inter-host connectivity to deliver storage I/O in a consistent and timely manner.
Illustrating the impact of packet loss and latency against IOPS
Let's look at this in more detail. The following two illustrations came from data provided by Andreas Scherr, a Sr. Solutions Architect at VMware, who presented on this topic with Cormac Hogan at VMworld in Las Vegas last year.
As shown in Figure 1, a packet loss of just 1% produces a 10% degradation in write I/Os per second (IOPS). With 2% packet loss, IOPS are reduced by 32%. The dramatic fall-off doesn't stop there: 5% packet loss brings a 77% reduction in IOPS, and at 10% packet loss, IOPS drop by more than 92%.
Figure 1. Impact of Network Packet Loss on IOPS
The performance impact illustrated in Figure 1 is consistent with what any traffic using a connection-oriented protocol experiences over a network connection suffering packet loss. This type of packet loss can happen for many reasons. There could be issues with the host NIC devices and drivers, network cabling, network connectors, or the switches. Depending on the cause, the behavior may surface only as demand on the environment increases. No matter what the cause, the effect is the same: retransmissions, and a dramatic reduction in performance.
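To give a feel for why loss hurts connection-oriented traffic so disproportionately, here is a minimal Python sketch of the well-known Mathis et al. approximation for steady-state TCP throughput, which bounds throughput by roughly (MSS / RTT) × (C / √p) for loss rate p. This is a generic TCP model swapped in for illustration, not the method behind Figure 1, and it should not be expected to reproduce those measured numbers. The MSS and round-trip time in the example are assumed values.

```python
import math

MATHIS_C = math.sqrt(3 / 2)  # constant from the Mathis et al. approximation (~1.22)


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate upper bound on steady-state TCP throughput, in bits per second."""
    if mss_bytes <= 0 or rtt_s <= 0:
        raise ValueError("mss_bytes and rtt_s must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * (MATHIS_C / math.sqrt(loss_rate))


def relative_throughput(loss_rate: float, reference_loss_rate: float) -> float:
    """Throughput at loss_rate as a fraction of throughput at reference_loss_rate."""
    return math.sqrt(reference_loss_rate / loss_rate)


if __name__ == "__main__":
    # Hypothetical link: 1460-byte MSS, 1 ms round-trip time (assumed values).
    for loss in (0.001, 0.01, 0.02, 0.05, 0.10):
        mbps = mathis_throughput_bps(1460, 0.001, loss) / 1e6
        print(f"loss {loss:.1%}: at most ~{mbps:,.0f} Mbit/s per connection")
```

In this model the loss rate and the round-trip time both sit in the denominator, so a connection that is already losing packets is penalized further by every additional millisecond of delay.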
In Figure 2, we see that introducing latency into an environment has a much more predictable and linear impact on the effective number of IOPS that can be delivered. When the latency is 5 milliseconds (ms), IOPS are reduced by 30%. When latency is increased to 10ms, IOPS are reduced by 50%.
Figure 2. Impact of latency on IOPS
Latency can occur anywhere in the stack. The latency observed by a VM is the sum of the latency introduced by each resource an I/O passes through as it traverses the stack. Packet loss and latency can also compound each other: an environment suffering from packet loss will suffer even more when subjected to higher latency anywhere in the stack.
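The additive nature of latency can be sketched with Little's Law, which says that the number of I/Os in flight equals throughput multiplied by latency. Assuming a workload that keeps a fixed number of I/Os outstanding, the achievable IOPS is that number divided by the total latency. The layer names and millisecond values below are hypothetical and chosen only to keep the arithmetic easy; this is an illustrative model, not the data behind Figure 2.

```python
from typing import Mapping


def total_latency_ms(components_ms: Mapping[str, float]) -> float:
    """Latency seen by the VM: the sum of latency added at each layer (ms)."""
    return sum(components_ms.values())


def effective_iops(outstanding_ios: int, components_ms: Mapping[str, float]) -> float:
    """Little's Law: IOPS = outstanding I/Os / latency (latency given in ms)."""
    total = total_latency_ms(components_ms)
    if total <= 0:
        raise ValueError("total latency must be positive")
    return outstanding_ios * 1000.0 / total


if __name__ == "__main__":
    # Hypothetical per-layer latencies in milliseconds.
    healthy = {"guest": 0.5, "network": 0.5, "device": 1.0}
    degraded = dict(healthy, network=2.5)  # only the network layer got slower

    for label, layers in (("healthy", healthy), ("degraded", degraded)):
        print(label, total_latency_ms(layers), "ms ->",
              effective_iops(32, layers), "IOPS at 32 outstanding I/Os")
```

Because the layers add up, a delay introduced in the network alone lowers the IOPS the VM can achieve even when the storage devices themselves are no slower.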
Using vSAN to identify packet loss and latency
Thankfully, VMware vSAN provides great visibility to better understand packet loss and latency. Since vSAN is integrated directly into the hypervisor, it measures the right data, from the right location, and in the right way. The performance service found in vSAN goes into detail on several critical metrics. Its ability to identify discrete elements of a system is how it provides granular detail on things like resynchronization traffic.
The vSAN performance service has always been able to measure latency at various levels. vSAN 6.6 extended that visibility to include packet loss rates. As shown in Figure 3, packet loss can be identified for a specific physical host adapter, along with whether it occurred on inbound or outbound traffic.
Figure 3. Packet loss rate based on Physical adapters, as found in the vSAN Performance Service
Packet loss rates can also be identified on a VMkernel adapter, or an aggregate of VMkernel adapters, as shown in Figure 4. The latter can be helpful when vSAN traffic is configured to use more than one VMkernel per host.
Figure 4. Packet loss rate based on VMkernel adapters, as found in the vSAN Performance Service
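As a rough illustration of what a per-adapter and an aggregate loss rate mean, the sketch below derives both from packet counters. The article does not describe how the vSAN performance service computes these values internally, so the `AdapterCounters` type, its field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class AdapterCounters:
    """Hypothetical counters for one adapter over one sampling interval."""
    name: str
    packets_ok: int
    packets_dropped: int


def loss_rate(c: AdapterCounters) -> float:
    """Fraction of packets dropped on a single adapter."""
    total = c.packets_ok + c.packets_dropped
    return c.packets_dropped / total if total else 0.0


def aggregate_loss_rate(adapters: Iterable[AdapterCounters]) -> float:
    """Loss rate across adapters, weighted by how much traffic each carried."""
    adapters = list(adapters)
    dropped = sum(a.packets_dropped for a in adapters)
    total = sum(a.packets_ok + a.packets_dropped for a in adapters)
    return dropped / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical sample: one busy adapter with drops, one nearly idle and clean.
    vmk1 = AdapterCounters("vmk1", packets_ok=9_900, packets_dropped=100)
    vmk2 = AdapterCounters("vmk2", packets_ok=100, packets_dropped=0)
    for vmk in (vmk1, vmk2):
        print(f"{vmk.name}: {loss_rate(vmk):.2%}")
    print(f"aggregate: {aggregate_loss_rate([vmk1, vmk2]):.2%}")
```

The aggregate here is computed from summed counters rather than by averaging the per-adapter rates, so a lightly used, healthy adapter does not mask loss on the adapter carrying most of the traffic.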
The performance service allows a user to specify a window of time ranging from 1 hour to 24 hours, anywhere within a 90-day period, to track down and isolate particular events. Other tools like VMware vRealize Operations and vRealize Log Insight can also be used to augment this visibility and alert administrators to defined conditions.
The need for fast, deterministic storage performance places additional emphasis on the consistent and reliable delivery of network traffic among hosts in an HCI environment. Use the vSAN performance service as your first step in gaining better visibility into an element of the infrastructure that is often overlooked, and underappreciated in its influence on performance.