Troubleshooting performance issues can be a complex undertaking for many administrators, regardless of the underlying infrastructure and topology. A distributed storage platform like vSAN introduces other elements that can influence performance as well, and troubleshooting practices should accommodate those considerations. The guidance below will help an administrator use the metrics found in the vSAN performance service to isolate the source of a performance issue.
When troubleshooting performance issues in a vSAN environment, two of the most common questions I get asked are: 1) Which metrics are most important? and 2) In what order should I look at the metrics? Let’s address these two questions so that you can take action in your own environment more easily.
Reviewing the Performance Troubleshooting Workflow
First, let’s take a look at the basic framework for troubleshooting performance in a vSAN environment, as shown in Figure 1. Each of the five steps is critical to improving the likelihood that the root cause is identified correctly and that mitigation is carried out in a systematic way.
Figure 1. vSAN performance troubleshooting workflow
The leading indicator of any storage-related performance issue on an active VM is guest VM latency. The virtualization administrator is typically made aware of this by one of two alerting mechanisms: complaints from users or administrators, or system alerts that monitor performance thresholds. Determining the cause of the latency is where the troubleshooting process begins, as outlined in Troubleshooting vSAN Performance on StorageHub.
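As an illustration of the second alerting mechanism, here is a minimal sketch that flags a VM whose guest-level latency samples stay above a threshold for a sustained window. The sample values, threshold, and window size are assumptions made for the example; in practice the data and alerting would come from the vSAN performance service, vCenter alarms, or your monitoring tool of choice.

```python
# Minimal sketch: flag sustained guest VM latency above a threshold.
# The samples, threshold, and window length below are assumptions for
# illustration only, not values recommended by the vSAN performance service.

LATENCY_THRESHOLD_MS = 20.0   # assumed alerting threshold
SUSTAINED_SAMPLES = 3         # assumed number of consecutive samples required

def latency_alert(samples_ms, threshold=LATENCY_THRESHOLD_MS, window=SUSTAINED_SAMPLES):
    """Return True if latency exceeds the threshold for `window` consecutive samples."""
    streak = 0
    for value in samples_ms:
        streak = streak + 1 if value > threshold else 0
        if streak >= window:
            return True
    return False

# Hypothetical interval samples (ms) for a guest VM's virtual disk.
vm_samples = [4.1, 5.0, 22.3, 25.7, 31.2, 8.4]
print(latency_alert(vm_samples))  # True: three consecutive samples above 20 ms
```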
Which Metrics are Most Important?
Unfortunately, there is no clear-cut answer, as the metrics available in the vSAN performance service all relate to each other in some form or another. The conditions of the environment and the root cause of a performance issue dictate which metrics are more significant than others. This is why the discovery process (steps #2 and #3 in the troubleshooting workflow) is so critical. It is important to understand the conditions of the environment before true insight can be gained from the performance metrics. A discrete metric may provide very little assistance when viewed in isolation, but becomes meaningful when viewed alongside other metrics.
Storage latency is the most prominent of all storage performance metrics, as it defines the time to complete and acknowledge the delivery of an I/O, and is typically reported in milliseconds (ms). It is the time the system must wait before processing subsequent I/Os, or executing other commands waiting on that I/O. Within the hypervisor, latency can be measured for just a portion of the storage stack (visible via ESXTOP) or for the entire end-to-end path, from the VM to the storage device. Note that latency is a conditional metric: it gives no context about the number of I/Os experiencing that latency, and it represents only the location at which it is measured. Latency can be measured at numerous points up and down the storage stack, which is why the order in which the metrics are viewed becomes critical.
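To make the point that latency needs the accompanying I/O volume to be meaningful, the small sketch below compares a simple average of per-interval latency with an IOPS-weighted average. The interval data is invented for the example; the same idea applies when viewing the latency and IOPS charts side by side in the vSAN performance service.

```python
# Minimal sketch: the same latency series can tell two different stories
# depending on how much I/O actually experienced it. The interval data
# below is invented purely for illustration.

intervals = [
    # (IOPS during interval, average latency in ms)
    (12000, 1.8),
    (11500, 2.1),
    (150,   45.0),   # a latency spike, but on a tiny amount of I/O
]

simple_avg = sum(lat for _, lat in intervals) / len(intervals)
weighted_avg = sum(iops * lat for iops, lat in intervals) / sum(iops for iops, _ in intervals)

print(f"Simple average latency: {simple_avg:.1f} ms")   # ~16.3 ms
print(f"IOPS-weighted latency:  {weighted_avg:.1f} ms") # ~2.2 ms
```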
The Order of Review for Metrics
Once the discovery steps in the troubleshooting framework have been completed, the process of using the performance metrics can begin. The order in which the metrics are viewed can help decipher at what level contention may be occurring. Figure 2 shows the order in which the metrics can be viewed to better understand and isolate the issue, and it is the same order used in “Appendix C: Troubleshooting Example” in the Troubleshooting vSAN Performance document.
Figure 2. Viewing order of performance metrics
Below is a bit more context for each step:
- View metrics at the VM level to confirm that the VM in question is experiencing unusually high storage-related latency. Verify that the storage latency is in fact seen by the guest VM.
- View metrics at the cluster level to provide context and look for any other anomalies. This will help identify potential “noise” coming from somewhere else in the cluster.
- View metrics on the host to isolate the type of storage I/O associated with the identified latency.
- View metrics on the host at the disk group level to determine the type and source of the latency.
- View metrics on the host, looking at host network and VMkernel metrics, to determine if the issue is network related.
Steps #3 through #5 assume that you have identified the specific hosts where the VM’s objects reside, which can be easily accomplished in the vCenter UI. For simplicity, host-level metrics should be reviewed only on the hosts where the objects for the particular VM in question reside.
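A compact way to keep the review order straight is to treat the five levels from Figure 2 as an ordered checklist, as in the sketch below. The level names and goals are descriptive summaries of the steps above only; they do not correspond to specific views or API identifiers in the vSAN performance service.

```python
# Minimal sketch: the review order from Figure 2 as an ordered checklist.
# Level names and goals are descriptive only, not API or UI identifiers.

REVIEW_ORDER = [
    ("VM",                "Confirm unusually high storage latency as seen by the guest VM"),
    ("Cluster",           "Provide context and look for 'noise' elsewhere in the cluster"),
    ("Host",              "Isolate the type of storage I/O associated with the latency"),
    ("Disk group",        "Determine the type and source of the latency"),
    ("Network/VMkernel",  "Determine whether the issue is network related"),
]

for step, (level, goal) in enumerate(REVIEW_ORDER, start=1):
    print(f"Step {step}: {level:<18} -> {goal}")
```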
Recommendation: Be diligent and deliberate when making changes to your environment in an effort to improve performance. Changing multiple settings at once, overlooking a simple configuration issue, or not measuring the changes in performance can often make the situation worse and more complex to resolve.
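One lightweight way to keep changes measurable is to capture a latency baseline before a single change and compare it with a sample taken afterward. The helper below is a hypothetical illustration with made-up values; the samples would come from whatever comparable interval and workload you export from the performance service.

```python
# Minimal sketch: compare latency before and after a single, documented change.
# The baseline and post-change samples are hypothetical; capture them over
# comparable intervals and workloads for a fair comparison.
from statistics import mean

def compare(baseline_ms, after_ms):
    """Report the change in mean latency after one change."""
    before, after = mean(baseline_ms), mean(after_ms)
    print(f"Before: {before:.1f} ms  After: {after:.1f} ms  Delta: {after - before:+.1f} ms")

compare([18.0, 20.0, 21.0, 21.0], [10.0, 10.0, 11.0, 11.0])
# Before: 20.0 ms  After: 10.5 ms  Delta: -9.5 ms
```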
Summary
While the process of tracking down the primary contributors to performance issues can be complex, there are practices that can help simplify the process and improve time to resolution. This information, paired with the “Troubleshooting vSAN Performance” guide on StorageHub, is a great start to better understanding how to diagnose and address performance issues in your own vSAN environment.