Determining the root cause of performance issues in any environment can be a challenge. In environments running dozens, if not hundreds, of virtual workloads, pinpointing the exact causes and understanding the options for mitigation can be difficult for even an experienced administrator. Since a vSAN cluster is made up of locally attached disks, there are some truly insightful tools built into vCenter that can monitor performance from the application all the way down to the individual storage disk.
The purpose of this post is to highlight the key areas for monitoring vSAN performance. For detailed information on troubleshooting vSAN performance, be sure to read the Troubleshooting vSAN Performance white paper by Pete Koehler.
Monitoring vSAN Performance
Historically, administrators have had numerous options for where to measure performance data: inside the guest VM, at various layers in the hypervisor, and, when using traditional three-tier storage, at the storage array. Unfortunately, these options often provide incomplete or inaccurate results. The storage array, for example, doesn’t have end-to-end insight into the entire storage path the VM uses. Measuring data at the VM can be equally challenging because the operating system assumes sole proprietorship of resources and can’t account for the sharing of resources. vCenter is the ideal monitoring plane because data is pulled from the most intelligent location in the stack: the hypervisor. With vSAN-powered environments, vCenter can see and control storage from end to end.
Now that we’ve established the best place to monitor performance, let’s take a closer look at the ways in which vSAN performance can be monitored.
Guest VM Level
Application and OS behaviors tied to the VM can be viewed at the guest level. Metrics such as IOPS, throughput, and latency are available at this level. They establish the effective behavior that your critical VM is seeing at the ESXi vSCSI layer, which helps isolate the source of a problem more quickly. In addition to Monitor > Performance > Advanced, vSAN admins can also click Monitor > vSAN > Performance and then select the VM or Virtual Disks tab to analyze the individual virtual disks of the VM.
Cluster Level
vSAN is a cluster-based storage solution, which means that VM data does not necessarily reside on the same host as the VM itself. More commonly, parts of the VM will be distributed across several hosts depending on the number of “Failures to Tolerate” (FTT) dictated by the storage policy. For example, a VM with FTT=1 and RAID-1 mirroring will have its disk components residing on two hosts, with a witness component on a third host. More on vSAN Data Placement and Availability here. Measuring at the cluster level helps establish the overall level of activity across all hosts in the cluster.
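The placement arithmetic in the FTT=1 example can be sketched in a few lines of Python. This is only an illustration, assuming the commonly cited rule of thumb that a RAID-1 mirroring policy tolerating n failures keeps n + 1 full replicas and needs at least 2n + 1 hosts. The names `MirrorLayout` and `raid1_layout` are hypothetical, and the exact number of witness components vSAN creates depends on the object layout, so it is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MirrorLayout:
    """Rough RAID-1 layout for one vSAN object (illustrative only)."""
    replicas: int   # full copies of the data
    min_hosts: int  # minimum hosts needed to satisfy the policy


def raid1_layout(ftt: int) -> MirrorLayout:
    """Return replica and minimum host counts for a RAID-1 policy.

    Assumes the rule of thumb: FTT=n needs n + 1 replicas and 2n + 1 hosts.
    """
    if ftt < 0:
        raise ValueError("ftt must be non-negative")
    return MirrorLayout(replicas=ftt + 1, min_hosts=2 * ftt + 1)


# FTT=1 with mirroring: two data replicas spread over at least three hosts,
# the third host being where the witness component lands.
layout = raid1_layout(1)
print(f"replicas={layout.replicas}, minimum hosts={layout.min_hosts}")
```

The takeaway for monitoring is that even a single VM with the default policy touches at least three hosts, which is why a cluster-wide view is the natural starting point.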
Host Level
As mentioned previously, the storage objects that make up a VM won’t necessarily reside on the same host as the VM. To determine where VM objects are located, click on Monitor > vSAN > Physical Disk Placement. Once you have established which host or hosts hold the VM objects, host-level monitoring becomes very useful. The following performance information is available at the Host level:
- VM level statistics (aggregate of all VM objects on the host)
- Backend vSAN statistics
- Physical disks and disk groups (discussed in more detail under Storage Devices and Disk Groups)
- Physical Network adapters
- Host Network and vSAN VMkernel activity
- iSCSI service activity
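Working out which hosts deserve a closer look for a given VM can be sketched as a simple lookup. The snippet below is a minimal illustration, not a vCenter API call: the `PLACEMENT` records, VM names, and host names are made-up data shaped like what one might transcribe from the Physical Disk Placement view, and `hosts_holding_data` is a hypothetical helper. Skipping witness components is a choice made here because they hold only metadata; include them if you also care about witness traffic.

```python
# Hypothetical placement records: (VM name, component type, host).
PLACEMENT = [
    ("db-vm-01", "component", "esxi-01"),
    ("db-vm-01", "component", "esxi-02"),
    ("db-vm-01", "witness", "esxi-03"),
    ("web-vm-01", "component", "esxi-02"),
    ("web-vm-01", "component", "esxi-04"),
    ("web-vm-01", "witness", "esxi-01"),
]


def hosts_holding_data(placement, vm_name):
    """Return the sorted hosts that hold data components for vm_name.

    Witness components are skipped because they carry metadata only.
    """
    return sorted(
        {
            host
            for vm, kind, host in placement
            if vm == vm_name and kind != "witness"
        }
    )


# Hosts whose host-level statistics are worth reviewing for this VM.
for host in hosts_holding_data(PLACEMENT, "db-vm-01"):
    print(host)
```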
Storage Devices and Disk Groups
vSAN architecture consists of two tiers: a cache tier for read caching and write buffering, and a capacity tier for persistent storage. When monitoring vSAN performance, it is important to observe both the cache disk and the capacity disks. Through host-level statistics, vSAN exposes key information about disk groups. Since a disk group is composed of one caching/buffering device and one or more capacity devices, this view can offer a lot of insight into the behavior of an environment including, but not limited to, cache disk destage rates, write buffer free percentages, and the different types of congestion. Performance data can be viewed for the disk group in its entirety, for the caching/buffering device servicing a disk group, or for a capacity device serving a disk group.
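As a rough illustration of how the disk group metrics above might be reviewed together, the following Python sketch flags disk groups whose write buffer is running low or that report congestion. Everything here is an assumption made for the example: the `DiskGroupSample` structure, the sample values, and especially the thresholds, which are arbitrary placeholders rather than VMware recommendations. In practice the values would be read from the host-level disk group charts.

```python
from dataclasses import dataclass


@dataclass
class DiskGroupSample:
    """One hypothetical reading of disk group metrics."""
    name: str
    write_buffer_free_pct: float  # percentage of write buffer still free
    congestion: int               # reported congestion value, 0 = none


def flag_disk_groups(samples, min_free_pct=20.0, max_congestion=0):
    """Return (name, reasons) for disk groups that cross a threshold.

    Both thresholds are placeholders chosen for illustration only.
    """
    flagged = []
    for sample in samples:
        reasons = []
        if sample.write_buffer_free_pct < min_free_pct:
            reasons.append("low write buffer free")
        if sample.congestion > max_congestion:
            reasons.append("congestion reported")
        if reasons:
            flagged.append((sample.name, reasons))
    return flagged


samples = [
    DiskGroupSample("dg-1", write_buffer_free_pct=65.0, congestion=0),
    DiskGroupSample("dg-2", write_buffer_free_pct=12.5, congestion=0),
    DiskGroupSample("dg-3", write_buffer_free_pct=40.0, congestion=30),
]

for name, reasons in flag_disk_groups(samples):
    print(f"{name}: {', '.join(reasons)}")
```

A persistently low write buffer free percentage together with congestion would suggest looking at whether the capacity tier is keeping up with destaging, which is exactly the relationship the disk group view is meant to expose.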
Navigating Across the Different Levels of Performance Metrics
Conclusion
vSAN is now being used for the most performant applications (databases, SAP HANA, Epic). As a result, it’s becoming more important to monitor performance. However, monitoring performance for a distributed HCI system is different than for a scale-up SAN for the following reasons:
- An inherent advantage of HCI is that performance can be seamlessly added over time
- Performance is aggregated across nodes and should be measured at both a cluster and host level
VMware provides tools that simplify HCI performance management. As a result, our customers can be fully confident in using vSAN for the highest performance applications.