vSAN Capacity Management and Monitoring Part 2

This article picks up where Part 1 of this series left off. We will discuss the alarms and notifications in vSAN Health and vCenter Server that can be used to keep an eye on vSAN datastore capacity usage.

vSAN Health

As mentioned in the previous article, vSAN Health monitors the capacity of the physical drives that are used for vSAN capacity (not vSAN cache drives). The Disk Capacity health check in the vSAN Health UI shows a green check as long as all of the capacity drives are less than 80% utilized. If a drive’s capacity usage is between 80% and 95%, a yellow warning is displayed. Utilization above 95% results in a red alert. The recommendation follows directly: keep the utilization of all capacity drives below 80%.
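
To make the thresholds concrete, here is a minimal Python sketch of the mapping described above. It is only an illustration, not how vSAN Health is implemented. The function names and drive names are hypothetical, and treating exactly 80% and exactly 95% as yellow is an assumption based on the wording "between 80% and 95%".

```python
def disk_capacity_status(used_percent: float) -> str:
    """Map one capacity drive's utilization to the health colors described above."""
    if not 0 <= used_percent <= 100:
        raise ValueError("used_percent must be between 0 and 100")
    if used_percent < 80:
        return "green"
    if used_percent <= 95:  # assumption: the 80% and 95% boundaries count as yellow
        return "yellow"
    return "red"


def worst_status(drive_usage: dict[str, float]) -> str:
    """Return the most severe status across all capacity drives."""
    severity = {"green": 0, "yellow": 1, "red": 2}
    statuses = [disk_capacity_status(p) for p in drive_usage.values()]
    return max(statuses, key=severity.__getitem__, default="green")


if __name__ == "__main__":
    # Hypothetical drive names and utilization values, for illustration only.
    drives = {"disk-a": 62.0, "disk-b": 83.5, "disk-c": 71.2}
    print(worst_status(drives))
```

A single drive above the 80% mark is enough to change the overall result, which is why the recommendation applies to every capacity drive rather than to the cluster average.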

vSAN features a reactive rebalance mechanism that starts automatically when a physical drive crosses the 80% utilization mark. This rebalancing operation attempts to migrate data from drives that are more than 80% utilized to less-utilized capacity drives in the vSAN cluster. If you are concerned that the traffic generated by rebalancing might impact virtual machine traffic, no worries. vSAN’s Adaptive Resync feature monitors and dynamically adjusts resource utilization to avoid contention between resync traffic and virtual machine traffic. More details can be found in the Adaptive Resync Tech Note.

vSAN Health also produces a notification if the difference in capacity usage between any two capacity drives is greater than 30%. The purpose is to promote a fairly even balance of resource utilization across the cluster. When the maximum variance exceeds 30%, the vSAN Disk Balance health check turns yellow and the option to start a proactive rebalance is presented.
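
The balance check boils down to comparing the most-utilized and least-utilized capacity drives. The sketch below shows that arithmetic in Python. The function names and sample values are hypothetical, and the real health check may compute variance differently, so treat this as an illustration of the 30% rule only.

```python
def max_usage_variance(drive_usage: dict[str, float]) -> float:
    """Largest difference in utilization (percentage points) between any two drives."""
    values = list(drive_usage.values())
    if len(values) < 2:
        return 0.0
    return max(values) - min(values)


def balance_warning(drive_usage: dict[str, float], threshold: float = 30.0) -> bool:
    """True when the variance exceeds the threshold described in the text (30%)."""
    return max_usage_variance(drive_usage) > threshold


if __name__ == "__main__":
    # Hypothetical example: a newly added, nearly empty drive next to older drives.
    drives = {"disk-a": 68.0, "disk-b": 71.0, "disk-new": 5.0}
    print(max_usage_variance(drives), balance_warning(drives))
```

The sample values mirror the common case of adding capacity to a cluster: the new drive starts out nearly empty, so the variance jumps past 30% even though no drive is close to full.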

Note that a proactive rebalance is not started automatically. An administrator must initiate a proactive rebalance either from the vSAN Health UI or by command line. A proactive rebalance can be stopped at any time, if needed. A proactive rebalance is commonly needed after adding capacity to a cluster and when maintenance mode with full data migration is used.

vCenter Server Alarms

There are many vCenter Server alarms for vSAN. You can see the list by clicking a vCenter Server instance in the left column of the vSphere Client, selecting Configure > Alarm Definitions, and applying a “vsan” filter to the Alarm Name list. The list is quite comprehensive. As you might expect, there is a vCenter Server alarm for the vSAN disk capacity health check. Triggered alarms are visible in the vSphere Client so that you do not have to actively monitor the vSAN Health UI.

The Datastore Usage on Disk alarm is enabled by default for vSAN datastores, just like any other datastore. However, the thresholds for this alarm are currently different from the vSAN Health thresholds for capacity. The Datastore Usage on Disk alarm triggers a warning at 75% usage and a critical alarm at 85% usage. Since this alarm is defined at the top level, it is not possible to turn it off just for vSAN datastores. Therefore, you will likely see multiple alarms (vSAN Health and Datastore Usage on Disk) when the vSAN datastore is running low on free space. While this could create a bit of confusion at first, it is good that vSAN and vCenter Server make it well known when vSAN datastore usage is higher than recommended.

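
As a rough illustration of the datastore-level thresholds, the following Python sketch maps a datastore usage percentage to the alarm state described above. The function name and return values are hypothetical labels, and treating the 75% and 85% boundaries as inclusive is an assumption.

```python
def datastore_usage_alarm(used_percent: float) -> str:
    """Map datastore usage to the Datastore Usage on Disk alarm state (75% / 85%)."""
    if not 0 <= used_percent <= 100:
        raise ValueError("used_percent must be between 0 and 100")
    if used_percent >= 85:  # assumption: boundary is inclusive
        return "critical"
    if used_percent >= 75:  # assumption: boundary is inclusive
        return "warning"
    return "normal"


if __name__ == "__main__":
    # At 78% the datastore alarm already warns, although this is still
    # below the 80% mark that vSAN Health applies to individual drives.
    for usage in (70.0, 78.0, 90.0):
        print(usage, datastore_usage_alarm(usage))
```

Note that the two checks measure different things: this alarm looks at the datastore as a whole, while the vSAN Health check evaluates each capacity drive, so the alarms will not necessarily appear at the same moment.
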
Some administrators prefer additional measures beyond the visual cues in the vSphere Client when a datastore is running out of free space. vCenter Server alarm rules include the ability to send an email notification, send an SNMP trap, and/or run a script when an alarm is triggered. An example use case is sending an email message to an operations team’s email distribution list. Another possible solution would be a script that automates the migration of some low-priority VMs from the vSAN datastore to an NFS datastore to free up vSAN capacity until an administrator can take a closer look.
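
For the "run a script" action, a script could start out as small as the Python sketch below, which builds a notification message from the alarm details. It assumes vCenter Server passes alarm details to the script through environment variables named VMWARE_ALARM_NAME, VMWARE_ALARM_TARGET_NAME, and VMWARE_ALARM_NEWSTATUS. Verify those names against the vSphere documentation for your version. The email addresses and the function name are placeholders, and the message is only printed here rather than sent.

```python
import os
from collections.abc import Mapping
from email.message import EmailMessage


def build_alarm_email(env: Mapping[str, str], sender: str, recipient: str) -> EmailMessage:
    """Build a notification email from alarm details found in the environment."""
    # Assumed variable names; confirm them for your vCenter Server version.
    alarm = env.get("VMWARE_ALARM_NAME", "unknown alarm")
    target = env.get("VMWARE_ALARM_TARGET_NAME", "unknown target")
    status = env.get("VMWARE_ALARM_NEWSTATUS", "unknown")

    msg = EmailMessage()
    msg["Subject"] = f"[{status}] {alarm} on {target}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Alarm: {alarm}\n"
        f"Target: {target}\n"
        f"New status: {status}\n"
        "Please review capacity usage on this datastore."
    )
    return msg


if __name__ == "__main__":
    # Placeholder addresses; replace with your operations distribution list.
    message = build_alarm_email(os.environ, "vcenter@example.com", "ops-team@example.com")
    print(message)  # A real script would hand this to smtplib or another notifier.
```

Keeping the message construction separate from the delivery step makes the script easy to test outside of vCenter Server by passing in a plain dictionary instead of the real environment.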

I’ll wrap up this article by pointing out that it is possible to create custom vCenter Server alarms. This would be useful in cases where you prefer different triggers and actions for multiple datastores. As an example, you could turn off the default alarm (Datastore Usage on Disk) at the top level and create new alarm definitions at the datastore level for each datastore. This naturally requires additional configuration, but offers flexibility and precision when needed for specific monitoring and alerting actions. A search for “vSphere Monitoring and Performance” in the VMware vSphere documentation will provide more details on how to use vCenter Server alarms.

I planned to discuss vRealize Operations as well, but this article is already fairly long, so we will save that for the next article.
