vSAN Health Check is a cloud-connected health validation framework. It helps ensure that vSAN is running optimally and configured according to VMware best practices. VMware recommends validating that all health checks are green before deploying production workloads to vSAN. Health checks have varying levels of importance. For example, a vSAN network partition affects the condition of a cluster more than a disk utilization imbalance does. This blog article helps you identify the health checks that are most important.
Health checks are a set of known conditions that reflect an ideal state or expected behavior. An alert is triggered if there is a deviation from the preferred state. These health checks are built into vSAN and enabled by default. There is no need for you to manually configure the health checks and corresponding alerts. Automated configuration and hourly health checks reduce management overhead and help ensure vSAN is running optimally.
The Customer Experience Improvement Program (CEIP) in newer versions of vSAN extends this feature into a cloud-connected framework. This framework enables dynamic updates to the built-in health checks from the VMware Analytics Cloud (VAC). vSAN Health is updated automatically as new issues and recommendations are discovered. Each new release of vSAN includes such feature enhancements, along with improvements to performance and stability.
More information about CEIP is described here: https://www.vmware.com/in/solutions/trustvmware/ceip.html
Health check alerts are caused by:
- Failure conditions
- Hardware incompatibility
- Configuration inconsistency
- Exceeding software/hardware limits
- Performance degradation
Failures should be treated with the highest priority. Any failure of a vSAN software or hardware component directly affects availability and performance. Failure-related alerts typically indicate network-related issues, defunct services, or device failures.
The second critical set of alerts is Hardware Compatibility alerts. It is wrong to assume that hardware compatibility issues do not arise post-deployment. Hardware Compatibility health checks continue to validate the combination of hardware and software after deployment. Incompatible hardware, firmware, and drivers continue to be one of the most common causes of issues reported to VMware Support.
Configuration inconsistency and scalability-related alerts mostly arise from changes made to the environment. vSAN ReadyNodes and appliances such as Dell EMC VxRail are pre-configured to help ensure deployment and operational recommendations are followed. The health checks complement the initial deployment by helping maintain an optimal health state in the environment.
Performance degradation alerts are mostly an outcome of one or more of the above conditions. The alerts can also arise due to over-commitment of resources as you add workloads to a cluster. These health checks help ensure that virtual machines are in a healthy state and performing as desired.
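The priority order described above can be sketched as a small triage helper. This is only an illustration, not part of vSAN or any VMware SDK: the `AlertCategory`, `HealthAlert`, and `triage` names are hypothetical, and the relative order of the last three categories is an assumption, since the article only ranks failures first and hardware compatibility second.

```python
from dataclasses import dataclass
from enum import IntEnum


class AlertCategory(IntEnum):
    """Lower value means handle first.

    FAILURE and HARDWARE_COMPATIBILITY follow the article's ranking.
    The order of the remaining three categories is an assumption.
    """
    FAILURE = 1
    HARDWARE_COMPATIBILITY = 2
    CONFIGURATION_INCONSISTENCY = 3
    LIMITS_EXCEEDED = 4
    PERFORMANCE_DEGRADATION = 5


@dataclass(frozen=True)
class HealthAlert:
    """Hypothetical record for one triggered health check."""
    check_name: str
    category: AlertCategory


def triage(alerts):
    """Return alerts sorted so the most urgent categories come first.

    sorted() is stable, so alerts in the same category keep their
    original relative order.
    """
    return sorted(alerts, key=lambda alert: alert.category)


if __name__ == "__main__":
    triggered = [
        # "Example performance check" is a made-up name for illustration.
        HealthAlert("Example performance check", AlertCategory.PERFORMANCE_DEGRADATION),
        HealthAlert("Controller firmware is VMware certified", AlertCategory.HARDWARE_COMPATIBILITY),
        HealthAlert("vSAN cluster partition", AlertCategory.FAILURE),
    ]
    for alert in triage(triggered):
        print(f"{alert.category.name}: {alert.check_name}")
```

Because performance alerts are often a symptom of the other conditions, working through the list in this order tends to address root causes before their side effects.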
Addressing an alert
You can access the vSAN Health Check UI by navigating to:
[vSAN Cluster] > Monitor > vSAN > Health
Each health check alert is self-explanatory and shows the details of the environment objects that comply with or deviate from the normal state. Wherever applicable, you can initiate a workflow to remediate the issue. The built-in workflows make it easy for you to find additional information about the issue and quickly resolve problems.
The "Info" section further explains in detail the unmet condition and the ideal state. There is an "Ask VMware" button in the "Info" section. You can click on the "Ask VMware" button to access the related Knowledge Base article for additional information.
Environments can be unique, with a variety of software and hardware combinations. Not every alert applies to every environment. In such cases, it may be prudent to silence the inapplicable alerts. For example, an environment may have a compliance requirement to stay with the penultimate release of a product. In that case, you may silence the vSAN build recommendation health check. This helps narrow the focus to alerts that are relevant to the environment. It is strongly recommended to periodically revisit silenced alerts to confirm that they remain inapplicable.
The steps to silence alerts are described in this article: KB-2151813
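One way to keep track of the periodic review recommended above is to record each silenced check with a reason and a date. The sketch below is a hypothetical bookkeeping helper and does not call any vSAN or vCenter API. The `SilencedCheck` and `due_for_review` names and the 90-day default interval are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SilencedCheck:
    """Hypothetical record of a health check that was silenced."""
    check_name: str
    reason: str
    silenced_on: date


def due_for_review(silenced, today, review_interval_days=90):
    """Return silenced checks that are at least review_interval_days old.

    The 90-day default is an arbitrary assumption; choose an interval
    that matches your own change-management policy.
    """
    cutoff = today - timedelta(days=review_interval_days)
    return [check for check in silenced if check.silenced_on <= cutoff]


if __name__ == "__main__":
    records = [
        SilencedCheck(
            "vSAN build recommendation",
            "Compliance requires the penultimate release",
            date(2024, 1, 1),
        ),
    ]
    for check in due_for_review(records, today=date(2024, 6, 1)):
        print(f"Revisit '{check.check_name}': {check.reason}")
```

Recording the reason alongside the date makes it easier to judge later whether the original justification still holds.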
vSAN Health is built into vCenter Server and vSAN. The alerts go beyond the standard failure detection alarms and additionally check for configuration and compatibility issues. Health check alerts that indicate failure conditions or hardware incompatibility should be treated with the highest priority.
Some of the critical health check alerts are listed here:
- vSAN cluster partition
- vSAN: Basic (unicast) connectivity check
- Hardware Compatibility
- SCSI controller is VMware certified
- Controller driver is VMware certified
- Controller firmware is VMware certified
- Physical disk - Operation health
- Data - vSAN object health
Enable CEIP to benefit from the cloud-connected framework and keep alerts up-to-date. You can monitor and remediate many issues directly in the vSphere Client. This helps reduce operational costs and lower resolution times when an issue does occur.