Proactive HA, Health Checks and Alarms for a Healthy vSAN Datastore

Troubleshooting your environment becomes easier when the right tools are in place to detect the likely root cause of an issue and offer guidance on resolving it. vSAN 7 Update 2 introduces several enhancements to help with this. In particular, the new detailed alarms and historical health checks give you a more efficient way to analyze failures. In addition, the new vSAN support for vSphere Proactive High Availability (HA) can help you proactively migrate workloads when a hardware warning is detected.

Let’s take a closer look at each enhancement to better understand how they can help you identify the origin of an issue and restore the healthy state of your cluster.

Early Detection and Action Against Potential Hardware Issues with Proactive HA

Proactive HA allows vCenter Server to detect server alarms coming from the hardware vendor’s plug-ins. A dedicated API communicates the host’s health information from the underlying hardware to vCenter Server and alerts Proactive HA to an imminent issue. Proactive HA can be configured to respond according to the severity of the detected problem, in line with the customer’s needs and the cluster’s specifics.

Proactive HA supported by vSAN migrates not only the VMs but also the VMs’ data. A host at imminent risk can be placed either in maintenance mode or in quarantine mode. Quarantine mode is a host state that is not visible in the UI but results in the host’s resources being used as little as possible. This mode can be helpful when the cluster has a limited number of fault domains and the detected risk has been classified as moderate. Proactively placing the host in maintenance mode is preferable for larger clusters with enough free capacity to accommodate the workload data that previously resided on the host.

For all severe issues, it is recommended to place the host in maintenance mode, and the data evacuation mode can be specified to best suit the cluster’s needs. You can choose from the “No action”, “Ensure accessibility” (default mode) and “Evacuate all data” options. Proactive HA supported by vSAN 7 Update 2 also includes a Mixed mode, which places hosts with severe hardware issues in maintenance mode and hosts with a moderate risk status in quarantine mode. All of these options automatically maintain the workload’s health until the cluster is fully operational again.
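The way the remediation options combine with the severity of a hardware warning can be sketched in a few lines of Python. This is only an illustrative model of the decision described above, not the vSphere API: the names `Severity`, `Policy`, `Remediation`, `EVACUATION_MODES` and `plan_remediation` are hypothetical, and the sketch assumes that the data evacuation mode applies only when a host enters maintenance mode.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Severity(Enum):
    MODERATE = "moderate"
    SEVERE = "severe"


class Remediation(Enum):
    QUARANTINE = "quarantine mode"
    MAINTENANCE = "maintenance mode"


class Policy(Enum):
    # Hypothetical names for the three behaviors described in the text.
    QUARANTINE_ALL = "quarantine"
    MAINTENANCE_ALL = "maintenance"
    MIXED = "mixed"


# Data evacuation options named in the article; "Ensure accessibility" is the default.
EVACUATION_MODES = ("No action", "Ensure accessibility", "Evacuate all data")
DEFAULT_EVACUATION = "Ensure accessibility"


class Plan(NamedTuple):
    remediation: Remediation
    evacuation: Optional[str]  # None when the host is only quarantined (assumption)


def plan_remediation(policy: Policy, severity: Severity,
                     evacuation: str = DEFAULT_EVACUATION) -> Plan:
    """Pick a host state for a hardware warning of the given severity."""
    if evacuation not in EVACUATION_MODES:
        raise ValueError(f"unknown evacuation mode: {evacuation!r}")

    if policy is Policy.QUARANTINE_ALL:
        remediation = Remediation.QUARANTINE
    elif policy is Policy.MAINTENANCE_ALL:
        remediation = Remediation.MAINTENANCE
    elif severity is Severity.SEVERE:
        # Mixed mode: severe issues go to maintenance mode ...
        remediation = Remediation.MAINTENANCE
    else:
        # ... and moderate issues go to quarantine mode.
        remediation = Remediation.QUARANTINE

    if remediation is Remediation.MAINTENANCE:
        return Plan(remediation, evacuation)
    return Plan(remediation, None)
```

For example, `plan_remediation(Policy.MIXED, Severity.MODERATE)` yields a quarantine plan with no evacuation, while the same policy with `Severity.SEVERE` yields maintenance mode with the default “Ensure accessibility” evacuation.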

Decrease the Time to Action with vSAN Alarms

vSAN 7 Update 2 enriches the context of the available vSAN alarms and optimizes their titles, so an admin can significantly reduce the time needed to investigate an issue and take timely action. The alarms are more descriptive and easier to analyze. As a result, services such as vROps that consume these alarms benefit as well. Additionally, an overall alert summary is available, and upon subscription, an admin can receive all vSAN alert updates at once.

Obtain a Broader and Time-based View on Your Cluster’s Health

Health check history is now available under the Skyline health section in vSAN 7 Update 2. All discrete health checks are now visualized in a timeline, so an issue can be traced back and analyzed in detail. This service can be enabled when you need more insight into transient conditions in the vSAN cluster. The health history is especially helpful when the environment experienced a temporary issue but then recovered: the timeline shows the series of events that took place.
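To illustrate why a time-based view helps with transient conditions, the following Python sketch scans a list of health check samples and reports every episode in which a check turned unhealthy and later recovered. It is a simplified model rather than the Skyline health data format: the `Sample` record, the status values "green", "yellow" and "red", and the `transient_issues` function are assumptions made for this example.

```python
from datetime import datetime
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    time: datetime
    check: str    # name of the health check (hypothetical)
    status: str   # assumed values: "green", "yellow", "red"


def transient_issues(history: List[Sample]) -> List[Tuple[str, datetime, datetime]]:
    """Return (check, started, recovered) for each episode that began
    unhealthy and later returned to green."""
    episodes = []
    unhealthy_since = {}  # check name -> time of the first non-green sample
    for sample in sorted(history, key=lambda s: s.time):
        if sample.status != "green":
            # Remember only the start of the current unhealthy episode.
            unhealthy_since.setdefault(sample.check, sample.time)
        elif sample.check in unhealthy_since:
            started = unhealthy_since.pop(sample.check)
            episodes.append((sample.check, started, sample.time))
    return episodes
```

Checks that are still unhealthy at the end of the history are deliberately left out, because they are ongoing issues rather than transient ones.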

Summary

Proactive HA, vSAN alarms, and historical health checks, together with the other operational improvements in vSAN 7 Update 2, broaden the range of tools that help you run your vSAN environment intelligently. As a result, you can spend less time on research and root cause tracking.