vSAN Failure Scenarios

vSAN provides both enterprise-class scale and performance suitable for businesses of all types and sizes. When designing a vSAN cluster, there are several things to consider, such as hardware, networking, and vSAN architecture. A good place to start is the vSAN Design and Sizing Guide, but it is also important to understand how vSAN responds to various failure scenarios when making design decisions. This post identifies some of the more common failure scenarios, how vSAN responds to each, and what the overall impact is on the virtual machine.

Failure Scenarios

With most storage systems, failures are typically identified as temporary, permanent, or unknown. vSAN categorizes failures as either “absent”, also known as All Paths Down (APD), or “degraded”, also known as Permanent Device Loss (PDL).

A degraded state is when a device is known to have failed in a way that makes it unlikely to return to a healthy state. In this case, the rebuild begins immediately. Examples include a drive that is experiencing write failures or a controller that is reporting a failure.

Not all device failures are permanent. In fact, it is more common for a device to go missing temporarily and then return. An absent state is when a device loses connectivity and vSAN is unable to determine whether it will return. Rebuilds for absent components are delayed by 60 minutes by default to see whether the failure is transitory. This avoids unnecessary rebuilds that could impact cluster-wide performance or lengthen the time it takes to return to a healthy state. Examples include a host restarting, crashing, or losing power or network connectivity, or a drive becoming disconnected. For cases where an administrator needs to adjust how long vSAN waits before it begins rebuilding data to reestablish compliance with storage policies, an “object repair timer delay” setting is available in the UI as of vSAN 6.7 U1.
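The difference between the two states can be sketched as a small decision function. This is only an illustrative model of the behavior described above, not vSAN code: the names `FailureState`, `rebuild_delay_minutes`, and `should_rebuild` are hypothetical, and the only value taken from the text is the 60-minute default delay.

```python
from enum import Enum

# Default repair delay for absent components, as stated in the text.
DEFAULT_REPAIR_DELAY_MINUTES = 60


class FailureState(Enum):
    """Hypothetical labels for the two failure categories described above."""
    DEGRADED = "degraded"  # device is unlikely to return (PDL)
    ABSENT = "absent"      # device lost connectivity and may return (APD)


def rebuild_delay_minutes(state, repair_delay_minutes=DEFAULT_REPAIR_DELAY_MINUTES):
    """Return how long to wait before rebuilding, in minutes."""
    if state is FailureState.DEGRADED:
        return 0  # degraded: rebuild begins immediately
    return repair_delay_minutes  # absent: wait out the repair timer


def should_rebuild(state, minutes_since_failure,
                   repair_delay_minutes=DEFAULT_REPAIR_DELAY_MINUTES):
    """True once the applicable delay has elapsed for this failure state."""
    return minutes_since_failure >= rebuild_delay_minutes(state, repair_delay_minutes)
```

For example, `should_rebuild(FailureState.ABSENT, 30)` is false under the default delay, while the same call with `repair_delay_minutes=15` is true, which mirrors the effect of lowering the object repair timer.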

Scenario	vSAN Behavior	Impact/Observed VMware HA Behavior
Cache disk failure	Disk Group is marked as failed and all components present on it will rebuild on another Disk Group.	VM will continue running.
Capacity disk failure (Dedupe and Compression ON)	Disk Group is marked as failed and all components present on it will rebuild on another Disk Group.	VM will continue running.
Capacity disk failure (Dedupe and Compression OFF)	Disk marked as failed and all components present on it will rebuild on another disk.	VM will continue running.
Capacity disk failure (Compression-Only ON)	Disk marked as failed and all components present on it will rebuild on another disk.	VM will continue running.
Disk Group failure/offline	All components present on the Disk Group will rebuild on another Disk Group.	VM will continue running.
RAID/HBA card failure	All Disk Groups backed by the HBA/RAID card will be marked absent and all components present will rebuild on other Disk Groups.	VM will continue running.
Host failure	Components present on the host will be marked as absent by vSAN – component rebuilds will be kicked off after 60 minutes if the host does not come back up.	VM will continue running if on another host. If the VM was running on the same host as the failure an HA restart of the VM will take place.
Host isolation	Components present on the host will be marked as absent by vSAN – component rebuilds will be kicked off after 60 minutes if the host does not come back online.	VM will continue running if on another host. If the VM was running on the same host as the failure an HA restart of the VM will take place.
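The disk-level rows of the table follow a simple rule: the failure takes out either a single disk or the whole Disk Group, depending on the device's role and the space-efficiency setting. A minimal sketch of that rule, using hypothetical names (`SpaceEfficiency`, `failure_scope`) that are not part of any vSAN API:

```python
from enum import Enum


class SpaceEfficiency(Enum):
    """Hypothetical labels for the space-efficiency modes listed in the table."""
    OFF = "off"
    COMPRESSION_ONLY = "compression-only"
    DEDUPE_AND_COMPRESSION = "dedupe-and-compression"


def failure_scope(device_role, space_efficiency):
    """Return what is marked as failed when one disk fails, per the table above.

    device_role is "cache" or "capacity" (illustrative strings, not vSAN terms
    taken from any API).
    """
    if device_role == "cache":
        # A cache disk failure fails the whole Disk Group.
        return "disk group"
    if device_role == "capacity":
        if space_efficiency is SpaceEfficiency.DEDUPE_AND_COMPRESSION:
            # With Dedupe and Compression ON, the whole Disk Group is marked failed.
            return "disk group"
        # With Dedupe and Compression OFF, or Compression-Only ON, only the disk fails.
        return "disk"
    raise ValueError(f"unknown device role: {device_role!r}")
```

In every one of these cases the affected components rebuild elsewhere and the VM continues running; only the amount of data that has to be rebuilt differs.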

The following animation shows an ESXi host that has been absent for more than 60 minutes. vSAN rebuilds the components on another available host. When the absent host returns, vSAN discards the stale components.

Conclusion

vSAN is a highly resilient storage platform that intelligently manages the performance, efficiency, and availability of all data stored on a cluster. Since VMware vCenter is used as a common control and management plane for a vSphere cluster, questions may arise about how a vSAN cluster reacts when a vCenter Server must be rebuilt from a new installation or restored from a backup. For more information on this topic, see Replacing a vCenter server for existing vSAN hosts.

Below are a few key resources that will aid in the design of your vSAN cluster.

@vPedroArrow
