VMware Virtual SAN is a highly resilient storage platform. Part of this resiliency comes from its distributed architecture, which lets rebuilds leverage the full performance of the cluster and complete quickly.
One interesting piece is how it intelligently handles different failure conditions. In some cases VSAN will delay the start of a rebuild to reduce the total size and time to completion of the rebuild.
Permanent Device Loss (PDL) errors occur when a device is known to have failed in such a way that it is unlikely to return to a healthy state. In this case the rebuild begins immediately. Examples include a drive experiencing write failures or a controller reporting failure.
An All Paths Down (APD) failure is when a device loses connectivity and VSAN is unable to determine if it will return. Rebuilds for these failures are delayed by 1 hour by default to determine whether the failure is transitory. The delay avoids unnecessary rebuilds that could impact cluster-wide performance or lengthen the time to return to a healthy state. The VSAN.ClomdRepairDelay advanced setting controls this timeout, but it is recommended to leave it at the 1 hour default.
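The two failure classes can be summarized as a small decision model. This is only an illustrative sketch of the behavior described above, not VSAN code: the names `FailureKind`, `rebuild_starts_after`, and `needs_full_rebuild` are hypothetical, and treating a return at exactly the delay boundary as "too late" is an assumption.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    PDL = "pdl"  # device known to have failed; not expected to return
    APD = "apd"  # device unreachable; it may or may not return


# The 1 hour default described in the text, expressed in minutes.
DEFAULT_REPAIR_DELAY_MINUTES = 60


def rebuild_starts_after(kind: FailureKind,
                         repair_delay_minutes: int = DEFAULT_REPAIR_DELAY_MINUTES) -> int:
    """Minutes between the failure and the start of a full rebuild."""
    if kind is FailureKind.PDL:
        return 0  # PDL: rebuild begins immediately
    return repair_delay_minutes  # APD: wait to see if the failure is transitory


def needs_full_rebuild(kind: FailureKind,
                       minutes_until_return: Optional[int],
                       repair_delay_minutes: int = DEFAULT_REPAIR_DELAY_MINUTES) -> bool:
    """True if a full rebuild is triggered, False if a re-sync is enough.

    minutes_until_return is None when the device never comes back.
    """
    if kind is FailureKind.PDL or minutes_until_return is None:
        return True
    # Assumption: returning exactly at the deadline counts as too late.
    return minutes_until_return >= repair_delay_minutes


if __name__ == "__main__":
    print(rebuild_starts_after(FailureKind.PDL))
    print(rebuild_starts_after(FailureKind.APD))
    print(needs_full_rebuild(FailureKind.APD, minutes_until_return=10))
```

In this model, a host that reboots and returns after 10 minutes only needs a re-sync, while a device that never returns is rebuilt once the delay expires.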
Some examples of APD failures include:
- Host related:
  - Loss of network to a host.
  - Loss of power to a host.
  - Host restarting or crashing.
  - Host left in maintenance mode.
- Drive related:
  - Drive becoming disconnected.
  - HBA becoming disconnected or restarting.
In all of these cases there is a clear chance that the issue will resolve itself or be resolved by an administrator. When a component returns from APD within 1 hour, only a re-sync of the data that was written to the other components in the meantime is required. A host that is forcibly restarted (either from a crash or an operational error) should come back online within this 1 hour period. Changes to objects with reduced redundancy are tracked from the moment the failure occurs, so the repair consists of a re-sync of those changes and would likely take minutes rather than the potential hours that a full host rebuild could take. Assuming a host's data change rate is only 2% per hour, a host that returns within the hour needs roughly 2% of its data re-synced instead of 100%, which is 50x less data transfer. A full rebuild triggered by a switch restarting, or by someone accidentally unplugging the wrong device, is what the APD timeout seeks to avoid.
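The 50x figure follows from simple arithmetic, sketched below. The function names are hypothetical, and the model assumes changes accumulate linearly at the stated rate and that a full rebuild moves 100% of the host's data; real workloads will vary.

```python
def resync_fraction(change_rate_per_hour: float, outage_hours: float) -> float:
    """Fraction of a host's data that changed during the outage, capped at 1.0."""
    if change_rate_per_hour < 0 or outage_hours < 0:
        raise ValueError("rate and duration must be non-negative")
    return min(1.0, change_rate_per_hour * outage_hours)


def transfer_reduction(change_rate_per_hour: float, outage_hours: float) -> float:
    """How many times less data a re-sync moves compared with a full rebuild."""
    fraction = resync_fraction(change_rate_per_hour, outage_hours)
    if fraction == 0:
        raise ValueError("nothing changed, so there is nothing to re-sync")
    return 1.0 / fraction


if __name__ == "__main__":
    # 2% change per hour and a 1 hour outage, as in the text.
    print(transfer_reduction(change_rate_per_hour=0.02, outage_hours=1.0))
    # A shorter outage shrinks the re-sync further.
    print(transfer_reduction(change_rate_per_hour=0.02, outage_hours=0.25))
```

With the text's assumptions the re-sync covers about 2% of the data, which is the 50x reduction; a 15 minute outage at the same rate would move even less.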
One case where you may want to change VSAN.ClomdRepairDelay is testing failure conditions as part of a proof of concept. In this case you will need to change the setting on all hosts in the cluster and restart clomd on each host. For more information on testing hardware failures, see section 12 of the VSAN Proof of Concept Guide.
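For a proof of concept, the per-host steps could be planned with a small helper like the one below. It only builds command strings and does not connect to anything. Several things here are assumptions: the host names are hypothetical, the value is taken to be in minutes, the option path is derived from the setting name exactly as written in this article, and the commands follow the usual `esxcfg-advcfg -s <value> <path>` and `/etc/init.d/clomd restart` forms. Confirm the exact option spelling on your build (for example with `esxcfg-advcfg -g <path>`) before running anything.

```python
from typing import Dict, Iterable, List


def advanced_option_path(setting_name: str) -> str:
    """Convert a dotted setting name such as 'A.B' into the '/A/B' path form."""
    return "/" + setting_name.replace(".", "/")


def repair_delay_commands(hosts: Iterable[str],
                          setting_name: str,
                          minutes: int) -> Dict[str, List[str]]:
    """Build the shell commands to run on each host (assumed unit: minutes)."""
    if minutes < 0:
        raise ValueError("delay must be non-negative")
    option = advanced_option_path(setting_name)
    per_host = [
        f"esxcfg-advcfg -s {minutes} {option}",  # set the advanced option
        "/etc/init.d/clomd restart",             # restart clomd to pick it up
    ]
    # Every host in the cluster gets the same two steps.
    return {host: list(per_host) for host in hosts}


if __name__ == "__main__":
    # Hypothetical host names; the setting name is copied from the article.
    plan = repair_delay_commands(["esx01", "esx02", "esx03"],
                                 "VSAN.ClomdRepairDelay", minutes=15)
    for host, commands in plan.items():
        for command in commands:
            print(f"{host}: {command}")
```

Generating the plan for every host at once makes it harder to miss a host, which matters because the setting has to match across the cluster. Remember to restore the default once testing is finished.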