vSAN Degraded Device Handling

Introduction

While this blog article focuses more on availability, performance is certainly worth mentioning. A poorly performing application or platform can be the equivalent of being offline. For example, excessive latency (network, drive, etc.) can cause a database query to take much longer than normal. If an end-user expects query results in 30 seconds and suddenly it takes 10 minutes, it is likely the end-user will stop using the application and report the issue to IT – same result as the database being completely offline.

A cache or capacity device that is constantly producing errors and/or high latencies can have a similar negative effect on any HCI platform, and the impact can extend to multiple workloads in the cluster. Prior to vSAN 6.1, a badly behaving drive caused issues in a few cases, which led to the introduction of another vSAN availability feature: Degraded Device Handling, or simply “DDH”.

vSAN 6.1 and newer versions monitor drives for issues such as excessive latency and errors. These symptoms can be indicative of an imminent drive failure. Monitoring these conditions enables vSAN to be proactive in correcting conditions that negatively affect performance and resilience. Depending on the version of vSAN you are running, you might see varying responses to drives that are behaving badly.
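Incidentally, many of the same symptoms can be checked directly from the ESXi shell. Here is a minimal sketch using standard esxcli commands; the device identifier is a placeholder, and the exact fields returned vary by drive model and ESXi build:

# List the devices vSAN has claimed, including their cache/capacity roles
esxcli vsan storage list

# Query the SMART error and wear counters for a specific device (placeholder ID)
esxcli storage core device smart get -d naa.xxxxxxxxxxxxxxxx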

DDH in vSAN 6.1

vSAN 6.1 looks for a sustained period of high read and/or write latencies (greater than 50 ms). If the condition persists for longer than 10 minutes, vSAN will issue an alarm and unmount the drive. As you can imagine, this can impact several objects on a vSAN datastore. If the drive is a cache device, this forces the entire disk group offline. Fortunately, cache device failure has become much less common with recent advancements in flash device technology. Here is an example of this happening, as shown in a log file:

2015-09-15T02:21:27.270Z cpu8:89341)VSAN Device Monitor: WARNING - READ Average Latency on VSAN device naa.6842b2b006600b001a6b7e5a0582e09a has exceeded threshold value 50 ms 1 times.
2015-09-15T02:21:27.570Z cpu5:89352)VSAN Device Monitor: Unmounting VSAN diskgroup naa.6842b2b006600b001a6b7e5a0582e09a

Components on a disk group in this state are marked “Absent.” Rebuild of these components on other healthy drives will begin after a 60-minute rebuild timer (the VSAN.ClomRepairDelay advanced setting in ESXi) has expired. If an object is not protected by either RAID-1 mirroring or RAID-5/6 erasure coding and it has a component on the unmounted drive, that object will become inaccessible. The figure below shows a virtual disk with an absent component. The virtual disk object is protected by a vSAN storage policy with RAID-1 mirroring. Since the other mirror copy and the witness component are online, the object remains accessible even though there was a physical drive failure.
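For reference, the rebuild timer can be viewed from the ESXi shell, as sketched below with the standard esxcfg-advcfg utility. Treat the value change as an example only: if you adjust it, apply the same value on every host in the cluster, and note that the clomd service must be restarted for a new value to take effect.

# Display the current repair delay in minutes (60 by default)
esxcfg-advcfg -g /VSAN/ClomRepairDelay

# Example: raise the delay to 90 minutes, then restart clomd
esxcfg-advcfg -s 90 /VSAN/ClomRepairDelay
/etc/init.d/clomd restart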

Taking a drive or entire disk group offline can be somewhat disruptive and sometimes requires rebuilding data. This is something that all storage platforms avoid unless absolutely necessary. With vSAN 6.1, the criteria (errors and/or latency for 10 minutes) might not be as selective as they should be. There are cases where the issue is transient. A drive might produce high latencies for 15 minutes and then return to normal performance levels. We want to avoid initiating the movement of large amounts of data in cases like this, which prompted some changes in vSAN 6.2.

DDH in vSAN 6.2

vSAN 6.2 includes four enhancements to improve the reliability and effectiveness of DDH:

1. DDH will not unmount a vSAN caching or capacity drive due to excessive read IO latency. Only write IO latency will trigger an unmount. Taking a drive offline and evacuating all of the data from that drive is usually more disruptive than a sustained period of read IO latency. This change was made to reduce the occurrence of “false positives” where read latency rises beyond the trigger threshold temporarily and returns to normal.

2. By default, DDH will not unmount a caching tier device due to excessive write IO latency. As discussed above, taking a cache device offline causes the unmount of the cache and all capacity devices in the disk group. In most cases, excessive write IO latency at the cache tier will be less disruptive than taking an entire disk group offline. DDH will only unmount a vSAN drive with excessive write IO latency if the device is serving as a capacity device. This global setting (it affects all vSAN drives) can be overridden with the following ESXi command:

esxcfg-advcfg -s 1 /LSOM/lsomSlowTier1DeviceUnmount

Running the command above instructs vSAN to unmount a caching tier device with excessive write IO latency. (Commands to verify or revert the setting are shown after this list.)

3. DDH tracks excessive latency over multiple, randomly selected 10-minute intervals instead of using a single 10-minute interval. This improves the accuracy and reliability of DDH and reduces the occurrence of false positives. Transient elevations in IO from activities such as vSAN component recovery, sector remapping for HDDs, and garbage collection for SSDs should no longer trigger DDH false positives. To further improve accuracy, latency must exceed the threshold for four non-consecutive 10-minute intervals randomly spread over a six- to seven-hour period.

4. DDH attempts to re-mount vSAN drives in a failed state, as well as drives previously unmounted by DDH. DDH will attempt to re-mount a drive approximately 24 times over a 24-hour period. A re-mount attempt will fail if the condition that caused the drive to go offline is still present.
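As referenced in the second item above, the current value of the unmount setting can be checked, and the default restored, with the same utility. A quick sketch:

# Display the current value (0 is the default: do not unmount cache devices for write latency)
esxcfg-advcfg -g /LSOM/lsomSlowTier1DeviceUnmount

# Revert to the default behavior
esxcfg-advcfg -s 0 /LSOM/lsomSlowTier1DeviceUnmount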

DDH in vSAN 6.6 and Higher

Enhancements to DDH in this version of vSAN build on the previous release. By default, only capacity tier drives are marked degraded. A drive is considered degraded when its average write IO round-trip latency exceeds a pre-determined threshold for four or more latency intervals distributed randomly within approximately a six-hour period. The magnetic drive (HDD) latency threshold is 500 milliseconds for write IO. The flash device (SSD) latency threshold is 50 milliseconds for read IO and 200 milliseconds for write IO.
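How vSAN currently views each claimed device can be checked from the ESXi shell. The command below is available in vSAN 6.6 and later; the exact output fields vary by build:

# Show per-device state and health as vSAN sees it
esxcli vsan debug disk list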

When a device is considered degraded, vSAN will shut down as much IO as possible to the degraded device by marking components on the drive “absent”, unless this would cause an object to become inaccessible. In other words, if components on a degraded device belong to the last available copy of an object, those components will not be marked “absent”. Instead, vSAN will immediately start migrating these active components, where possible, to avoid the object becoming inaccessible. This is in contrast to waiting for the vSAN CLOM rebuild timer (60 minutes by default) to expire before rebuilding copies of “absent” components. There are essentially four degraded device states, each with different actions:

1. Preventative evacuation in progress. A yellow health alert is raised so that administrators know there is an issue. vSAN is proactively compensating for the degraded device by migrating all active components from the degraded drive. No administrator action is required.

2. Preventative evacuation is incomplete due to lack of resources, i.e., a partial evacuation of active components. A red health alert is raised to signify a more serious issue. An administrator will need to either free up existing resources, e.g., deleting unused VMs, or add resources so that vSAN can complete the evacuation. This scenario might occur when there is relatively little free capacity remaining in the cluster – yet another reason we strongly recommend keeping 25-30% free “slack space” capacity in the cluster.

3. Preventative evacuation is incomplete due to inaccessible objects. The remaining components on the drive belong to inaccessible objects. An administrator should make more resources available in an attempt to make the objects accessible; object health can be checked from the ESXi shell, as shown after this list. The other option is to remove the drive from the vSAN configuration by choosing “no data migration” when the drive is decommissioned.

4. Evacuation complete. As you can imagine, this is the most desirable state for a drive that is in a degraded condition. All components have been migrated from the drive and all objects are accessible. It is safe to remove the drive from the vSAN configuration and replace it when convenient to do so.
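For any of the states above, object accessibility and evacuation progress can be checked with the esxcli vsan debug namespace introduced in vSAN 6.6. A sketch; the health categories reported vary slightly by version:

# Summarize object health (healthy, reduced availability, inaccessible, and so on)
esxcli vsan debug object health summary get

# Show the progress of component resyncs and evacuations
esxcli vsan debug resync summary get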

Summary

Keep in mind there is no fool-proof way to guarantee 100% accuracy in predicting the failure of a drive. In some cases, a drive will fail without any warning. That is why vSAN provides the flexibility to specify, on a per-object basis, tolerance of one, two, or three drive or host failures. Configuring these levels of resiliency is done simply through the use of vSAN storage policies. Predicting device failure and proactively migrating data from a degraded device further enhance the resilience of a vSAN datastore.
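As a closing aside, the failure tolerance applied by default to vSAN objects can be viewed from the ESXi shell; hostFailuresToTolerate is the policy attribute in question. Policies are normally created and assigned centrally through vCenter, so this is just a quick way to peek at the defaults:

# Show the default vSAN policy for each object class
esxcli vsan policy getdefault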