I was recently helping a customer design a highly resilient Virtual SAN deployment for tier 1 applications. One thing that became clear was that the traditional rules for designing Tier 1 modular storage arrays were not always applicable.

Rule of scaling storage risk: consolidating more workloads onto a storage system increases efficiency, but it also increases the size of the impact when multiple component failures bring down all workloads.

With traditional arrays, the risk of two controllers failing (or one failing during an upgrade), or of two disks failing inside a RAID 5 group, scaled as you added more drives to a system. These systems scaled performance by adding more drives, and they eventually reached a size at which customers did not want to consolidate further workloads: the time required for recovery, and the scale of the potential impact, outweighed the benefits of scaling up further.

Scale-out systems (often using a RAID 10-like design) carried a similar growing risk exposure as they grew in size. It was not uncommon to have hundreds of capacity devices servicing IO for a single LUN, where the wrong two devices failing would bring down all workloads. Even using 3 fault domains, they still faced a 33% chance of catastrophic loss of all data in the cluster should the wrong two devices fail. This was deemed unacceptable, and they were looking for a new solution as their existing platform's radius of impact grew along with its size. More fault domains could be added, but even expanding to 5 fault domains still left a 20% chance of two failures landing in the same zone.
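The fault-domain percentages above follow from a simple model. As a minimal sketch (my assumption, not spelled out in the original): with devices spread evenly across the fault domains, a second simultaneous failure lands in the same zone as the first with probability roughly one over the number of domains.

```python
# Sketch: chance that a second simultaneous failure lands in the same
# fault domain as the first, assuming devices are spread evenly across
# `domains` fault domains (many devices per domain).
def same_zone_chance(domains: int) -> float:
    # The second failure falls in the first failure's domain with
    # probability ~1/domains.
    return 1.0 / domains

print(f"3 fault domains: {same_zone_chance(3):.0%}")  # ~33%
print(f"5 fault domains: {same_zone_chance(5):.0%}")  # ~20%
```

This matches the 33% (3 domains) and 20% (5 domains) figures cited above, and shows why adding fault domains shrinks, but never eliminates, the exposure.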

Reviewing Failure Scenarios

Virtual SAN is different in this case in that it primarily derives its performance from flash and memory storage devices. While striping can be used, it is not leveraged by default. This has a unique consequence: the odds of a virtual machine being impacted by compounding failures fall as a cluster grows. Assume a 100GB virtual machine is protected by a failures-to-tolerate policy of one host (FTT=1). With this policy and RAID 1 mirroring, there are 3 components on different hosts (2 copies of the data and a witness). For the virtual machine to remain available, two of the 3 components must remain available. In a 4-host cluster, the first failure has a 3/4 (75%) chance of hitting a component, and a second subsequent failure has a 2/3 (~67%) chance. Multiplying these odds gives a ~50% chance of the virtual machine being impacted. Scaling this math out has interesting implications, though, as the chance of either failure impacting a given virtual machine becomes smaller and smaller.
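The arithmetic above can be sketched as a short calculation. This assumes the simple model described in the text: 3 components on 3 distinct hosts, and two simultaneous host failures drawn uniformly from the cluster.

```python
# Chance that two simultaneous host failures impact a RAID 1, FTT=1
# object (3 components: 2 data copies + 1 witness, on distinct hosts).
def raid1_ftt1_impact_chance(hosts: int) -> float:
    # First failed host carries a component: 3/hosts.
    # Second failed host carries one of the 2 remaining: 2/(hosts - 1).
    return (3 / hosts) * (2 / (hosts - 1))

for n in (4, 8, 16, 32, 64):
    print(f"{n:>3} hosts: {raid1_ftt1_impact_chance(n):.1%}")
# The 4-host case reproduces the ~50% figure from the text.
```

Doubling the cluster size roughly quarters the exposure of any one object, which is the scaling behavior the rest of this section explores.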


FTT=1 with Two Hosts Failing

Looking at these odds for RAID 5 as well, we notice a trend: as the cluster grows, the chance of a given object being impacted by multiple simultaneous failures drops rapidly.
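The same model extends to RAID 5. Virtual SAN's RAID 5 erasure coding places 4 components (3 data + 1 parity) on 4 distinct hosts, so with FTT=1 the object goes offline only if two simultaneous failures both hit component hosts; the cluster sizes shown here are my own illustrative picks.

```python
# Chance that two simultaneous host failures impact a RAID 5, FTT=1
# object (4 components on 4 distinct hosts; losing any 2 is fatal).
def raid5_ftt1_impact_chance(hosts: int) -> float:
    return (4 / hosts) * (3 / (hosts - 1))

for n in (6, 12, 24, 48):
    print(f"{n:>3} hosts: {raid5_ftt1_impact_chance(n):.1%}")
```

RAID 5 starts with slightly higher exposure than RAID 1 at the same cluster size (4 component hosts rather than 3), but it falls off just as quickly as hosts are added.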

Notice that FTT=2 policies (such as RAID 6) can survive two host failures without any chance of impact to a given object.

Another factor reviewed was that as a Virtual SAN cluster grows, the speed of its rebuilds also increases, since more disks and hosts can participate in the rebuild. These scenarios assume simultaneous failures. Cascading failures spaced several hours apart could allow the cluster to heal before the next failure, thanks to the increased speed at which Virtual SAN can rebuild.