This is part 2 of the series. Click here for the first part.
FTT=2 With Three Hosts Failing
For our next example we wanted to look at the impact of losing 3 hosts simultaneously while using FTT=2. As FTT=2 policies require larger clusters by default (6 hosts for RAID 6, 7 hosts for RAID 1 protection), we noticed the trend accelerated from the beginning. Even in smaller clusters, the number of virtual machines impacted quickly becomes small. Upon reviewing the data, it was decided that rather than deploying smaller clusters (8 hosts) they would take advantage of Virtual SAN's scaling capabilities, deploy 24 host clusters instead, and benefit from better CPU and storage efficiency. As each cluster could deliver millions of IOPS, they chose to split their application across clusters.
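To make the trend concrete, here is a rough back-of-the-envelope sketch in Python. It is not how Virtual SAN itself computes anything. It assumes a simplified model in which each object places its components on `width` distinct hosts chosen uniformly at random, the failed hosts are also random, and the object becomes unavailable once more than `ftt` of its hosts are lost. The function name `fraction_impacted` and the choice of `width=5` (three replicas plus two witnesses for FTT=2 with RAID 1) and `width=6` (RAID 6) are assumptions for illustration only.

```python
from math import comb


def fraction_impacted(hosts: int, failed: int, width: int, ftt: int) -> float:
    """Estimated share of objects that become unavailable.

    Simplified model (an assumption, not Virtual SAN's placement logic):
    every object sits on `width` distinct hosts picked at random, and it is
    lost when more than `ftt` of those hosts are among the `failed` hosts.
    """
    if not (0 <= failed <= hosts and 0 < width <= hosts):
        raise ValueError("need 0 <= failed <= hosts and 0 < width <= hosts")
    total = comb(hosts, failed)  # all possible sets of failed hosts
    # Count failure sets that hit k of the object's hosts, for every k > ftt.
    bad = sum(
        comb(width, k) * comb(hosts - width, failed - k)
        for k in range(ftt + 1, min(width, failed) + 1)
    )
    return bad / total


if __name__ == "__main__":
    for n in (8, 12, 16, 24, 32, 64):
        raid1 = fraction_impacted(n, failed=3, width=5, ftt=2)
        raid6 = fraction_impacted(n, failed=3, width=6, ftt=2)
        print(f"{n:>2} hosts: RAID 1 {raid1:.2%}  RAID 6 {raid6:.2%}")
```

Under these assumptions, three random host failures affect roughly 17.9% of FTT=2 RAID 1 objects in an 8 host cluster and about 0.49% in a 24 host cluster. These are results of the model, not measured data, and they ignore striping and correlated failures.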
It should be noted that increasing the stripe width in the policy, or using VMDKs over 255GB (which are striped automatically), splits an object into more components. This does marginally increase the impact, but not by much as clusters scale larger and larger, because a stripe will not necessarily avoid hosts that already store the object's data, only the storage devices the object already uses.
When two or three hosts fail at the same time, the most likely reason is the failure of an external dependency. Examples include a top-of-rack switch stack, a power distribution unit, or a cooling system failing.
Fault domains allow a Virtual SAN administrator to choose where data copies will be placed, spreading copies of data across multiple racks, rooms, floors, or pods within a data center. This provides a further multiplier in reducing the risk of impact from compounded failures. I would strongly recommend reading Cormac's post here on fault domains.
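The effect of fault domains on a rack-level failure can be illustrated with a small sketch. This is only a toy model: the host names, the rack layout (6 racks of 4 hosts), the two example placements, and the helper functions are all hypothetical, and every component is treated as equal rather than modelling Virtual SAN's actual voting rules.

```python
# Hypothetical 24 host cluster: esx00..esx23 spread over rack0..rack5.
host_to_rack = {f"esx{i:02d}": f"rack{i // 4}" for i in range(24)}


def components_lost(placement, racks, failed_rack):
    """Number of an object's components sitting in the failed rack."""
    return sum(1 for host in placement if racks[host] == failed_rack)


def survives(placement, racks, failed_rack, ftt):
    """Toy rule: the object stays available if it loses at most `ftt` components."""
    return components_lost(placement, racks, failed_rack) <= ftt


# Without fault domains, nothing stops three components landing in one rack.
rack_unaware = ["esx00", "esx01", "esx02", "esx08", "esx12"]
# With one fault domain per rack, each component lands in a different rack.
rack_aware = ["esx00", "esx04", "esx08", "esx12", "esx16"]

if __name__ == "__main__":
    for name, placement in (("unaware", rack_unaware), ("aware", rack_aware)):
        ok = survives(placement, host_to_rack, "rack0", ftt=2)
        print(f"rack-{name} placement survives loss of rack0: {ok}")
```

In this toy layout the rack-aware placement loses a single component when `rack0` fails, so an FTT=2 object could still absorb further failures, while the unaware placement loses three components at once.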
Native 5 minute RPO replication, fault domain awareness, and stretched clustering add to these capabilities and demonstrate that Virtual SAN was purposely designed to be the most resilient storage solution for your virtual storage needs.
I have had a lot of customer discussions recently where Virtual SAN is being deployed in situations where lives depend on it. It is increasingly being chosen over legacy storage for the extensive failure handling capabilities it offers. If multiple host failures are a concern, Virtual SAN offers large (and small) clusters for designing highly resilient systems that can survive multiple cascading failures.