Today we released a very important KB article, KB 2135952, for VSAN stretched cluster. It explains how to handle a split-brain/network partition scenario on a VSAN stretched cluster.
The reason we released the KB was as follows:
In the event of a split-brain scenario in a VSAN stretched cluster, where the two data sites can no longer communicate, VSAN stretched cluster has the concept of a “preferred” site. This means that the “preferred” site and the “witness” form a quorum, and the non-preferred or secondary site is isolated. Now, vSphere HA will restart the VMs that were on the secondary/non-preferred site on the “preferred” site. All VMs now run on the preferred site.
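The quorum behaviour described above can be sketched as a simple majority vote across the three sites. This is only an illustrative model, not how VSAN is implemented: the one-vote-per-site weighting and the names `SITES`, `DATA_SITES`, `has_quorum` and `isolated_data_sites` are all assumptions made for this example.

```python
# Simplified model: each site (fault domain) holds one vote.
# All names here are hypothetical and chosen for illustration only.
SITES = frozenset({"preferred", "secondary", "witness"})
DATA_SITES = frozenset({"preferred", "secondary"})


def has_quorum(partition):
    """A partition has quorum if it holds more than half of all votes."""
    return len(set(partition) & SITES) * 2 > len(SITES)


def isolated_data_sites(partitions):
    """Return the data sites that are not part of a quorum partition."""
    with_quorum = set()
    for partition in partitions:
        if has_quorum(partition):
            with_quorum = set(partition)
            break
    return set(DATA_SITES) - with_quorum


# Split-brain: the witness sides with the preferred site.
split_brain = [{"preferred", "witness"}, {"secondary"}]
print(isolated_data_sites(split_brain))
```

In this model the secondary site is the only data site left without quorum, which matches the scenario above: its VMs lose access to the datastore while vSphere HA restarts them on the preferred site.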
The downside is that, because the non-preferred/secondary site no longer has access to the witness, the VMs on the secondary site can no longer access the underlying VSAN datastore. Thus, they cannot be powered down, and remain running in a “ghost” state. This is how vSphere Metro Storage Cluster (vMSC) used to work until support for VM Component Protection was introduced. The KB guides you through the steps to get rid of the “ghost” VMs on the secondary site after a failure of this nature, and even provides a script to allow some automation of the process.
This is a temporary workaround. VMware is working on more permanent solutions which will remove the need for it.