By Duncan Epping, Principal Architect.
I received a question on Twitter last week about HA split brain scenarios. Let me first give an example of when a split brain scenario could occur:
- Isolation response = leave powered on
- iSCSI / NFS storage
When both of the above conditions are met and a host in your cluster is fully network isolated, HA will be able to restart the virtual machines, as it will appear to HA as if the host has completely failed. The reasons for this are:
- There will be no network heartbeats coming from this host
- There will be no datastore heartbeats
- File locks on virtual machines are released by the "isolated host" as it cannot access storage any longer
- The management address of this host cannot be pinged
- The "isolated host" cannot write to the datastore to inform the master it is isolated
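The conditions above can be sketched as a simple check from the master's point of view. This is only an illustrative model, not the actual HA implementation: the `HostObservation` type, its field names, and the two functions are hypothetical and exist purely to show how the signals combine.

```python
from dataclasses import dataclass


@dataclass
class HostObservation:
    """What the master can observe about one host (hypothetical model)."""
    network_heartbeat: bool       # network heartbeats received from the host
    datastore_heartbeat: bool     # datastore heartbeats seen for the host
    responds_to_ping: bool        # management address answers a ping
    isolation_flag_written: bool  # host told the master via the datastore that it is isolated
    file_locks_held: bool         # host still holds the locks on its VM files


def master_sees_host_as_failed(obs: HostObservation) -> bool:
    """The host looks failed only when every sign of life is absent."""
    return not (
        obs.network_heartbeat
        or obs.datastore_heartbeat
        or obs.responds_to_ping
        or obs.isolation_flag_written
    )


def restart_can_succeed(obs: HostObservation) -> bool:
    """A restart elsewhere also needs the VM file locks to be released."""
    return master_sees_host_as_failed(obs) and not obs.file_locks_held


# Fully isolated host on iSCSI / NFS: management and storage networks are both gone.
fully_isolated = HostObservation(False, False, False, False, False)

# Host that lost only its management network but can still reach storage.
storage_reachable = HostObservation(False, True, False, True, True)

print(restart_can_succeed(fully_isolated))     # True
print(restart_can_succeed(storage_reachable))  # False
```

The second case shows why the storage type matters in this example: as long as the host can still reach its datastores, it keeps heartbeating and holding its locks, so the master has no reason (and no ability) to restart the VMs elsewhere.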
On top of that, the isolated host will take no action, as "leave powered on" was selected as the isolation response. In other words, the virtual machines running on the isolated host will simply remain up and running. Because it appears to the master that the host has failed, it will initiate the restart of the impacted VMs. Because the host is fully isolated, including from the storage network, the VMs can be powered on elsewhere once the "file lock" that the isolated host held times out.
Now as soon as the isolated host returns, you will have two instances of the same VM on the network. However, only one of these has disk access. The one which doesn't have disk access will automatically be killed by the host it is running on. This was introduced in vSphere 4 U2 and still applies today.
Of course, this whole situation can be prevented: just change the "isolation response" to "power off", which is what we recommend!
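To make the difference between the two isolation responses concrete, here is a minimal walk-through of the sequence of events for a single VM on a fully isolated host. It is a simplified sketch under the assumptions of this post (full isolation including storage, so the file lock times out); the function name, host labels, and result keys are all made up for illustration.

```python
def simulate_full_isolation(isolation_response: str) -> dict:
    """Walk through the events for one VM on a fully isolated host."""
    if isolation_response not in ("leave powered on", "power off"):
        raise ValueError(f"unknown isolation response: {isolation_response}")

    # Step 1: the isolated host applies its isolation response.
    running_on_isolated_host = isolation_response == "leave powered on"

    # Step 2: the host cannot reach storage, so its file lock times out.
    # The master sees the host as failed and restarts the VM on another
    # host, which now holds the lock on the VM's files.
    lock_holder = "other-host"
    running_on_other_host = True

    # Step 3: the isolated host returns to the network.
    duplicate_on_return = running_on_isolated_host and running_on_other_host

    # Step 4: the instance without disk access is killed by its own host.
    killed_on_isolated_host = False
    if running_on_isolated_host and lock_holder != "isolated-host":
        running_on_isolated_host = False
        killed_on_isolated_host = True

    return {
        "duplicate_on_return": duplicate_on_return,
        "killed_on_isolated_host": killed_on_isolated_host,
        "final_host": lock_holder,
    }


for response in ("leave powered on", "power off"):
    print(response, "->", simulate_full_isolation(response))
```

In both cases the VM ends up running on the other host. The difference is that with "leave powered on" there is a window in which two instances exist on the network before the stale one is killed, while with "power off" that window never opens.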