By Duncan Epping, Principal Architect.
A while back I wrote this article about a split brain scenario with vSphere HA. Although we have multiple techniques to mitigate these scenarios, it is always better to prevent them. I have blogged about this before, but I figured it wouldn’t hurt to get this out again and elaborate on it a bit more.
First some basics…
What is an “Isolation Response”?
The isolation response refers to the action that vSphere HA takes when the heartbeat network is isolated. The heartbeat network is usually the management network of an ESXi host. When a host does not receive any heartbeats, it triggers the response after a set number of seconds. When exactly? That depends on whether the host is a slave or a master. This is the timeline:
Isolation of a slave
- T0 – Isolation of the host (slave)
- T10s – Slave enters “election state”
- T25s – Slave elects itself as master
- T25s – Slave pings “isolation addresses”
- T30s – Slave declares itself isolated and “triggers” isolation response
Isolation of a master
- T0 – Isolation of the host (master)
- T0 – Master pings “isolation addresses”
- T5s – Master declares itself isolated and “triggers” isolation response
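To make the difference between the two timelines concrete, here is a minimal Python sketch that encodes them as data and reads off when the isolation response triggers. It is only an illustration of the numbers above; the names `SLAVE_TIMELINE`, `MASTER_TIMELINE` and `seconds_until_isolation_response` are made up for this example and are not part of any vSphere API.

```python
# Offsets (in seconds after the isolation event) copied from the timelines above.
# All names here are hypothetical; this does not talk to vSphere HA.
SLAVE_TIMELINE = [
    (0, "isolation of the host (slave)"),
    (10, "slave enters election state"),
    (25, "slave elects itself as master"),
    (25, "slave pings isolation addresses"),
    (30, "slave declares itself isolated and triggers isolation response"),
]

MASTER_TIMELINE = [
    (0, "isolation of the host (master)"),
    (0, "master pings isolation addresses"),
    (5, "master declares itself isolated and triggers isolation response"),
]


def seconds_until_isolation_response(role):
    """Return the offset of the last step, which is the isolation response."""
    timelines = {"slave": SLAVE_TIMELINE, "master": MASTER_TIMELINE}
    return timelines[role][-1][0]


if __name__ == "__main__":
    for role in ("slave", "master"):
        print(role, seconds_until_isolation_response(role), "seconds")
```

In other words, a slave goes through an election first and therefore takes 25 seconds longer than a master to trigger the response.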
What are my options?
Today there are three options for the isolation response. The response is what the host will do with the virtual machines running on it once it has validated that it is isolated.
- Power off – When a network isolation occurs all VMs are powered off. It is a hard stop.
- Shut down – When a network isolation occurs all VMs running on that host are shut down via VMware Tools. If this is not successful within 5 minutes a “power off” will be executed.
- Leave powered on – When a network isolation occurs on the host the state of the VMs remains unchanged.
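The three options can be sketched as a small Python model. This is a simplified illustration, assuming the only behaviour that matters is the one described above, including the 5 minute fallback from “shut down” to “power off”. The enum, the constant and the function name are hypothetical and do not correspond to real vSphere identifiers.

```python
from enum import Enum


class IsolationResponse(Enum):
    # Hypothetical names for the three options described above.
    POWER_OFF = "power off"
    SHUT_DOWN = "shut down"
    LEAVE_POWERED_ON = "leave powered on"


# "Shut down" falls back to "power off" after 5 minutes, as described above.
SHUTDOWN_TIMEOUT_SECONDS = 5 * 60


def actions_for_vm(response, guest_shutdown_seconds=None):
    """Return the ordered actions taken on one VM of an isolated host.

    guest_shutdown_seconds is how long the guest needs to shut down via
    VMware Tools; None means the guest shutdown never completes.
    """
    if response is IsolationResponse.LEAVE_POWERED_ON:
        return []  # VM state remains unchanged
    if response is IsolationResponse.POWER_OFF:
        return ["power off"]  # hard stop
    # SHUT_DOWN: try a guest shutdown first, power off if it takes too long.
    finished_in_time = (
        guest_shutdown_seconds is not None
        and guest_shutdown_seconds <= SHUTDOWN_TIMEOUT_SECONDS
    )
    if finished_in_time:
        return ["guest shutdown"]
    return ["guest shutdown", "power off"]
```

For example, a guest that needs 90 seconds to shut down only sees the guest shutdown, while a guest that hangs gets powered off once the 5 minutes have passed.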
Now that we know what the options are, which one should you use? That depends on your environment. Are you using iSCSI/NAS? Do you have a converged network infrastructure? We’ve put the most common scenarios in a table.
| Likelihood that host will retain access to VM datastores | Likelihood that host will retain access to VM network | Recommended isolation policy | Explanation |
|---|---|---|---|
| Likely | Likely | Leave Powered On | VM is running fine, so why power it off? |
| Likely | Unlikely | Either Leave Powered On or Shutdown | Choose Shutdown to allow HA to restart VMs on hosts that are not isolated and hence are likely to have access to storage |
| Unlikely | Likely | Power Off | Use Power Off to avoid having two instances of the same VM on the VM network |
| Unlikely | Unlikely | Leave Powered On or Power Off | Leave Powered On if the VM can recover from the network/datastore outage if it is not restarted because of the isolation, and Power Off if it likely can’t |
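The table can also be read as a simple lookup, sketched here in Python. The dictionary just restates the four rows above; the function name and the boolean parameters are made up for this example, and where the table lists two policies the choice between them is still yours.

```python
from typing import Tuple

# Key: (datastore access likely?, VM network access likely?)
# Value: the recommended policies from the table above.
RECOMMENDATIONS = {
    (True, True): ("Leave Powered On",),
    (True, False): ("Leave Powered On", "Shutdown"),
    (False, True): ("Power Off",),
    (False, False): ("Leave Powered On", "Power Off"),
}


def recommended_policies(datastore_access_likely: bool,
                         vm_network_access_likely: bool) -> Tuple[str, ...]:
    """Look up the recommended isolation policies for one scenario."""
    return RECOMMENDATIONS[(datastore_access_likely, vm_network_access_likely)]


if __name__ == "__main__":
    # Converged network with IP-based storage: an isolated host is unlikely
    # to keep either its datastores or its VM network.
    print(recommended_policies(False, False))
```

For a fully converged environment this lands on the last row, where the deciding factor is whether the VM can recover from the outage without being restarted.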
But why is this important? Imagine you pick “leave powered on” in a converged network environment with iSCSI storage. Chances are fairly high that when the host management network is isolated, so are the virtual machine network and the storage for your virtual machines. In that case, having the virtual machines restarted will reduce the amount of “downtime” from an “application / service” perspective.
I hope this helps you make the right decision for the vSphere HA isolation response. Although it is just a small part of what vSphere HA does, it is important to understand the impact a wrong decision can have.