By Duncan Epping, Principal Architect, VMware
I had this question last week when I was presenting at the UK VMUG. I figured not everyone was aware of this architectural change, hence this article.
Let's be clear here: “das.failuredetectiontime” is no longer available in vSphere 5.0. I know many of you used this advanced setting to tweak when the host would trigger the isolation response; that is no longer possible. Keep in mind that because “das.failuredetectiontime” is gone, the timing of failovers initiated after an isolation has changed. I have listed the timelines for both the isolation of a slave and the isolation of a master below:
Isolation of a slave:
- T0 – Isolation of the host (slave)
- T10s – Slave enters “election state”
- T25s – Slave elects itself as master
- T25s – Slave pings “isolation addresses”
- T30s – Slave declares itself isolated and “triggers” isolation response
Isolation of a master:
- T0 – Isolation of the host (master)
- T0 – Master pings “isolation addresses”
- T5s – Master declares itself isolated and “triggers” isolation response
After the completion of this sequence, the (new) master will learn the host was isolated and will restart virtual machines based on the information provided by the slave.
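To make the difference between the two timelines easier to compare, here is a minimal Python sketch that simply encodes the offsets listed above and works out when the isolation response fires in each case. The names (SLAVE_TIMELINE, MASTER_TIMELINE, seconds_to_isolation_response) are my own for illustration; this is not an HA API, just the numbers from the lists written down as data.

```python
# Offsets in seconds after the host becomes isolated (T0),
# copied from the two timelines listed above.
SLAVE_TIMELINE = [
    (0, "isolation of the host (slave)"),
    (10, "slave enters election state"),
    (25, "slave elects itself as master"),
    (25, "slave pings isolation addresses"),
    (30, "slave declares itself isolated and triggers isolation response"),
]

MASTER_TIMELINE = [
    (0, "isolation of the host (master)"),
    (0, "master pings isolation addresses"),
    (5, "master declares itself isolated and triggers isolation response"),
]


def seconds_to_isolation_response(timeline):
    """Return the offset of the last event, which is the isolation response."""
    return max(offset for offset, _event in timeline)


def format_timeline(role, timeline):
    """Render a timeline as text lines, one event per line."""
    lines = [f"Isolation of a {role}:"]
    for offset, event in timeline:
        lines.append(f"  T+{offset}s  {event}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_timeline("slave", SLAVE_TIMELINE))
    print(format_timeline("master", MASTER_TIMELINE))
    slave = seconds_to_isolation_response(SLAVE_TIMELINE)
    master = seconds_to_isolation_response(MASTER_TIMELINE)
    print(f"A slave takes {slave - master}s longer than a master.")
```

The extra time in the slave case is the election phase that a master does not have to go through.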
As shown, there is a clear difference. The reason is that when the master isolates there is no need to trigger an election process, whereas a slave needs one to detect whether it is isolated or partitioned. Once again, before the isolation response is triggered the host will validate whether a host will be capable of restarting the virtual machines… no need to incur downtime when it is unnecessary.
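The decision described here can be sketched roughly as follows. This is a simplified model of my explanation, not the actual HA agent logic: the function name and all four parameters are hypothetical, and I am assuming that a slave which sees other hosts during the election is partitioned rather than isolated, and that a reachable isolation address means the host is not isolated.

```python
def should_trigger_isolation_response(
    is_master,
    peers_seen_in_election,
    isolation_address_reachable,
    restart_possible,
):
    """Simplified, hypothetical model of the isolation decision.

    is_master: the host was the master when it lost contact.
    peers_seen_in_election: a slave saw other hosts during the election
        (assumption: that means partitioned, not isolated).
    isolation_address_reachable: a ping to an isolation address succeeded.
    restart_possible: a host will be capable of restarting the VMs.
    """
    # Only a slave goes through an election; a master skips this step.
    if not is_master and peers_seen_in_election:
        return False  # partitioned, so no isolation response

    # A reachable isolation address means the host is not isolated.
    if isolation_address_reachable:
        return False

    # Isolated: only act if the VMs can actually be restarted,
    # otherwise the response would cause unnecessary downtime.
    return restart_possible


if __name__ == "__main__":
    # Isolated slave, restart possible elsewhere.
    print(should_trigger_isolation_response(False, False, False, True))
    # Slave that still sees peers during the election.
    print(should_trigger_isolation_response(False, True, False, True))
```

The last check reflects the point above: if no host could restart the virtual machines, triggering the response would only take them down for nothing.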
I hope this helps provide a better understanding of the process.