
PSODs and VMware HA

Despite our best efforts here at VMware, there are occasions where a PSOD (the purple diagnostic screen) may occur.  PSODs can occur for a wide range of reasons, including out-of-memory and hung-CPU conditions.  In an HA-protected environment, if a PSOD does occur, you would expect the VMs that were running on the affected host to be failed over to another host.  There is a corner case, however, where this does not happen.  Let me explain why:
 
Before I go into details here, let me first make a very important point absolutely clear.  What I am about to describe is a rare corner case.  Odds are you have not seen it and never will.  It is possible, however, and for those who have experienced it, I hope that this makes things a bit clearer.
 
ESX, as you know, has what we refer to as the ‘COS’, or Service Console.  Sometimes, the COS may become unresponsive.  This can be due to a variety of reasons, but commonly it is due to an issue with memory.  For example, there may be a memory leak in a process, a third-party application consuming a large amount of memory, or an excessive number of processes running in the COS.
 
When the COS becomes unresponsive or throws an error, the VMkernel detects this and starts to generate a core dump.  This core dump is invaluable for troubleshooting and basically consists of an entire dump of memory as well as some additional overhead.  Of course, this is all compressed so as to save space.  During the time that the VMkernel is performing the dump, it tries to keep everything exactly like it was when the problem with the COS occurred until after the dump completes.  This includes the locks on any datastores used by VMs that were running.  Once this process is finished, the user will see a PSOD.
 
Normally, this core dump occurs very rapidly – less than a minute in most cases.  In rare cases though, it might take longer.  I’ve heard rumors that some people have seen these core dumps take as long as 20 minutes.  This is where the problem starts for people with VMware HA enabled.
 
In this scenario, only the COS has an issue.  The VMkernel and everything else are completely functional.  This means that the host can still respond to ICMP pings and the like.  Since the COS is not functional though, VMware HA will detect this as a failure of the system and try to restart that host's VMs on another system in the cluster.   However, the ‘failed’ system is not completely failed until it completes the core dump.   If VMware HA tries to start a VM on another host, the attempt will fail because that VM's files are still locked, or in use, by the failed system.  VMware HA tries multiple times to restart the VMs.  If for some reason though, the core dump takes an extreme amount of time to complete, VMware HA may time out and give up trying to restart the VMs.  The end result here is that once the failed system does PSOD, it appears that VMware HA failed, as the VMs are not restarted.
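
To make the timing concrete, here is a minimal sketch of the race between the core dump and the HA restart attempts.  It is only an illustration: the retry count, the fixed retry interval, and the names RetryPolicy and restart_outcome are assumptions made up for this example, not VMware HA's actual settings or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetryPolicy:
    """Hypothetical restart policy; the values are NOT real VMware HA defaults."""
    max_attempts: int   # how many times the restart is tried before giving up
    interval_s: float   # assumed fixed wait between attempts, in seconds


def restart_outcome(dump_duration_s: float, policy: RetryPolicy) -> Optional[int]:
    """Return the 1-based attempt on which the restart succeeds, or None.

    Assumption: the datastore locks are held from t=0 (failure detected)
    until the core dump finishes at t=dump_duration_s, and a restart
    attempt succeeds only if it is made after the locks are released.
    """
    for attempt in range(1, policy.max_attempts + 1):
        attempt_time_s = (attempt - 1) * policy.interval_s
        if attempt_time_s >= dump_duration_s:
            return attempt  # locks released, the VM can start elsewhere
    return None  # every attempt hit the lock; the restart is abandoned


if __name__ == "__main__":
    policy = RetryPolicy(max_attempts=5, interval_s=60.0)  # illustrative values
    for label, duration_s in [("45-second dump", 45.0), ("20-minute dump", 20 * 60.0)]:
        print(label, "->", restart_outcome(duration_s, policy))
```

With these made-up numbers, a dump that finishes in under a minute costs one failed attempt before the restart goes through, while a 20-minute dump outlasts every attempt and the VMs stay down.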
 
How can you prevent this from happening to you? 
 
One possible action is to not run anything within the COS.  By doing this, you will eliminate the possibility of an application or script that has not been thoroughly tested by VMware causing issues within the COS.  
 
Another solution would be to disable the ability of the VMkernel to perform a core dump.  This is not a very viable solution for many, as doing this eliminates critical information needed to perform an RCA (root cause analysis).  Thus, you might not be able to get to the root cause of your problem.  I’d only recommend doing this in the rare case where you have a known issue with a server but are unable to fix it immediately. 
 
The simplest solution is to use ESXi.  As ESXi provides a simpler, more secure environment without a COS, this problem simply doesn’t exist. 
 
For more information, I would recommend looking at the following KB articles:

VMware ESX and ESXi 4.1 Comparison
Configuring an ESX host to capture a Service Console coredump
Understanding a "Lost Heartbeat" purple diagnostic screen
Understanding an "Oops" purple diagnostic screen