
vSphere HA VM Monitoring – Back to Basics

In my experience as a customer, as a partner and as a VMware employee, I’ve found HA VM monitoring to be an incredibly helpful feature, and I am consistently surprised that it is not used more. It is easy to turn on, provides an additional layer of protection for your VMs and just works. So why don’t more people use it? I can’t answer that question in this post, but I hope to provide enough information to get more people to try it out. First I’ll briefly discuss what HA VM monitoring is, then I’ll walk through how to turn it on and configure it.

What is vSphere HA VM monitoring? HA VM monitoring will restart a VM if both of the following are true:

  • The VM’s VMware Tools heartbeats are not received within a set period of time (the failure interval, see below for details), and
  • The VM isn’t generating any storage or network I/O (for 120 seconds by default, though this can be changed using the cluster-level advanced setting das.iostatsInterval)
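The two conditions above combine with a logical AND, which can be sketched in a few lines of Python. This is only an illustration of the rule as described in this post, not VMware's implementation: the class, function and parameter names are hypothetical, and only the 120-second default and the das.iostatsInterval setting name come from the text.

```python
from dataclasses import dataclass

# Default I/O quiet period from the post (cluster advanced setting das.iostatsInterval).
DEFAULT_IOSTATS_INTERVAL_SECS = 120


@dataclass
class VmObservation:
    """Hypothetical snapshot of what the monitor knows about one VM."""
    secs_since_last_heartbeat: float  # time since the last VMware Tools heartbeat
    secs_since_last_io: float         # time since any storage or network I/O


def should_reset(obs, failure_interval_secs,
                 iostats_interval_secs=DEFAULT_IOSTATS_INTERVAL_SECS):
    """Return True only when heartbeats are missing AND the VM shows no I/O."""
    heartbeat_lost = obs.secs_since_last_heartbeat >= failure_interval_secs
    io_idle = obs.secs_since_last_io >= iostats_interval_secs
    return heartbeat_lost and io_idle
```

With a 30-second failure interval, a VM that has missed heartbeats for 45 seconds but performed I/O 10 seconds ago would not be reset under this sketch. The I/O check is what protects a busy VM whose heartbeats are merely delayed.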

Why would VMware Tools stop sending heartbeats and the VM stop generating I/O? Most likely because the guest operating system on the VM has crashed (e.g. a Blue Screen of Death) or become otherwise unresponsive. At that point, the best way to keep the application as available as possible is to reset the VM.

What if something related to the cause of the crash is displayed on the screen? If the VM is reset, that is going to be lost, right? No. To assist with troubleshooting the cause of the OS crash, a screenshot of the VM is taken just before the reset and placed with the VM’s files.

When exactly will the VM be reset? There are three built-in presets (Low, Medium and High), plus the option to select custom values for any of these settings.

Sensitivity | Failure interval | Minimum uptime | Maximum per-VM resets | Maximum resets time window
Low         | 120 secs         | 480 secs       | 3                     | 7 days
Medium      | 60 secs          | 240 secs       | 3                     | 24 hrs
High        | 30 secs          | 120 secs       | 3                     | 1 hr

What do the different options mean?

  • Failure interval: HA will restart the VM if the VM heartbeat has not been received within this interval
  • Minimum uptime: HA will wait this long after a VM is started before it begins monitoring for VMware Tools heartbeats, storage I/O and network I/O
  • Maximum per-VM resets: HA will restart the VM a maximum of this many times within the “Maximum resets time window”
  • Maximum resets time window: the period over which resets are counted against “Maximum per-VM resets” (see above)
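To make the interaction between these four settings concrete, here is a minimal Python sketch. The preset values are copied from the table above; everything else (the `Sensitivity` class, the function names, and the assumption that the reset window is a sliding window counted back from the current time) is hypothetical and meant only to illustrate the definitions, not to describe how HA is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sensitivity:
    failure_interval_secs: int  # heartbeat silence that triggers a reset
    min_uptime_secs: int        # grace period after power-on before monitoring
    max_resets: int             # maximum per-VM resets inside the window
    reset_window_secs: int      # maximum resets time window


# Values copied from the preset table in the post.
PRESETS = {
    "Low": Sensitivity(120, 480, 3, 7 * 24 * 3600),
    "Medium": Sensitivity(60, 240, 3, 24 * 3600),
    "High": Sensitivity(30, 120, 3, 3600),
}


def monitoring_active(uptime_secs, s):
    """Monitoring begins only after the VM has been up for the minimum uptime."""
    return uptime_secs >= s.min_uptime_secs


def reset_allowed(now_secs, past_reset_times, s):
    """True if fewer than max_resets resets fall inside the (assumed sliding) window."""
    recent = [t for t in past_reset_times if now_secs - t < s.reset_window_secs]
    return len(recent) < s.max_resets
```

For example, with the High preset a VM that has already been reset three times in the last hour would not be reset a fourth time under this sketch until the oldest reset falls outside the one-hour window.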

How do you enable and configure HA VM monitoring?

  • Select the cluster where you want to enable HA VM monitoring then select Manage > Settings > Services > vSphere HA and click the Edit button

[Screenshot: Edit HA settings]

  • Under VM Monitoring > VM Monitoring Status select VM Monitoring Only

[Screenshot: HA VM Monitoring settings]

  • For Monitoring Sensitivity select a preset or choose custom settings

If you want to exempt specific VMs from VM Monitoring, use the Cluster > VM Overrides setting.

[Screenshot: Cluster VM Overrides]
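Conceptually, a VM override takes precedence over the cluster-wide setting for the VMs it names, and every other VM inherits the cluster default. A rough Python sketch of that lookup follows. The VM name is made up, the "Disabled" value is an assumed label for an exempted VM, and the function is purely illustrative; it is not a vSphere API.

```python
# Cluster-wide choice made under VM Monitoring > VM Monitoring Status.
CLUSTER_DEFAULT = "VM Monitoring Only"

# Hypothetical per-VM overrides; the VM name and the "Disabled" label are assumptions.
VM_OVERRIDES = {
    "legacy-appliance-01": "Disabled",
}


def effective_vm_monitoring(vm_name, overrides=VM_OVERRIDES,
                            cluster_default=CLUSTER_DEFAULT):
    """Return the override for this VM if one exists, else the cluster default."""
    return overrides.get(vm_name, cluster_default)
```

Here `effective_vm_monitoring("legacy-appliance-01")` yields the override, while any VM without an entry falls back to the cluster setting.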

I look forward to hearing about your experiences with HA VM Monitoring and HA in general.

For future updates follow me on Twitter: @gurusimran

13 thoughts on “vSphere HA VM Monitoring – Back to Basics”

  1. larstr

    I think the reason that it’s not used more is that it has not been a flawless function. I had a customer that under high load would get sudden reboots. It turned out that HA VM Monitoring was causing it. We set sensitivity to low, but the VM still kept rebooting randomly so we eventually turned it off and it hasn’t rebooted for this reason ever since.

    Things may have improved in 5.5 so it could be worth trying enabling it again.


  2. Joel

    I think people are skeptical of certain types of automation, particularly automatic rebooting. I have faith in HA and DRS but for some reason it’s sacrilegious to think of rebooting a guest without intervention.

  3. Jeff Hunter

    In response to larstr and Joel above, I agree there are exceptions to nearly every “policy”. There may be a VM that is under high load where it makes sense to override the default policy (either different settings or turned completely off). However, in the case of the high load VM, if it is not able to send VMware Tools heartbeats for a solid two minutes (120 second failure interval), then I suspect you have bigger problems: VM is not sized correctly, network is overloaded, etc. As for automatic rebooting, again I think there are a few scenarios where you would not want this, but these are probably exceptions, not the rule. In most environments, uptime is king – using vSphere HA VM and App monitoring contributes to that effort.

  4. Brian Graf

    Excellent write-up GS! From the customers I have talked to, the biggest reason for not doing it is the fear that it may “break” something. If their VM’s are currently running and not having issues they think “if it ain’t broke, don’t fix it”. However, if and when their VM’s do crash, they wish they’d enabled this. I think more people would be enabling this if they were to see the number of people who DO use this and realize, “Hey, this really is a good thing to enable”.

  5. Brian

    How about an option to alert when these conditions are met, before fully enabling the automated reboot option? That would give us an opportunity to see how it would react in our environments beforehand.

  6. Herschelle

    What happens to VMs that do not have VMware Tools installed if you set VM monitoring at the cluster level? E.g. some virtual appliances. Would they just keep rebooting? Or, because it also checks the I/O, would they be OK?

  7. jagadeesha

    We are getting the description below in incidents we have received. Any idea why we are getting these alerts?

    “Windows Uptime monitor: 95 seconds elapsed since VM was rebooted.”

    Jagadeesha SC

  8. James Brown

    GS, Nice article!!!

    My take is based upon working within various customer vSphere environments with the HA cluster feature enabled. It works for some of them and not as well for others. For those for whom it doesn’t work well, the reasons are as follows:

    One of the biggest issues facing the use of HA VM Monitoring is that the VM and its vSphere environment are not properly configured by the customer in the first place. The VM can be under a heavy load and reboot if the VMware Tools heartbeat is not sent and received within a specified time period. If VMware Tools is not installed, is outdated, etc., then HA is not going to work properly. If your vSphere environment does not have the proper resources for the VMs and their corresponding hosts, then you should not expect HA to give you the best result. Your vSphere house has to be in order first to allow the feature to operate without any extraneous hindrances.

  9. Valerio

    Hi, do you know if there’s a way to use VM Monitoring with network connectivity loss? I’m experiencing this issue on my 4 nodes vSphere 5.5 cluster: VM Monitoring on, high sensitivity, network loss, the VM stays up. Same thing with a FC failure (APD event). I suppose that in these cases the VM will be restarted and relocated on remaining cluster nodes, right?
    Thank you!

  10. Tom Spirit

    We have been using HA with a cluster for almost a year. Everything works fine except that we have an issue with Windows Server 2012 R2 and HA.
    When Windows Server 2012 R2 installs updates, HA sometimes resets the VM on its final restart. We don’t know why this is happening. I found a discussion on TechNet about the issue, but without a solution…

    Is there anyone with a similar problem?

  11. smr

    Just had a call with VMware support, who suggested disabling VM monitoring. Doesn’t seem like they trust that feature.
    One of my 15 VMs in a 2 node cluster (ESXi 6.0) rebooted while I was migrating another VM onto the same host. Happened 2 times. Same scenario.
    No traces whatsoever about the cause for the heartbeat loss. Neither from the ESXi logs nor from the VM (RHEL6.6) system logs.
    VMware’s workaround is to disable the feature. No solution.

  12. No VMmonitoring

    DON’T ENABLE this on ESXi 5.1. We had it enabled. It seemed OK for a while, but then it rebooted a chunk of our VMs twice within a span of 5 days. The VMs were functional and on the network when this happened.

