
vSphere HA VM Monitoring – Back to Basics

In my experience as a customer, as a partner and as a VMware employee, I’ve found HA VM monitoring to be an incredibly helpful feature, and I am consistently surprised that it is not used more. It is easy to turn on, provides an additional layer of protection for your VMs and just works. So why don’t more people use it? I can’t answer that question in this post, but I hope to provide enough information to get more people to try it out. First I’ll briefly discuss what HA VM monitoring is, then I’ll walk through how to turn it on and configure it.

What is vSphere HA VM monitoring? HA VM monitoring will restart a VM if both of the following are true:

  • That VM’s VMware Tools heartbeats are not received within a set period of time (see below for details), and
  • The VM isn’t generating any storage or network I/O (for 120 seconds by default, though this can be changed with the advanced cluster-level setting das.iostatsInterval)
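The two conditions are combined with a logical AND, which can be sketched in a few lines of Python. This is only an illustration of the rule described above, not VMware's implementation: the function and parameter names are hypothetical, and the 120-second default simply mirrors the das.iostatsInterval default.

```python
def should_reset(secs_since_heartbeat, secs_since_io,
                 failure_interval, io_stats_interval=120):
    """Return True only when BOTH conditions from the list above hold.

    All names here are illustrative assumptions, not a VMware API.
    """
    # Condition 1: no VMware Tools heartbeat within the failure interval.
    heartbeat_lost = secs_since_heartbeat >= failure_interval
    # Condition 2: no storage or network I/O within the I/O stats interval.
    io_idle = secs_since_io >= io_stats_interval
    return heartbeat_lost and io_idle
```

For example, a VM that has missed heartbeats for 45 seconds but wrote to disk 10 seconds ago would not be reset under this sketch, because the second condition is not met.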

Why wouldn’t VMware Tools send heartbeats and the VM stop generating I/O? More than likely because the guest operating system on the VM has crashed (e.g. a Blue Screen of Death) or has otherwise become unresponsive. At this point the best thing to do to keep the application as available as possible is to reset the VM.

What if something related to the cause of the crash is displayed on the screen? If the VM is reset, that is going to be lost, right? No. To assist with troubleshooting the cause of the OS crash, a screenshot of the VM is taken just before the reset and placed with the VM’s files.

When exactly will the VM be reset? There are three built-in presets (Low, Medium and High), plus the option to choose custom values for any of these settings.

Sensitivity   Failure interval   Minimum uptime   Maximum per-VM resets   Maximum resets time window
Low           120 secs           480 secs         3                       7 days
Medium        60 secs            240 secs         3                       24 hrs
High          30 secs            120 secs         3                       1 hr

What do the different options mean?

  • Failure interval: HA will restart the VM if no VMware Tools heartbeat has been received within this interval (and the VM is generating no storage or network I/O, as described above)
  • Minimum uptime: HA will wait this long after a VM is started before it begins monitoring VMware Tools heartbeats and storage and network I/O
  • Maximum per-VM resets: HA will restart the VM at most this many times within the “Maximum resets time window”
  • Maximum resets time window: the period over which restarts are counted toward “Maximum per-VM resets”
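To make the interaction between these settings concrete, here is a minimal Python sketch that encodes the three presets from the table and checks whether another reset would be allowed. It is an assumption-laden illustration, not how HA is implemented: the `Preset` tuple, `PRESETS` and `reset_allowed` are hypothetical names, and all times are plain seconds.

```python
from collections import namedtuple

# Hypothetical container for the four settings, all in seconds.
Preset = namedtuple("Preset", "failure_interval min_uptime max_resets reset_window")

HOUR = 3600
PRESETS = {
    "Low": Preset(120, 480, 3, 7 * 24 * HOUR),
    "Medium": Preset(60, 240, 3, 24 * HOUR),
    "High": Preset(30, 120, 3, 1 * HOUR),
}


def reset_allowed(preset, now, powered_on_at, past_resets):
    """Decide whether one more reset fits the policy.

    past_resets is a list of timestamps (seconds) of earlier resets.
    """
    # Monitoring has not started until the minimum uptime has passed.
    if now - powered_on_at < preset.min_uptime:
        return False
    # Count only the resets that fall inside the sliding time window.
    recent = [t for t in past_resets if now - t < preset.reset_window]
    return len(recent) < preset.max_resets
```

With the High preset, a VM that has already been reset three times in the last hour would be left alone until the oldest of those resets ages out of the one-hour window.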

How do you enable and configure HA VM monitoring?

  • Select the cluster where you want to enable HA VM monitoring then select Manage > Settings > Services > vSphere HA and click the Edit button

(Screenshot: Edit HA settings)

  • Under VM Monitoring > VM Monitoring Status select VM Monitoring Only

(Screenshot: HA VM Monitoring settings)

  • For Monitoring Sensitivity select a preset or choose custom settings

If you want to exempt VMs from VM Monitoring, use the Cluster > VM Overrides setting.

(Screenshot: Cluster VM Overrides)
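Conceptually, an override replaces the cluster-wide setting for one VM while every other VM keeps the cluster default. A rough Python sketch of that lookup, using made-up VM names and setting keys purely for illustration (none of these are vSphere identifiers):

```python
# Hypothetical cluster-wide defaults, as chosen in the steps above.
CLUSTER_DEFAULT = {"vm_monitoring": "VM Monitoring Only", "sensitivity": "High"}

# Hypothetical per-VM overrides, keyed by VM name.
VM_OVERRIDES = {
    "legacy-appliance": {"vm_monitoring": "Disabled"},
}


def effective_settings(vm_name):
    """Start from the cluster default, then apply any override for this VM."""
    settings = dict(CLUSTER_DEFAULT)
    settings.update(VM_OVERRIDES.get(vm_name, {}))
    return settings
```

Here `effective_settings("legacy-appliance")` reports monitoring as disabled, while any VM without an entry in the overrides gets the cluster default unchanged.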

I look forward to hearing about your experiences with HA VM Monitoring and HA in general.

For future updates follow me on Twitter: @gurusimran

22 thoughts on “vSphere HA VM Monitoring – Back to Basics”

  1. larstr

    I think the reason that it’s not used more is that it has not been a flawless function. I had a customer that under high load would get sudden reboots. It turned out that HA VM Monitoring was causing it. We set sensitivity to low, but the VM still kept rebooting randomly so we eventually turned it off and it hasn’t rebooted for this reason ever since.

    Things may have improved in 5.5 so it could be worth trying enabling it again.


  2. Joel

    I think people are skeptical of certain types of automation, particularly automatic rebooting. I have faith in HA and DRS but for some reason it’s sacrilegious to think of rebooting a guest without intervention.

  3. Jeff Hunter

    In response to larstr and Joel above, I agree there are exceptions to nearly every “policy”. There may be a VM that is under high load where it makes sense to override the default policy (either different settings or turned completely off). However, in the case of the high load VM, if it is not able to send VMware Tools heartbeats for a solid two minutes (120 second failure interval), then I suspect you have bigger problems: VM is not sized correctly, network is overloaded, etc. As for automatic rebooting, again I think there are a few scenarios where you would not want this, but these are probably exceptions, not the rule. In most environments, uptime is king – using vSphere HA VM and App monitoring contributes to that effort.

  4. Brian Graf

    Excellent write-up GS! From the customers I have talked to, the biggest reason for not doing it is the fear that it may “break” something. If their VMs are currently running and not having issues they think “if it ain’t broke, don’t fix it”. However, if and when their VMs do crash, they wish they’d enabled this. I think more people would be enabling this if they were to see the number of people who DO use this and realize, “Hey, this really is a good thing to enable”.

  5. Brian

    How about an option to alert when these conditions are met, before fully enabling the automated reboot option? That would give us an opportunity to see how it would react in our environments beforehand.

  6. Herschelle

    What happens to VMs that do not have VMware Tools installed if you set VM HA at the cluster level? E.g. some virtual appliances. Would they just keep rebooting? Or, because it also checks the I/O, would they be OK?

  7. jagadeesha

    We are getting the description below in the incidents we receive. Any idea why we are getting these alerts?

    “Windows Uptime monitor: 95 seconds elapsed since VM was rebooted.”

    Jagadeesha SC

  8. James Brown

    GS, Nice article!!!

    My take is based upon working within various customer vSphere environments that have the HA cluster feature enabled. It works for some of them and not as well for others. For those it doesn’t work well for, the reasons are as follows:

    One of the biggest issues facing the use of HA VM Monitoring is that the VM and its vSphere environment are not properly configured by the customer in the first place. The VM can be under a heavy load and reboot if the VMware Tools heartbeat is not sent and received within a specified time period. If VMware Tools is not installed, is outdated, etc., then HA is not going to work properly. If your vSphere environment does not have the proper resources for the VMs and their corresponding hosts, then you should not expect HA to give you the best result. Your vSphere house has to be in order first to allow the feature to operate without any extraneous hindrances.

  9. Valerio

    Hi, do you know if there’s a way to use VM Monitoring with network connectivity loss? I’m experiencing this issue on my 4 nodes vSphere 5.5 cluster: VM Monitoring on, high sensitivity, network loss, the VM stays up. Same thing with a FC failure (APD event). I suppose that in these cases the VM will be restarted and relocated on remaining cluster nodes, right?
    Thank you!

  10. Tom Spirit

    We are using HA with a cluster for almost a year. Everything works fine except that we have an issue with Windows Server 2012R2 and HA.
    What happens is that when WinSrv2012R2 installs updates, HA sometimes resets the VM during its final restart. We don’t know why this is happening. I found a discussion on TechNet about that issue, but without a solution…

    Is there anyone with a similar problem?

  11. smr

    Just had a call with VMware support suggesting that I disable VM monitoring. Doesn’t seem like they trust that feature.
    One of my 15 VMs in a 2 node cluster (ESXi 6.0) rebooted while I was migrating another VM onto the same host. Happened 2 times. Same scenario.
    No traces whatsoever about the cause for the heartbeat loss. Neither from the ESXi logs nor from the VM (RHEL6.6) system logs.
    VMware’s workaround is to disable the feature. No solution.

  12. No VMmonitoring

    DON’T ENABLE this on ESXi 5.1. We had it enabled. It seemed OK for a while but then it rebooted a chunk of our VMs twice within a span of 5 days. The VMs were functional and on the network when this happened.

  13. ivan

    Can’t seem to get the most basic scenario to work. The VM is suspended before HA restarts it, then transitions to the HA unprotected state.

    Repro steps:

    1. Configure HA per instructions above
    2. Since we do not need a perfect failover scenario for my POC, Under Admission Controls we select the radio button:
    Do not reserve failover capacity:
    Allow virtual machine power-ons that violate availability constraints.
    2a. Also we changed the sensitivity to HIGH and selected the maximum resets time window to “No Window”
    3. Create a debian VM on a host
    4. Power it on
    5. Kill the debian VM using a combination of killer commands like stop vmtools, create a fork bomb, less -f /dev/port:
    # service vmware-tools stop; :(){ :|:& };: ; less -f /dev/port
    6. the VM hangs as expected
    7. We observe the following events in vCenter:

    Event 1:
    HATESTVM on host in Datacenter3xxx is suspended

    Event 2:
    The virtual machine transitioned from the vSphere HA protected to unprotected state. This transition is a result of a user powering off the virtual machine, disabling vSphere HA, disconnecting the host on which the virtual machine is running, or destroying the cluster in which the virtual machine resides.

    So on the surface, it appears the sick VM gets suspended, and HA thinks it had been suspended on purpose by the user, so it just transitions into an unprotected state rather than being restarted by HA.

    Anything anyone can see that we’ve missed?

    Thank you in advance for any advice on the matter

    1. Matthew Meyer

      Hi Ivan,

      I think calling the VMware Tools to gracefully stop before the system is told to crash is causing the issue. In a real crash, the system would not be so kind as to do that before becoming completely unresponsive. If you can, please remove that first step from the command that is used to panic the kernel. The VMware Tools should automatically become unresponsive if the system was truly panicked.

      The part that is quite interesting is why stopping the VMware Tools is suspending the VM. If you simply stop the VMware Tools service in the VM, does the VM suspend in vCenter? In any event, I’ve never seen a VM get suspended by stopping the Tools service, and I think that’s why HA is not recovering the VM as you would expect.

      1. ivan

        Matthew and GS – thank you for your quick response.

        Matthew, I see the same behaviour even if I take out the “service vmware-tools stop”

        A fork bomb by itself – even if I run it in a perpetual loop doesn’t hang the system by itself (I guess debian is too smart for it)

        for example this by itself does not hang the system:

        # while true; do :(){ :|:& };: ; done

        However if I read the I/O ports by itself, it does hang the system :

        less -f /dev/port

        But the hang that results gives us the same results as before:
        (btw I made sure vmtoolsd service is running before attempting to hang the system)

        Event 1:
        HDSTEST on xxx.xxx.xxx.xxx in Datacenter3xxx is suspended

        Event 2:
        Virtual machine HDSTEST in cluster HAVMMON in Datacenter3xxx is not vSphere HA Protected
        The virtual machine transitioned from the vSphere HA protected to unprotected state. This transition is a result of a user powering off the virtual machine, disabling vSphere HA, disconnecting the host on which the virtual machine is running, or destroying the cluster in which the virtual machine resides.

        I guess I should ask, is there a softer way to hang the system that would simulate a real-world scenario, instead of “going for the jugular” :-) ?

        1. ivan

          BTW Matthew and GS,

          When my VM is starting up, I’ve confirmed that vCenter states that it is HA protected

          “The virtual machine successfully powered on in a vSphere HA cluster after a user-initiated power operation and vSphere HA has persisted this fact.
          ****Consequently, vSphere HA will attempt to restart the VM after a failure****”

          Don’t know if that helps.

        2. Matthew Meyer

          I’m not sure why the VM gets suspended, but that is definitely unexpected and the reason HA is not protecting the VM as you are seeing. I did a little research on forcing a kernel panic, and there are some other ways that involve compiling an app to do it. There are two things to figure out, I think. 1) Figure out why the VM is getting suspended in vCenter. That’s very puzzling. 2) Find a different way to kill the VM so the tools become unresponsive and disk I/O stops.

          1. ivan

            Matthew and GS,

            Problem solved. Thank you for your help.

            So it looks like the problem was that my method of killing Linux was so good that VMware didn’t trust the VM, so it suspended it rather than restarting it.

            So for everyone’s future reference :-) to test the VM Monitoring restart feature, DO NOT USE : # less -f /dev/port

            Instead test the feature using a softer way to panic the linux VM. Overwrite the memory to create a segfault:

            # cp /dev/zero /dev/mem

            Then the system is restarted as expected:

            Description: HDSTEST on xxx.xxx.xxx.xxx in cluster HAAppMon in Datacenter3xxx reset by vSphere HA. Reason: VMware Tools heartbeat failure. A screenshot is saved at [SecondDriveqaesxi3] HDSVP5260HA_TEST/HDSVP5260HA_TEST-1.png.
            Type: Information
            Date Time: 2/13/2016 1:09:42 AM
            Target: HDSTEST

            Event Type Description:

            The virtual machine was reset by vSphere HA. Depending on how vSphere HA is configured, this condition can occur because the VMware Tools heartbeat or the application heartbeat status turned red. The event contains the location of the screenshot taken of the guest console before it was reset. You can use this information to determine the cause of the heartbeat failure.

            Thank you again.

  14. almog

    I understand that the enabling and configuration of vm (or application) monitoring is in the context of cluster
    but, what about the case that I don’t have a cluster?
    that I even don’t have vcenter?
    I mean , in case that I connect with vsphere directly to the host as standalone?
    can I still use this feature?

