Architecture

What is vCenter Server Watchdog?

If you’ve done any research into the high-availability options available for vCenter Server 6.0, hopefully you have had a chance to read the VMware vCenter Server 6.0 Availability Guide written in collaboration with Technical Marketing and Global Support Services as well as KB 1024051. And you might have noticed particular sections that refer to the vCenter Server Watchdog. But what exactly is the vCenter Server Watchdog?

Enabled “out of the box” in 6.0, the vCenter Server Watchdog provides better availability by periodically verifying the status of vCenter Server. It does this in two ways:

  • The PID Watchdog monitors the processes running on vCenter Server.
  • The API Watchdog uses the vSphere API to monitor the functionality of vCenter Server.

If any services fail, the Watchdog attempts to restart them. If it cannot restart the service because of a host failure, vSphere HA restarts the virtual machine running the service on a new host.

That sounds slick, right? Well, let’s dive in and take a look at each of these watchdogs in detail.

PID WATCHDOG

First up is the PID Watchdog. A watchdog initializes alongside each vCenter Server service at runtime. The PID Watchdog monitors only services that are actively running: once a service is gracefully stopped, the watchdog no longer monitors or restarts it. The PID Watchdog detects only that a process with the correct executable is in the process table; it does not determine whether the process is ready to service requests (e.g. the vSphere Web Client). That is left to the API Watchdog (more on that later).
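To make the "is it in the process table?" distinction concrete, here is a minimal sketch in Python. It is not how the VMware watchdogs are implemented; the function name is ours, it only checks that a PID exists (it does not compare the executable name), and it assumes a POSIX system such as the appliance.

```python
import os


def pid_in_process_table(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        # Signal 0 performs the existence/permission check without
        # actually delivering a signal to the target process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


if __name__ == "__main__":
    # The interpreter running this script is, of course, alive.
    print(pid_in_process_table(os.getpid()))
```

Note what the check cannot tell us: a process can be present in the process table and still be hung or unable to answer requests, which is exactly the gap the API Watchdog is meant to cover.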

There are four PID Watchdogs in vCenter Server 6.0. Which one is used depends on the service it protects and the platform form factor:

  1. vmware-watchdog:
    This watchdog detects failures and restarts all non-Java based services on the VCSA.
  2. Java Service Wrapper:
    This watchdog detects failures and restarts all Java based services on the VCSA and Windows.
  3. Likewise Service Manager:
    This watchdog detects failures and restarts all non-Java (C) based platform services.
  4. Windows Service Control Manager:
    This watchdog detects failures and restarts all non-Java based services on Windows.

Each vCenter Server process has a separate PID Watchdog process associated with it. In this post, we will take a look at those in the vCenter Server Appliance.

vmware-watchdog

This watchdog is a shell script (/usr/bin/watchdog) found on the VCSA that is used to detect a service failure in non-Java (C) based services on the appliance form factor. A service start automatically starts the Watchdog along with the service itself. Let’s search for the running processes that match for “vmware-watchdog.” 

Let’s break that down a bit into something more readable: 

Service                         Process Name      Virtual Machine Restart?  Minimal Uptime  Number of Restarts
Reverse Proxy                   rhttpproxy        No                        30 seconds      5
vCenter Management Web Service  vws               No                        30 seconds      5
Syslog Collector                Syslog            No                        30 seconds      5
vPostgres Database              vmware-vpostgres  No                        5 minutes       2
vCenter Server                  vpxd              Yes                       1 hour          2
VSAN Health Check               vsan-health       No                        10 minutes      10

As an example, here we can see that vmware-watchdog is running with a couple of parameters, which differ for each service process. Let’s dig into the VPXD process since it’s the most important service. It shows the following parameters:

These parameters result in the following: the service named vpxd (-s vpxd) is started and monitored for failures, and it will be restarted at most twice (-q 2). If it fails a third time within the minimal uptime of 3600 seconds, or one hour (-u 3600), the virtual machine is rebooted (-a).
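As a reading aid, the sketch below parses just the four flags discussed above (-s, -q, -u, -a) and turns them into a plain-English summary. It is purely illustrative: the function names and the wording of the summary are ours, the real vmware-watchdog is a shell script with more options than these four, and any flag not mentioned in this post is simply ignored here.

```python
import argparse


def parse_watchdog_flags(argv):
    """Parse only the four flags described in the text; ignore anything else."""
    parser = argparse.ArgumentParser(prog="vmware-watchdog", add_help=False)
    parser.add_argument("-s", dest="service")                  # service name
    parser.add_argument("-q", dest="max_restarts", type=int)   # restart limit
    parser.add_argument("-u", dest="min_uptime", type=int)     # seconds
    parser.add_argument("-a", dest="reboot_vm", action="store_true")
    flags, _ignored = parser.parse_known_args(argv)
    return flags


def describe(flags):
    """Summarize the parsed flags in one sentence."""
    final_step = (
        "the virtual machine is rebooted"
        if flags.reboot_vm
        else "no virtual machine reboot is requested"
    )
    return (
        f"{flags.service}: restarted at most {flags.max_restarts} time(s); "
        f"on a further failure within {flags.min_uptime} seconds of uptime, "
        f"{final_step}"
    )


if __name__ == "__main__":
    vpxd = parse_watchdog_flags(["-s", "vpxd", "-q", "2", "-u", "3600", "-a"])
    print(describe(vpxd))
```

Running it against the vpxd parameters reproduces the behavior described above: two restarts, then a reboot if the failure recurs within the one-hour window.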

A full list of the parameters that may be used by the vmware-watchdog is provided below:

*** Note: The details provided above are for educational purposes only. Do not make changes to service parameters or the vmware-watchdog script unless instructed to do so by VMware Global Support Services. ***

Java Service Wrapper 

The Java Service Wrapper is a watchdog used to detect service failures and restart Java-based services. It is based on the Tanuki Java Service Wrapper, a third-party service wrapper that enables a Java application to be run as a Windows service or UNIX daemon and allows for health monitoring of an application and its JVM. Let’s search for the running processes that match for “tanuki.”

When a Java-based service for vCenter Server starts, it automatically starts a wrapper process to monitor the service process and its JVM. The wrapper process restarts the JVM if it crashes and if the wrapper process crashes.

Likewise Service Manager

Now, let’s dig into the Likewise Service Manager watchdog. The third-party Likewise Open stack from BeyondTrust includes the Likewise Service Manager and is instantiated by the lwsmd daemon. Aside from the services that come with the Likewise stack (such as lsass, netlogon, lwio, …), the Likewise Service Manager is responsible for the VMware Directory Service (vmdir), VMware Authentication Framework (vmafd, which contains VECS), and VMware Certificate Authority (vmca).

Likewise Service Manager monitors the processes it starts and will restart them if they terminate unexpectedly. This means that if a service like vmdir crashes, exits due to an error, or is told to terminate by a process outside of the Likewise Service Manager, it will be restarted.

The command to list, start, stop and restart services managed by the Likewise Service Manager is ‘/opt/likewise/bin/lwsm’.

Let’s list all the processes managed by Likewise Service Manager that match for “vm” and their status:

Here we see that VMware Authentication Framework (+VECS), VMware Certificate Authority and VMware Directory Services are up and running.

Additional commands for Likewise Service Manager daemon include:

Now let’s take a look at the info for the VMware Directory Service using the info command:

Notice that Likewise Service Manager is also aware of any dependencies and will stop / start those as needed.

API WATCHDOG

Next up is the API Watchdog. This watchdog checks the status of APIs for the VPXD service. If the APIs are not running, the API Watchdog will attempt to restart the service two times. If the service restarts do not solve the issue, the API Watchdog will call for the restart of the virtual machine.

During an initial deployment of the VCSA, whether new or an upgrade, the API Watchdog is in a maintenance mode and becomes active only after all of the ‘firstboots’ have completed and all services have come online. Firstboots are the scripts through which vCenter services are injected into the VCSA during the appliance deployment. After that, the watchdog is invoked every 5 minutes to verify that the API for VPXD is accessible. Essentially, this watchdog uses the VIM API to authenticate to VPXD and request the Tag type associated with the rootFolder property on the Folder managed object. If the check fails, the API Watchdog calls the PID Watchdog to restart VPXD. (Note: For a vCenter Server deployment on Windows, you must reboot.)

The PID Watchdog then takes over and performs the following actions in conjunction with the API Watchdog:

  • First Service Failure Action = Restart Service
  • Second Service Failure Action = Restart Service
  • Third Service Failure Action = Reboot Virtual Machine
  • Failure Count Reset = 3600 seconds (1 hour)
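The escalation ladder above can be sketched as a small state machine. This is our own illustration, not the watchdog's code: the class and method names are hypothetical, and we assume the failure count resets once the service has stayed up for the full reset interval since its last (re)start.

```python
class EscalationTracker:
    """Tracks failures of one service and picks the next action."""

    def __init__(self, max_restarts=2, reset_after=3600, started_at=0.0):
        self.max_restarts = max_restarts  # service restarts before a reboot
        self.reset_after = reset_after    # seconds of uptime that clear the count
        self.failures = 0
        self.last_start = started_at      # time of the last (re)start

    def on_failure(self, now):
        """Record a failure at time `now` (seconds) and return the action."""
        # Assumption: surviving the whole reset interval clears the count.
        if now - self.last_start >= self.reset_after:
            self.failures = 0
        self.failures += 1
        self.last_start = now  # the service (or VM) is started again right away
        if self.failures > self.max_restarts:
            return "reboot-virtual-machine"
        return "restart-service"


if __name__ == "__main__":
    tracker = EscalationTracker()
    for t in (10, 20, 30):
        print(t, tracker.on_failure(t))
```

Three failures in quick succession produce restart, restart, reboot. If the third failure instead arrives more than an hour after the second restart, the count starts over and the result is a plain service restart.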

Before a service restart and also before a virtual machine reboot, the API Watchdog generates support bundles for further investigation. These support bundles are stored in /storage/core/*.tgz on the VCSA and in C:\ProgramData\VMware\vCenterServer\data\core\*.tgz on the vCenter Server for Windows form factor.

The API Watchdog is also referred to as “IIAD” (Interservice Interrogation and Activation Daemon). The configuration settings for the API Watchdog are stored in a JSON file named “iiad.json”, which can be found in /etc/vmware/ on the VCSA or at C:\ProgramData\VMware\vCenterServer\cfg\iiad.json on the Windows form factor.
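If you would rather inspect the settings from a script than open the file by hand, a read-only sketch like the one below would do. The helper names are ours, and we assume the file holds a single JSON object of settings; nothing here writes to the file.

```python
import json
from pathlib import Path

IIAD_JSON = Path("/etc/vmware/iiad.json")  # VCSA location named in the text


def parse_iiad_config(text: str) -> dict:
    """Parse the JSON text and make sure it is a flat object of settings."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("expected a JSON object at the top level")
    return config


def read_iiad_config(path: Path = IIAD_JSON) -> dict:
    """Read the configuration file without modifying it."""
    return parse_iiad_config(path.read_text(encoding="utf-8"))


def format_settings(config: dict) -> str:
    """Render the settings as sorted 'name = value' lines."""
    return "\n".join(f"{key} = {json.dumps(config[key])}" for key in sorted(config))


if __name__ == "__main__":
    if IIAD_JSON.exists():
        print(format_settings(read_iiad_config()))
    else:
        print(f"{IIAD_JSON} not found (not running on a VCSA?)")
```

On a machine that is not a VCSA, the script simply reports that the file is missing.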

Let’s take a look at the contents of the iiad.json configuration file:

So what exactly do these parameters mean? Let’s take a look at each of these below:

  • requestTimeout – is the default timeout for requests in seconds.
  • hysteresisCount – allows failures to gradually age off. Every hysteresisCount runs of the API Watchdog, the failure count will be reduced by one.
  • rebootShellCmd – is a user-supplied command to run before rebooting the VM.
  • restartShellCmd – is a user-supplied command to run before restarting a service. It will be passed an argument of the service name.
  • maxTotalFailures – is the number of total failures across all monitored services required before a virtual machine reboot will occur.
  • needShellOnWin – determines whether to run service-control with shell=True on Windows.
  • watchdogDisabled – controls whether the API Watchdog is disabled.
  • watchdogDisabled – controls whether the Watchdog for VPXD is disabled.
  • createSupportBundle – controls whether to create a support bundle when a service restart or VM reboot is needed.
  • automaticServiceRestart – indicates whether to restart services upon failure detection, or merely to log the failure.
  • automaticSystemReboot – indicates whether to reboot the virtual machine when sufficient failures are detected, or merely to log the recommendation.
  • maxSingleRestarts – is the upper limit on the number of times the API Watchdog will attempt to restart a failing service.
  • maxSingleFailures – is the number of failures required to trigger a service restart.
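To see how hysteresisCount, maxSingleFailures and maxSingleRestarts might interact, here is a toy model for a single service. It is a sketch under our own assumptions, not IIAD's actual logic: the class name, the order in which ageing and counting happen, and the reset of the failure count after a restart are all guesses made for illustration, and maxTotalFailures is left out.

```python
class ServiceFailureCounter:
    """Toy model of per-service failure ageing and restart decisions."""

    def __init__(self, hysteresis_count, max_single_failures, max_single_restarts):
        self.hysteresis_count = hysteresis_count
        self.max_single_failures = max_single_failures
        self.max_single_restarts = max_single_restarts
        self.runs = 0
        self.failures = 0
        self.restarts = 0

    def record_run(self, check_passed: bool) -> str:
        """Record one watchdog run and return the action it would trigger."""
        self.runs += 1
        # Age off one failure on every hysteresis_count-th run.
        if self.runs % self.hysteresis_count == 0 and self.failures > 0:
            self.failures -= 1
        if not check_passed:
            self.failures += 1
        if self.failures < self.max_single_failures:
            return "no-action"
        if self.restarts >= self.max_single_restarts:
            return "restart-limit-reached"
        self.restarts += 1
        self.failures = 0  # assumption: a restart clears the failure count
        return "restart-service"


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the logic.
    counter = ServiceFailureCounter(4, 2, 3)
    for passed in (False, True, True, True, False, False):
        print(counter.record_run(passed))
```

With these made-up values, an isolated failure ages off on the fourth run, while two failures in a row on runs five and six trigger a service restart.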

*** Note: The details provided above are for educational purposes only. Do not make changes to any of the above parameters unless instructed to do so by VMware Global Support Services. ***

In addition to generating support bundles before a service restart or a virtual machine reboot, the API Watchdog also logs its activities to /var/log/vmware/iiad/* on the VCSA and to %VMWARE_LOG_DIR%/iiad/* on the Windows form factor.

And there you have it, a little bit of a deep dive on the “out of the box” watchdog functionality for vCenter Server 6.0. By managing and periodically verifying the status of vCenter Server processes with the PID Watchdog and the vCenter Server API with the API Watchdog, we provide better availability for the services and a faster RTO.