VMware Aria Operations

High Availability for Application Monitoring in VMware Aria Operations

This blog was authored by Tim George.

The January release of VMware Aria Operations gave us an exciting new feature: high availability (HA) for application monitoring! If you are familiar with application monitoring using Telegraf in VMware Aria Operations, you already know that it is highly dependent on Cloud Proxies. Any data collected from endpoints is pushed to VMware Aria Operations through these Cloud Proxies, and the application monitoring ARC adapters are the only adapters that push data from endpoints (all other management packs use the pull method). Previously, these ARC adapters didn't support Collector Groups, so the Cloud Proxy was a single point of failure for application monitoring: if the Cloud Proxy failed, data from the endpoints wouldn't reach Aria Operations. To address this limitation, we added support for application monitoring through Collector Groups. Now, if one Cloud Proxy fails, metrics can still flow through another Cloud Proxy in the Collector Group, making application monitoring highly available.
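To make the difference concrete, here is a toy model of the idea described above. It is only an illustration, not how Aria Operations implements failover: the `CloudProxy` and `CollectorGroup` classes, the `route` method, and the proxy names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CloudProxy:
    """Hypothetical stand-in for a Cloud Proxy; only tracks whether it is up."""
    name: str
    healthy: bool = True


@dataclass
class CollectorGroup:
    """Hypothetical stand-in for a Collector Group of one or more proxies."""
    name: str
    members: List[CloudProxy] = field(default_factory=list)

    def route(self) -> Optional[CloudProxy]:
        """Return the first healthy member, or None if every member is down."""
        for proxy in self.members:
            if proxy.healthy:
                return proxy
        return None


# A single proxy behaves like the old setup: one failure stops the data flow.
single = CollectorGroup("no-ha", [CloudProxy("cp-1")])
single.members[0].healthy = False
print(single.route())  # None: pushed metrics have nowhere to go

# With two members, the second proxy takes over when the first one fails.
ha_group = CollectorGroup("ha", [CloudProxy("cp-1"), CloudProxy("cp-2")])
ha_group.members[0].healthy = False
print(ha_group.route().name)  # cp-2
```

The point of the sketch is simply that a group with more than one member keeps a path open as long as at least one Cloud Proxy is healthy.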

The first item we wanted to take care of was the creation of these Collector Groups. We simplified the experience with a new UI that makes it much easier to add new groups and to enable or disable high availability.

Figure 1: From this new UI, we can enable High Availability (1), set our virtual IP that will be used (2), and check the Cloud Proxies to be added (3).

Once we have added a new Collector Group, we can now filter by these groups when we look at all our Cloud Proxies.  We can group our proxies by Collector Groups and see each of the Cloud Proxies that make up the group or look at only our ungrouped Cloud Proxies.

Figure 2: We can group by Collector Groups or choose no grouping at all.

There is also a mechanism to retry configuration if the members of a Collector Group have changed. That is, whenever a Cloud Proxy is added or removed, we can choose “Retry Cloud Proxy Configuration” from this screen. The same screen also lets us activate or deactivate data persistence.

Figure 3:  Options to activate/deactivate data persistence and retry proxy configuration

To put this into practice, we also need to talk about a few important characteristics of this new feature. The first is that a bootstrap or re-bootstrap of the Telegraf agent is required in order to use HA, because older versions of the agent cannot handle the changes that have been made. This can be done from within Aria Operations by going to “Environment → Applications → Manage Telegraf Agents”. When installing or re-installing these agents, we now get a different pop-up than the one we used to see. This pop-up allows us to install the agent and assign it to either a Cloud Proxy or a Collector Group, depending on whether we want high availability. If we do, we select the high availability radio button and then select the Collector Group we wish to assign the agent to.

Once we have set our configuration, it takes some time to place the agents in the Collector Group and create the relationships needed for failover. During a failover event, it can take up to three collection cycles for data collection to resume. Most of the time this will be a quick transition, but even with data persistence enabled, up to one collection cycle of metrics might be lost.
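To get a feel for what these limits mean in minutes, here is a small back-of-the-envelope sketch. The two cycle counts come from the paragraph above; the function name `failover_bounds` and the 5-minute collection interval are assumptions for illustration, so substitute the interval used in your own environment. The post gives no figure for metric loss when data persistence is off, so the sketch returns `None` in that case instead of guessing.

```python
from typing import Optional, Tuple

MAX_CYCLES_TO_RESUME = 3              # "up to three collection cycles" to resume
MAX_CYCLES_LOST_WITH_PERSISTENCE = 1  # at most one cycle lost with persistence on


def failover_bounds(
    interval_minutes: float, persistence_enabled: bool
) -> Tuple[float, Optional[float]]:
    """Return (worst-case minutes until collection resumes,
    worst-case minutes of lost metrics, or None when no figure is stated)."""
    max_resume = MAX_CYCLES_TO_RESUME * interval_minutes
    if persistence_enabled:
        max_lost = MAX_CYCLES_LOST_WITH_PERSISTENCE * interval_minutes
    else:
        max_lost = None  # not stated in the post, so we do not guess
    return max_resume, max_lost


# Assumed 5-minute collection interval, data persistence enabled.
resume, lost = failover_bounds(5, persistence_enabled=True)
print(f"Collection resumes within {resume} minutes; at most {lost} minutes of metrics lost")
```

These are upper bounds only; as noted above, most failovers should complete well inside them.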
 
Adding high availability gives us one more tool in our already brimming toolbox for monitoring applications with Aria Operations, helping us minimize downtime and continue providing services to our end users without interruption. Check out the Technical Overview: VMware Aria Operations – January for more features in this release.