
Automating App Monitoring

Do you want to offer vRealize Automation Service Broker users the option to have their applications monitored by vRealize Operations automatically?

vRealize Operations Application Monitoring provides guest OS and application service metrics in context with the virtual machines and other resources related to your business applications. This requires a Telegraf agent to be installed on the virtual machine. To make it easier for administrators to deploy and manage agents, vRealize Operations provides full lifecycle management of the Telegraf agent, including initial install, upgrades, plugin configuration and even removal of the agent. This is available in the vRealize Operations user interface, and there are also APIs for agent management.
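Every API-driven step that follows starts with an authenticated session against vRealize Operations. The sketch below shows what that looks like from Python. It assumes the Suite API token endpoint (`/suite-api/api/auth/token/acquire`) and the `vRealizeOpsToken` authorization scheme, and it deliberately leaves out the agent-management endpoints themselves. The host and credentials are placeholders. TLS handling for self-signed certificates is also left out.

```python
import json
import urllib.request


def build_token_request(host: str, username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the Suite API token-acquire endpoint."""
    body = json.dumps({"username": username, "password": password}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{host}/suite-api/api/auth/token/acquire",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def parse_token(response_body: bytes) -> str:
    """Extract the token string from the JSON response body."""
    return json.loads(response_body.decode("utf-8"))["token"]


def auth_headers(token: str) -> dict:
    """Headers for later Suite API calls, using the vRealizeOpsToken scheme."""
    return {"Authorization": f"vRealizeOpsToken {token}", "Accept": "application/json"}


def acquire_token(host: str, username: str, password: str) -> str:
    """Send the request and return the token (no custom TLS context is configured here)."""
    with urllib.request.urlopen(build_token_request(host, username, password)) as resp:
        return parse_token(resp.read())
```

Keeping request construction and response parsing separate from the network call makes both easy to test without a live vRealize Operations instance.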

In this blog, I will explain a vRealize Orchestrator workflow that I created to automate the installation and configuration of the agent. I will also show how I have implemented this workflow as an ABX event subscription to fully onboard a new vRealize Automation deployment into vRealize Operations, including application monitoring if desired.

The workflow package is available here or here if you would like to customize it for your own use. If you are not interested in the details, you can find installation and configuration instructions in the repo readme.

My Use Case

In our demo environment, we have a shopping cart application that we deploy and then manage with vRealize Automation, vRealize Operations and vRealize Network Insight. It is a simple two-tier application with a web server and a database server. (There is also a load-generating server deployed to create network traffic and resource usage, but it isn’t strictly part of the application; it just simulates usage.)

While vRealize Operations automatically discovers and monitors the virtual machines, I wanted to provide a way for the requestor to optionally monitor the guest OS and application services using Application Monitoring in vRealize Operations. This involves installing and configuring the Telegraf agent in the virtual machine's guest OS, which can easily be done via the vRealize Operations UI. But what if the requestor doesn’t have access to vRealize Operations?


This workflow automates the end-to-end installation of the Telegraf agent through the vRealize Operations REST APIs, so application service metrics become available without any manual intervention.

Notice in the screenshot that the deployment is named “Autoscale Web Demo”. In a future blog, I will show how I use the solution below with a new workflow to automatically scale the web tier in and out as demand changes.

Overview of the Workflow

The workflow consists of three main tasks:

  • First, the virtual machine must be monitored by vRealize Operations, because an agent cannot be installed until the virtual machine has been discovered.
  • Next, the agent bootstrap is started and monitored to ensure there are no problems.
  • After bootstrap, application services are discovered by the agent and reported to vRealize Operations. At this point, application plugins can be configured.
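The three tasks run strictly in sequence, and a failure in one should stop the rest. Here is a minimal Python sketch of that control flow. The stage functions are hypothetical stubs standing in for the real workflow elements. Each stub returns the keys it adds to a shared state dictionary, loosely like workflow variables passed between elements.

```python
from typing import Callable, Dict, List, Tuple

# A stage takes the accumulated state and returns new keys to merge into it.
Stage = Tuple[str, Callable[[Dict], Dict]]


def run_stages(stages: List[Stage], state: Dict) -> Dict:
    """Run stages in order; an exception in any stage stops the run with the stage name attached."""
    for name, stage in stages:
        try:
            state = {**state, **stage(state)}
        except Exception as exc:
            raise RuntimeError(f"stage '{name}' failed: {exc}") from exc
    return state


# HYPOTHETICAL stubs for the three tasks; real versions would call vRealize Operations.
def verify_resource(state: Dict) -> Dict:
    return {"resource_id": f"vrops-{state['vm_uuid']}"}


def bootstrap_agent(state: Dict) -> Dict:
    return {"agent_status": "SUCCESS"}


def configure_plugins(state: Dict) -> Dict:
    return {"plugins_configured": []}


PIPELINE: List[Stage] = [
    ("verify resource", verify_resource),
    ("bootstrap agent", bootstrap_agent),
    ("configure plugins", configure_plugins),
]
```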

The workflow takes a single input: the input parameters from the ABX subscription. If you are new to ABX in vRealize Automation, I recommend this blog for an overview. In my example, I kick off the agent installation after each virtual machine has been deployed, so I am using the “Compute Post Provision” event topic. The parameters in that topic provide my workflow with everything needed to begin the install and configuration process, most importantly the virtual machine name and its UUID in vCenter.

The other required information, such as the vRealize Operations address and OS credentials, is stored in a configuration element.

These values are bound to the workflow variables.

If you take the time to examine the workflow elements, you will notice I used JavaScript for the script elements and Python for the actions. There is absolutely no technical reason for this; I just wanted to play around with the polyglot feature of vRealize Orchestrator! I have included the Python scripts in the repo for those interested.

Verify Resource in vRealize Operations

Before an agent can be installed, vRealize Operations must have the virtual machine in its inventory. So, the first task is to verify that.


The first order of business is to parse the input parameters from vRealize Automation. This is trivial, since we only need two parameters here: the virtual machine UUID and name. These are used in the subsequent steps to verify that the new virtual machine is available in vRealize Operations.
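Pulling the two values out of the event payload takes only a few lines. The sketch below assumes the virtual machine name arrives in a `resourceNames` list and the vCenter UUID in an `externalIds` list, each with one entry per virtual machine. Those key names are assumptions, so check them against the payload your subscription actually delivers.

```python
from typing import Dict, Tuple


def parse_vm_identity(inputs: Dict) -> Tuple[str, str]:
    """Return (vm_name, vm_uuid) from the event payload.

    ASSUMPTION: the name is in `resourceNames` and the vCenter UUID is in
    `externalIds`, both single-element lists for one virtual machine.
    """
    names = inputs.get("resourceNames") or []
    ids = inputs.get("externalIds") or []
    if not names or not ids:
        raise ValueError("event payload is missing resourceNames or externalIds")
    return names[0], ids[0]
```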

When a new virtual machine is deployed, vRealize Operations discovers it in the vCenter inventory and begins monitoring it. This takes roughly 10 minutes, because vRealize Operations must first collect the new inventory and then, on the next collection cycle, begin collecting metrics for any new objects. By default, vRealize Operations collection cycles are 5 minutes. The workflow therefore uses a sleep routine and a counter to check for the new virtual machine in vRealize Operations every 5 minutes. If it isn’t found after 15 minutes, the workflow throws an exception and exits.
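The sleep-and-counter pattern can be sketched as a small polling helper with these defaults: wait 5 minutes before each check and give up after three checks (15 minutes). `find_resource` is a hypothetical callable that looks the virtual machine up in vRealize Operations and returns its resource ID, or `None` if the VM is not there yet. The `sleep` function is injectable so the loop can be exercised without actually waiting.

```python
import time
from typing import Callable, Optional


def wait_for_resource(
    find_resource: Callable[[], Optional[str]],
    sleep_time: float = 300,   # seconds; plays the role of the sleepTime variable
    max_checks: int = 3,       # 3 checks x 5 minutes = give up after 15 minutes
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Sleep, then look for the VM; repeat up to max_checks times before raising."""
    for _ in range(max_checks):
        # Wait first: a brand-new VM is never in vROps immediately after provisioning.
        sleep(sleep_time)
        resource_id = find_resource()
        if resource_id:
            return resource_id
    raise TimeoutError(
        f"VM not found in vRealize Operations after {max_checks * sleep_time:.0f} seconds"
    )
```

For example, `wait_for_resource(iter([None, None, "vm-9"]).__next__, sleep=lambda s: None)` simulates a VM that appears on the third check.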

The sleep timer value is set as a global variable (sleepTime) if you want to change it (for example, if you use different collection intervals).

Bootstrap the Telegraf Agent

For vRealize Operations Application Monitoring, the agent is installed in a process known as “bootstrapping”, which involves deploying a SaltStack minion in the guest OS. For this reason, we need a password-less sudoer credential. In my workflow, I simply provide this via the configuration element covered above. There are certainly other (and better) ways to do this, depending on your needs (such as retrieving it from a secret store like the vRealize Suite Lifecycle Manager Locker service), but for my example it works fine.


Kicking off the bootstrap is easy, thanks to the APIs available in vRealize Operations. The bootstrap typically takes 5-10 minutes to complete, so the sleep timer is employed here as well. Note that I also include a counter here, even though the bootstrap reports a failure status that I can use for error handling in the workflow. I have seen cases where the bootstrap gets hung in a continuous loop, so the counter throws an error if the bootstrap is taking much longer than it should. It is a rare occurrence, but it can happen.
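The bootstrap loop has two exits for errors: the failure status reported by vRealize Operations, and the counter that catches a hung bootstrap. A sketch of both is below. `get_status` is a hypothetical callable returning the current bootstrap status, and the status strings are placeholders, not the actual values the API returns.

```python
import time
from typing import Callable

# HYPOTHETICAL status values; substitute whatever your bootstrap status call returns.
DONE_STATUSES = {"SUCCESS"}
FAILED_STATUSES = {"FAILED"}


def wait_for_bootstrap(
    get_status: Callable[[], str],
    sleep_time: float = 300,
    max_checks: int = 4,      # 20 minutes, well beyond the usual 5-10 minute bootstrap
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the bootstrap status until it succeeds, fails, or runs out of checks."""
    for _ in range(max_checks):
        sleep(sleep_time)
        status = get_status()
        if status in DONE_STATUSES:
            return status
        if status in FAILED_STATUSES:
            # The reported failure status covers ordinary bootstrap errors...
            raise RuntimeError(f"agent bootstrap failed: {status}")
    # ...while the check counter catches a bootstrap stuck in a loop.
    raise TimeoutError(f"agent bootstrap still running after {max_checks} checks")
```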

Configure Application Plugins

Once the agent has been installed successfully, Application Monitoring will begin to collect OS metrics and also look for any supported application services running on the guest OS. As with other steps, this takes a collection cycle before any supported services are available for configuration.

As such, another sleep is added. After the sleep timer expires, the getDiscoveredApps action checks whether any apps have been discovered, and a Decision element evaluates whether there are more than six applications. By default, there are six application plugins available for an agent object. These are the standard custom plugins (TCP check, HTTP check, script monitor, etc.). If the workflow finds more than that, it is a sign that actual application services have been discovered.
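The Decision element's test reduces to a count comparison against the six generic plugins. A Python version of that check follows. The `plugins` list is a stand-in for whatever getDiscoveredApps returns, and this sketch only counts the entries without inspecting them.

```python
from typing import List

# Generic plugins every agent object offers (TCP check, HTTP check, script monitor, ...).
DEFAULT_PLUGIN_COUNT = 6


def apps_discovered(plugins: List[str]) -> bool:
    """Mirror of the Decision element: more than the generic plugins means real services were found."""
    return len(plugins) > DEFAULT_PLUGIN_COUNT
```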


I must admit, there are some potential gaps here to be aware of and you may wish to modify the workflow. Allow me to explain.

In my use case, I am only installing the agent on virtual machines where I fully expect application services to be discovered. I will discuss this later when I cover how the ABX subscription is configured. So, I am not handling a generic situation where there may not actually be any supported services running on the guest OS (for example, if I just want to monitor the guest OS metrics via Telegraf).

If you decide that you want to use this for installing on any virtual machine, you should take that into account and modify the workflow to handle that case appropriately (e.g. breaking out of the app discovery loop after an adequate amount of time). I will let you consider this a homework assignment!
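One possible answer to the homework is sketched below as a variation. It is not what the published workflow does. The idea is to treat "no services found" as a normal outcome instead of an error: after a bounded number of checks, return an empty list so the workflow can skip plugin configuration and leave the agent reporting OS metrics only. `list_plugins` is a hypothetical callable returning the agent's current plugin list.

```python
import time
from typing import Callable, List


def wait_for_app_discovery(
    list_plugins: Callable[[], List[str]],
    default_count: int = 6,   # generic plugins present on every agent
    sleep_time: float = 300,
    max_checks: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Return the plugin list once real services appear, or [] if none show up in time."""
    for _ in range(max_checks):
        sleep(sleep_time)
        plugins = list_plugins()
        if len(plugins) > default_count:
            return plugins
    # Not an error: the agent still collects guest OS metrics without app plugins.
    return []
```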

Since I am expecting certain services to be discovered (in my case Apache web server and MySQL) I am only handling those cases in the switch element. The way I am doing this is not very clever, I must admit. I am only evaluating the virtual machine name to determine if I should continue a path of configuring an agent plugin. Ideally, a virtual machine tag or other attribute could be used.

Once it is determined that a plugin should be configured, the workflow verifies that the expected application service has been discovered and then executes the configuration.
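The switch-then-verify logic can be sketched as follows. The name matching mirrors the approach described above. The optional tag lookup is the "ideally" alternative, added here as a suggestion. The name fragments (`web`, `db`), the `app` tag key, and the service strings are all hypothetical and should be adapted to your own naming conventions and plugin names.

```python
from typing import Dict, List, Optional

# HYPOTHETICAL mapping from a VM-name fragment to the application service we expect.
EXPECTED_SERVICE_BY_NAME = {
    "web": "apache",
    "db": "mysql",
}


def expected_service(vm_name: str, tags: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Prefer an explicit tag such as app=mysql; otherwise fall back to matching the VM name."""
    if tags and "app" in tags:
        return tags["app"]
    lowered = vm_name.lower()
    for fragment, service in EXPECTED_SERVICE_BY_NAME.items():
        if fragment in lowered:
            return service
    return None


def plugin_to_configure(
    vm_name: str, discovered: List[str], tags: Optional[Dict[str, str]] = None
) -> Optional[str]:
    """Return the service to configure only if it is expected AND the agent has discovered it."""
    service = expected_service(vm_name, tags)
    if service and any(service in plugin.lower() for plugin in discovered):
        return service
    return None
```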

Within another 5-10 minutes, application metrics will start showing up in vRealize Operations.

Configuration of the ABX Subscription

The subscription in vRealize Automation extensibility is configured to meet the desired outcome of my use case: allowing a requestor to check an option on the catalog form to enable application monitoring for a deployment. Checking the option sets the custom property “agent = telegraf” on the virtual machines, and this property is used as the condition to trigger the subscription.
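In the subscription itself, the condition is an expression along the lines of `event.data.customProperties['agent'] == 'telegraf'`. Verify the exact path against your event payload. The Python function below is an equivalent of that check, useful for testing the logic outside vRealize Automation. It assumes the event data carries a `customProperties` dictionary and, like the subscription condition, compares the value exactly.

```python
from typing import Any, Dict


def should_install_agent(event_data: Dict[str, Any]) -> bool:
    """Equivalent of the subscription condition: only VMs with agent=telegraf qualify."""
    props = event_data.get("customProperties") or {}
    return props.get("agent") == "telegraf"
```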


The event topic is Compute Post Provision, and the subscription does not block the deployment. Blocking would add significant time to the deployment, and honestly, I do not think it is needed: there is no reason to hold up a deployment just because its application is not yet being monitored.

Setting a custom property and configuring inputs for a cloud template is beyond the scope of this post. But here is how I decided to implement it in the request form.

If the requester ticks the option to enable OS monitoring, the custom property is applied to the virtual machines, triggering the agent installation event subscription.

Try For Yourself

Everything you need is included in a repo that I’ve also shared on VMware Code Sample Exchange. The readme contains exhaustive instructions on setup and configuration, but if you run into a problem just comment here or contact me on Twitter @johnddias. Feedback and suggestions are welcome!