vRealize Operations Troubleshooting Powered by Machine Learning

With the release of vRealize Operations 8.0, we continue to deliver on the vision of Self-Driving Operations powered by advanced AI/ML with the introduction of the Troubleshooting Workbench for faster time-to-identify and time-to-remediate issues in your Software Defined Datacenter. In this post, I will review the capabilities and features of the Troubleshooting Workbench.

Start Troubleshooting

First, there are many ways to get to the Troubleshooting Workbench. You can go there directly from the Quick Start page after logging in to vRealize Operations, either by clicking the link in the left-side navigation or from the Troubleshooting card.

Accessing the Troubleshooting Workbench from the Quick Start page

Using this method, you’ll land on the Workbench home page. From here, you can search for an object to begin troubleshooting, continue an active troubleshooting session or use a recent search.

The Troubleshooting Workbench landing page where you can start new workbenches, resume active troubleshooting or leverage recent searches

You can also get to the Troubleshooting Workbench via any object’s detail page. For example, from this MySQL Database application object:

Navigation to Troubleshooting Workbench from an object details page

In that case, you’ll be taken to a new troubleshooting session with the default scope and time range (more on that later).

A Troubleshooting Workbench launched from the object details page. It defaults to level 1 relationships and the last six hours time range.

And finally, you can launch the Workbench from an alert. But before you do, notice that “potential evidence” from the Workbench is displayed in the alert details, so you can do some troubleshooting right away in the alert context.

The Alert details now include a tab for Potential Evidence provided by the Troubleshooting Workbench

When you launch into the Workbench from an alert, the scope is the default setting, but the time range is adjusted to two hours prior to and 30 minutes after the alert trigger.

Workbench time range is set to 2 hours before and 30 minutes after the alert triggered. The scope is still set to the default level 1.
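To make the alert time range concrete, here is a minimal Python sketch of the arithmetic. The function name and the example timestamp are my own illustrations, not part of vRealize Operations; only the two offsets come from the behavior described above.

```python
from datetime import datetime, timedelta
from typing import Tuple

# Offsets taken from the text: 2 hours before and 30 minutes after the trigger.
BEFORE_TRIGGER = timedelta(hours=2)
AFTER_TRIGGER = timedelta(minutes=30)


def alert_time_range(triggered_at: datetime) -> Tuple[datetime, datetime]:
    """Return the (start, end) window around an alert trigger time.

    Hypothetical helper for illustration only; it is not a product API.
    """
    return triggered_at - BEFORE_TRIGGER, triggered_at + AFTER_TRIGGER


# Example: an alert that triggered at 06:00 (made-up timestamp).
start, end = alert_time_range(datetime(2019, 10, 17, 6, 0))
print(start.isoformat(), "to", end.isoformat())
```

The window is anchored on the trigger time, so it stays the same no matter when you open the Workbench.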

No matter how you get to the Workbench and start a session, you now have the power of machine learning providing you evidence and a toolbox of helpful features to speed your troubleshooting analysis.

Potential Evidence

The Workbench opens with a couple of controls for scope and time range, and a display of “potential evidence” for your consideration. The evidence consists of events, property changes and metric abnormalities. I’ll explain each of these categories in more detail later in this blog, but for now let’s cover the basics of the Workbench interface.

The Troubleshooting Workbench main page, Potential Evidence

Using the screenshot above, here’s a key to understanding the Workbench:

1 – Time range. By default, the time range is the last 6 hours. The exception is when you navigate to the Workbench from an alert, which sets the time range from two hours before the alert triggered to 30 minutes after it. You can adjust the time range to anything you like here.

2 – Scope. The scope is the impacted object and its related objects. The default is one level up and one level down. You can adjust the scope using the plus/minus icons or you can click on the level you wish to include. You can also customize the scope to specific relatives as well as peers by using the “CUSTOM” link. Let’s look at how that link works to refine your scope.

Below is the scope before customization.

The default scope is set to All Objects

Clicking the link for custom scope brings up an Advanced Relationship widget. I will collapse the parent objects of the virtual machine first, by clicking on the magenta relationship counter icon.

Filtering out the parent objects

Now with the parent objects removed, I’ll add the peer objects for the virtual machine by hovering over and clicking the pop-up link.

Adding peers of the impacted object

I have my desired custom scope. I click OK to return to the Workbench.

A custom scope with level 1 relationships but filtering parent objects and including peer objects.

Now vRealize Operations will use this scope for gathering Potential Evidence.

The result of custom scope filtering

To remove the custom scope and return to default, simply click the remove filter icon as shown below.

Remove the custom scope filter by clicking the funnel icon
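If it helps to picture what the scope controls are doing, the sketch below models scope as a bounded walk over a relationship graph. Everything here is an assumption for illustration: the inventory, the function names, and the definition of a peer as another object sharing a direct parent. It is not how vRealize Operations stores or resolves relationships.

```python
from typing import Dict, List, Set

# Hypothetical inventory: each object maps to its direct parents.
PARENTS: Dict[str, List[str]] = {
    "vm-01": ["host-01", "datastore-01"],
    "vm-02": ["host-01"],
    "host-01": ["cluster-01"],
    "datastore-01": [],
    "cluster-01": [],
    "mysql-01": ["vm-01"],
}


def invert(parents: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Build a parent -> children map from the child -> parents map."""
    children: Dict[str, List[str]] = {name: [] for name in parents}
    for child, its_parents in parents.items():
        for parent in its_parents:
            children.setdefault(parent, []).append(child)
    return children


def walk(start: str, edges: Dict[str, List[str]], levels: int) -> Set[str]:
    """Collect every object reachable within `levels` hops along `edges`."""
    seen: Set[str] = set()
    frontier = {start}
    for _ in range(levels):
        frontier = {n for cur in frontier for n in edges.get(cur, [])} - seen
        seen |= frontier
    return seen


def scope(obj: str, parents: Dict[str, List[str]], levels: int = 1,
          include_parents: bool = True, include_peers: bool = False) -> Set[str]:
    """Return the impacted object plus the relatives selected by the flags."""
    children = invert(parents)
    result = {obj} | walk(obj, children, levels)
    if include_parents:
        result |= walk(obj, parents, levels)
    if include_peers:
        # Assumption: a peer is any object that shares a direct parent.
        for parent in parents.get(obj, []):
            result |= set(children.get(parent, []))
    return result


# Default scope: one level up and one level down.
print(sorted(scope("vm-01", PARENTS)))
# Custom scope from the walkthrough: parents filtered out, peers added.
print(sorted(scope("vm-01", PARENTS, include_parents=False, include_peers=True)))
```

Raising `levels` corresponds to clicking the plus icon, and the two flags mirror the parent filter and the peer link in the custom scope dialog.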

3 – The active scope is shown in this panel. You can adjust the navigation tree from the default of all objects. For example, you may want to focus on a virtual machine and its related storage infrastructure. In that case you can change the scope tree to vSAN and Storage Devices to limit the Workbench to only those object types.

4 – Potential Evidence. This provides the result of vRealize Operations gathering of events, property changes and anomalous metrics within the scope and time range. You can select any metric for further analysis in the Metrics tab by clicking the push-pin icon. You can also dismiss any evidence you do not find compelling.

Potential Evidence breaks down as follows:

Events are based on changes in metrics using historical data. vRealize Operations learns the usual behavior of every metric it collects and sets “Dynamic Thresholds” for each one based on the time of day and day of the week. Metrics which have breached their Dynamic Threshold will be shown here. Additional consideration is given to show only events with a negative sentiment (e.g. “failure”, “down”, etc.), and the events are ranked by the information level of the symptom (i.e. info, warning, immediate, critical), the current state, the start time and the number of event triggers.

For example, below you can see an event based on a Dynamic Threshold (DT) violation. The metric breached the DT just before 6AM on this chart (the DT is shown as the grey shaded area on the chart).

An event based on a DT violation
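To give a feel for the idea of a threshold that depends on time of day and day of the week, here is a deliberately simple Python sketch. It is not the Dynamic Threshold algorithm used by vRealize Operations, which is not described in this post. I am assuming a plain mean plus or minus `k` standard deviations per (weekday, hour) bucket, and all names and data are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Sample = Tuple[datetime, float]
Band = Tuple[float, float]


def learn_bands(history: List[Sample], k: float = 3.0) -> Dict[Tuple[int, int], Band]:
    """Learn a (low, high) band for each (weekday, hour) bucket.

    Assumption: band = mean +/- k * standard deviation of that bucket.
    """
    buckets: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for ts, value in history:
        buckets[(ts.weekday(), ts.hour)].append(value)
    bands: Dict[Tuple[int, int], Band] = {}
    for key, values in buckets.items():
        centre, spread = mean(values), pstdev(values)
        bands[key] = (centre - k * spread, centre + k * spread)
    return bands


def breaches(samples: List[Sample], bands: Dict[Tuple[int, int], Band]) -> List[Sample]:
    """Return the samples that fall outside the band learned for their bucket."""
    out: List[Sample] = []
    for ts, value in samples:
        band = bands.get((ts.weekday(), ts.hour))
        if band is not None and not (band[0] <= value <= band[1]):
            out.append((ts, value))
    return out


# Four weeks of made-up Monday data: load rises with the hour, with small jitter.
monday = datetime(2019, 9, 2)
history = [
    (monday + timedelta(weeks=w, hours=h), 10.0 + h + (1 if w % 2 else -1))
    for w in range(4)
    for h in range(24)
]
bands = learn_bands(history)

# A reading just before 6 AM on the following Monday, far above the usual level.
probe = monday + timedelta(weeks=4, hours=5, minutes=50)
print(breaches([(probe, 40.0)], bands))
```

The point of bucketing is that a value which is normal at midday can still count as a breach at 6 AM.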

Property Changes is self-explanatory, but you should be aware that property changes for in-scope objects created during the time range will not be shown, to reduce the noise level. Property changes are ranked based on the proximity to the end of the time range and the frequency of changes.

Anomalous Metrics are statistically significant changes detected for all objects in scope during the selected time range. Metrics are considered anomalous if they result in a change of the mean of the datapoints for each window (i.e. not a single, large spike). Detection of anomalous metrics is done by analyzing data points through a sliding window, which is one quarter of the time range.

Example of an anomalous metric with the sliding window analyses overlaid for clarity
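The sliding window idea can be sketched in a few lines of Python. This is only an illustration of comparing window means, under my own assumptions: the score is the largest difference between the means of two adjacent windows, divided by the overall standard deviation, and the cutoff of 1.0 is arbitrary. The only detail taken from the product description is the window size of one quarter of the time range.

```python
from statistics import mean, pstdev
from typing import List


def mean_shift_score(values: List[float], window_fraction: float = 0.25) -> float:
    """Largest change in mean between adjacent sliding windows, scaled by
    the overall standard deviation (an assumed scoring rule)."""
    n = len(values)
    w = max(2, int(n * window_fraction))  # window = 1/4 of the time range
    if n < 2 * w:
        raise ValueError("need at least two full windows of data")
    spread = pstdev(values)
    if spread == 0:
        return 0.0  # a perfectly flat series has no shift
    best = 0.0
    for i in range(n - 2 * w + 1):
        left = mean(values[i:i + w])
        right = mean(values[i + w:i + 2 * w])
        best = max(best, abs(right - left))
    return best / spread


def is_anomalous(values: List[float], cutoff: float = 1.0) -> bool:
    """Flag a sustained shift in the mean; `cutoff` is a hypothetical value."""
    return mean_shift_score(values) > cutoff


# A sustained step change versus a single large spike (made-up data).
step = [10.0] * 24 + [20.0] * 24
spike = [10.0] * 48
spike[24] = 100.0

print(is_anomalous(step), is_anomalous(spike))
```

Because each window averages many data points, one large spike barely moves a window mean, while a sustained step moves it by the full size of the step.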

Note that Anomalous Metrics and DT violations are distinctly different. Anomalous metrics are not based on any historical behavior outside of the time range, and they also represent a significant change in the metric, not a momentary spike, as a DT violation might. Both are helpful in troubleshooting if you understand what they are showing.

5 – Troubleshooting tabs. Once you have reviewed the potential evidence you can dive in and start doing further analysis. I will cover those in the remainder of this blog post.

Alert Analysis

Let’s start our analysis in the Alerts tab. If you have used the main Alerts tab in vRealize Operations, this will look familiar to you. But there are some important differences to note when you use the Alerts tab in the Workbench.

First, the alerts are shown within the time range and scope. These are active alerts for all objects in scope. In addition to the usual Alerts tab controls to filter, group and take actions on alerts and symptoms, you can be more selective about which object’s alerts you wish to have on this tab.

For example, clicking the “Remove All” icon will clear the alerts list.

Remove All alerts from the list

Now, I can click any objects in the scope to add their alerts to the list. Just click once and wait a moment and the alerts will appear.

Adding specific object's alerts to the list

You can go back to the complete list at any time by clicking the Show All icon.

Restore all related object alerts to the list with Show All

If you select an alert and go to its details in the Workbench context, you can pin any metric symptoms to the Metrics tab for further analysis.

Let’s look at the Metrics tab next.

Metric Analysis and Correlation

The Metrics tab works very much like the Metrics tab from any object details page, but if you are new to vRealize Operations here’s a troubleshooting demo that gives you some ideas on how to use the Metrics tab in general (starting with version 7.5).

One key difference here is there’s no Advanced Relationships widget on the Metrics tab. This is because the scope is used to show the related objects and you can choose from the scope to select objects and metric charts.

Select an object from the scope to change to the available metrics for that object

Be aware that the metric charts default to the “Last 7 Days” timeframe, not the Workbench time range, so you may have to adjust the calendar controls on the metric chart toolbar to match the charts up with the time range.

Set the date controls to the same time range for easier viewing

Don’t forget that you can use the Metric Correlation feature of the metric charts to find additional, related metric behavior – this is a very powerful capability when used with the Workbench.

The Metric Correlation feature was released with 7.5 and provides powerful, ML-driven troubleshooting capability

Events Analysis

Just like the Alerts and Metrics tabs, Events works the same way in the Workbench as it does on an object’s details page. Here you’ll see the active events for all objects in the scope, and just as with Alerts, you can clear the event list and select specific objects from the scope and view their events only.

The Events tab in Workbench

You can also select events from the Events subtab or the Timeline subtab and pin the associated metrics to the Metrics tab.

Event metrics can also be pinned to the Metrics tab

The Last Mile in Troubleshooting

You probably already know about vRealize Log Insight and how it integrates with vRealize Operations to provide “Last Mile Troubleshooting” using powerful log analytics with logs from your Software Defined Datacenter, OS, applications, physical hardware and more.

If you have this integration enabled, the Logs tab allows you to browse related logs by opening the vRealize Log Insight user interface to the Interactive Analytics tab and applying a filter based on the selected object from the scope.

Using vRealize Log Insight for troubleshooting.

You can change the filter easily by selecting other objects from the scope. Keep in mind that the filtering only applies to vSphere object types. However, you can easily add your own filters to narrow down the content you need to see.

Your Workbench Powered by Machine Learning

So, hopefully after this introduction to the Troubleshooting Workbench you’re ready to try it for yourself. I think you will agree that having the assistance of the vRealize Operations machine learning engine to surface potential evidence shortens the time required to find issues.

One other use case for the Workbench is verification of resolution. For example, you may have found your issue and then applied some fix to remediate it. How do you really know that fix worked?

You can save the Workbench and return later, using it to verify that the intended fix did indeed work.

Minimize the Workbench to return later