By: Michael Steinberg and Pierre Moncassin
Deploying monitoring tools effectively is both a science and an art. Monitoring provides vast amounts of data, but we also want to filter the truly useful information out of these data streams – and that can be a challenge. We know how important it is to set trigger points to get the most out of metrics. But deciding where exactly to set those points is a balancing act.
We all know this from daily experience. Think car alarms: if limits are set too tight, an alarm can go off without a serious cause. People get used to such alarms, and they become noise. On the other hand, if limits are too loose, important events (like an actual break-in) are missed, which reduces the value of the service that the alarm’s supposed to deliver.
Based on my conversations with customers, vCOps’ out-of-the-box default settings tend to be on the tight side, sometimes resulting in more alerts than are useful.
So how do you make sure that you get the useful alerts but not the noise? I’ve found that assigning Key Performance Indicators (KPIs) to each VM is the best way to filter the noise out. So this post offers some tips on how to optimally use KPIs.
First, Though, a Quick Refresher on KPIs
By default, vCOps collects data for all metrics every five minutes. As part of its normal operations, vCOps applies statistical algorithms to that data to detect anomalies in performance – KPIs are outputs from those algorithmic measurements.
Within vCOps, a metric is identified as a KPI when its level has a clear impact on infrastructure or application health. When a KPI metric is breached, the object it is assigned to will see its health score impacted.
A KPI breach can be triggered in the following ways:
- The underlying metric exceeds a given value (Classic Threshold).
- The underlying metric is less than a given value (Classic Threshold).
- The underlying metric goes anomalous. This is a unique capability of vCOps where a ‘normal’ range is automatically calculated so that abnormal values can be detected.
Typically, you would use one of these three options when setting a threshold, but combinations are also allowed. For example, you may want a classic threshold that triggers when disk utilization exceeds a certain percentage, combined with a dynamic threshold that triggers an alert if CPU utilization goes above its monthly average by more than x%.
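The three trigger types can be sketched in a few lines of Python. This is only an illustration, not vCOps code: the function names are hypothetical, and the ‘normal’ range is assumed here to be the historical mean plus or minus k standard deviations, which is a simple stand-in for vCOps’ own algorithms.

```python
from statistics import mean, stdev

def breaches_upper(value, limit):
    """Classic threshold: breach when the metric exceeds a fixed limit."""
    return value > limit

def breaches_lower(value, limit):
    """Classic threshold: breach when the metric falls below a fixed limit."""
    return value < limit

def is_anomalous(value, history, k=3.0):
    """Assumed dynamic threshold: value lies outside mean +/- k * stdev.

    `history` needs at least two samples for stdev() to work.
    """
    mu = mean(history)
    sigma = stdev(history)
    return abs(value - mu) > k * sigma

def kpi_breached(value, history, upper=None, lower=None, use_dynamic=False):
    """Combine whichever checks are configured; any single hit is a breach."""
    checks = []
    if upper is not None:
        checks.append(breaches_upper(value, upper))
    if lower is not None:
        checks.append(breaches_lower(value, lower))
    if use_dynamic:
        checks.append(is_anomalous(value, history))
    return any(checks)
```

With a history that hovers around 40%, a reading of 55% stays under a classic 90% limit but still counts as a breach once the dynamic check is switched on. That is the practical difference between the two approaches.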
Tips for Optimizing KPIs
KPIs provide the granular information that makes up the overall health score of a component in the infrastructure, such as an application. The overall health score is a combination of statistics for Workload, Anomalies, and Faults.
Overly-sensitive KPI metrics, however, can cause health scores to decrease when there isn’t an underlying issue. In such instances, we need to optimize the configuration of vCOps so that the impact of anomalous metrics on health scores is mitigated.
Here are some ideas for how to do that:
Tip 1 – Focus on Metrics that Truly Impact Infrastructure Health
First, it’s good to limit the number of metrics you put in place.
With too many metrics, you’re likely to have too many alerts – and then you’re still in a situation analogous to having car alarms going off too often to be noticed.
Remember, overall health scores are impacted by any metric that moves outside its ‘normal’ range. vCOps calculates the ‘normal’ range based on historical data and its own algorithms.
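A rough back-of-the-envelope calculation shows why the number of metrics matters. The sketch below assumes, purely for illustration, that each metric independently has the same small chance of falling outside its ‘normal’ range in a given collection interval. Real metrics are correlated and vCOps does not work this way internally, so treat the numbers as indicative only.

```python
def chance_of_any_alert(metric_count, p_outside=0.01):
    """Probability that at least one of `metric_count` metrics is out of range.

    Assumes each metric is independent and has probability `p_outside`
    of leaving its normal range in one collection interval.
    """
    return 1 - (1 - p_outside) ** metric_count

# Compare a focused set of metrics with a broad one.
for count in (5, 20, 100):
    print(count, round(chance_of_any_alert(count), 3))
```

Under these assumptions, 100 metrics with a 1% chance each give roughly a 63% chance of at least one alert per interval, which is the car-alarm problem in numeric form.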
Tip 2 – Define KPI Metrics that will Trigger Important Alerts
Next, you want the alerts that you do define to be significant. These are the alerts that impact objects important to business users.
For example, you could have a business application with a key dependency on a database tier. An issue with the database or its performance would thus impact the user community immediately. You’d therefore want to define KPIs on the set of metrics that most closely monitor that database’s infrastructure.
Tip 3 – Use KPIs Across All Infrastructure Levels
In order to see the maximum benefit of KPI metrics, each metric should be assigned to the individual virtual infrastructure object (i.e. Virtual Machine), as well as any Tiers or Applications that the Virtual Machine relates to.
This is an important step as it makes the connection between the VM metrics and the application it relates to. For example, it may not be significant in itself that a VM is over-utilized (CPU usage over threshold), but it becomes important if the application it supports is impacted.
Example
Let’s assume a customer has a series of database VM servers that are used for various applications. The VM, Tier and Application assignments are illustrated below in the table.
| VM      | Tier | Application |
| orasrv1 | DB   | WebApp1     |
| orasrv2 | DB   | CRMApp1     |
| orasrv3 | DB   | SvcDesk1    |
The application team has specified that the CPU Utilization for these VMs should not exceed 90% over three collection intervals (15 minutes). Therefore, our KPI metric is CPU Utilization %.
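To make the rule concrete, here is a minimal sketch of a “90% over three collection intervals” check. It assumes one CPU sample per five-minute collection interval and reads the requirement as three consecutive samples above the limit. The function name and parameters are hypothetical and are not part of vCOps.

```python
def sustained_breach(samples, limit=90.0, intervals=3):
    """Return True if `intervals` consecutive samples exceed `limit`.

    `samples` is a list of CPU utilization percentages, one per
    five-minute collection interval, oldest first.
    """
    run = 0  # length of the current streak of over-limit samples
    for value in samples:
        run = run + 1 if value > limit else 0
        if run >= intervals:
            return True
    return False
```

A single spike above 90% resets as soon as the next sample drops back, so only a sustained 15-minute overload counts as a breach. This is one way to keep short bursts from turning into noise.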
The KPI metric is assigned to all of the resources identified in the table above. Each VM has the KPI assigned to it. The DB Tier within each Application also has the KPI assigned to it. For example, the DB tier within the WebApp1 application is assigned a KPI for the orasrv1 VM. Finally, each Application also has the KPI assigned to it. For example, the WebApp1 application is assigned a KPI for the orasrv1 VM.
With these assignments, health scores for the VMs, Tiers and Applications will all be impacted when the CPU Utilization for the respective VM is over 90% for 15 minutes. Virtualization administrators can then accurately tell application administrators when their Application health is being impacted by a KPI metric.
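The assignments can also be pictured as a simple lookup from each VM to the Tier and Application it belongs to. The sketch below uses the three VMs from the table. The data structure and function name are illustrative assumptions, not how vCOps stores its configuration.

```python
# (VM, Tier, Application) rows taken from the table above.
ASSIGNMENTS = [
    ("orasrv1", "DB", "WebApp1"),
    ("orasrv2", "DB", "CRMApp1"),
    ("orasrv3", "DB", "SvcDesk1"),
]

def impacted_objects(breaching_vms, assignments=ASSIGNMENTS):
    """List every object whose health score is affected by a VM KPI breach.

    For each breaching VM this returns the VM itself, the tier within
    its application (written as "Application/Tier"), and the application.
    """
    impacted = []
    for vm, tier, app in assignments:
        if vm in breaching_vms:
            impacted.extend([vm, f"{app}/{tier}", app])
    return impacted

# Example: only orasrv1 is over 90% CPU for 15 minutes.
print(impacted_objects({"orasrv1"}))
```

Because the KPI is attached at all three levels, a breach on orasrv1 surfaces on the VM, on the DB tier of WebApp1, and on WebApp1 itself, while CRMApp1 and SvcDesk1 remain unaffected.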
Take-Away
When it comes to KPI alerts, there are three steps you can take to help “filter the noise” in vCOps.
1) Focus on a small number of metrics that truly impact infrastructure health.
2) Define KPI metrics that will trigger the important alerts.
3) Set up these KPI metrics consistently across infrastructure levels (e.g., VM, DB tier, Application), so that issues are not missed at any particular level.