
The Lowly Metric Has Its Day in the Sun

By Rich Benoit

Back in the day, I would have killed for a tool like vCOps, an analytics tool that uses dynamic thresholds to make sense of the myriad activity metrics in an IT environment. Without dynamic thresholds that identify normal behavior, admins like me were forced to rely on static thresholds that never seemed to work quite right. Static thresholds tended to be set either too low, resulting in false positives, or too high, so that by the time they were tripped, the support desk had already started receiving calls from disgruntled users.

Tried, but Failed

  • One approach I tried in order to make sense of the cloud of data coming from multiple monitoring tools was to combine several metrics to get a more holistic view. But combined metrics still rely on static thresholds and are similarly plagued with false positives, and they introduce the additional problem of having to figure out which of the underlying metrics actually caused the alarm to trip.
  • Another approach I tried was end-user experience monitoring, or end-to-end application monitoring. Instead of trying to estimate the performance of an application by looking at the sum of all of its components, I could look at the simulated response time for a typical user and transaction. Another end-to-end tactic was to employ passive application sniffers that recorded the response time of real transactions. But with both approaches, I was still dependent on static hard thresholds that were invariably exceeded. For example, it wouldn't be unusual for an application to exceed its 2-second response time goal during regular periods of peak usage. So I had to know when it was normal to exceed the allowed threshold. In other words, I had to know when to ignore the alarms (see the sketch after this list).
  • Static thresholds also got in the way of performance troubleshooting. Other admins would ask, “Did this just start?” or “Is the performance issue the result of a change in the environment?” The monitoring tools wouldn’t provide the needed data, so we would have to roll up our sleeves and try to figure out what had happened. Meanwhile, the system would be down or just struggling along. Many times the problem would go away after a certain amount of time or after a reboot, only to resurface another day.
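
To make the false-alarm problem concrete, here is a minimal sketch of a static threshold in action. The 2-second goal comes from the example above; the hourly numbers are invented purely for illustration:

```python
# Illustrative only: a fixed 2-second response-time threshold fires during
# a perfectly normal Monday-morning peak. The sample data is made up.
STATIC_THRESHOLD_S = 2.0

# (hour of day, observed response time in seconds) for a typical Monday
samples = [(7, 1.1), (8, 1.8), (9, 3.2), (10, 3.0), (11, 2.4), (12, 1.6)]

for hour, response_time in samples:
    if response_time > STATIC_THRESHOLD_S:
        # Every peak-hour sample trips the alarm, so the alarm gets ignored.
        print(f"{hour:02d}:00  ALERT  response time {response_time:.1f}s "
              f"exceeds static threshold of {STATIC_THRESHOLD_S}s")
```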

In the end, except for a few cases, we just turned off the monitors and alarms.

A Better Approach

That is why I would have killed for vCOps. vCenter Operations Management Suite is built on an open and extensible platform that works with both physical and virtual machines. It is a single solution that works with a variety of hypervisors and fits either on-premises or public cloud environments.

It collects and stores metrics over time and works behind the scenes to establish dynamic thresholds. It employs around 18 different algorithms that compete to best fit any one of the millions of metrics it can track. Some algorithms are based on time intervals and others on mathematical models.
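
vCOps’ actual algorithms aren’t spelled out here, but the idea of models competing to fit a metric can be illustrated with a deliberately simplified sketch. Two toy baselines, a flat band and an hour-of-day band, each try to describe a metric’s history, and whichever fits with the tighter, better-fitting band supplies the dynamic thresholds. The scoring rule and sample history below are invented for illustration:

```python
# Simplified sketch of "competing models" -- not vCOps' actual math.
from collections import defaultdict
from statistics import mean, pstdev

def fit_flat(history):
    """One band for all hours: mean +/- 2 standard deviations."""
    values = [v for _, v in history]
    mu, sigma = mean(values), pstdev(values)
    return lambda hour: (mu - 2 * sigma, mu + 2 * sigma)

def fit_hourly(history):
    """A separate band per hour of day, falling back to the flat band."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    bands = {h: (mean(vs) - 2 * pstdev(vs), mean(vs) + 2 * pstdev(vs))
             for h, vs in buckets.items()}
    flat = fit_flat(history)
    return lambda hour: bands.get(hour, flat(hour))

def score(model, history):
    """Lower is better: points outside the band are heavily penalized, and a
    small width penalty keeps a "covers everything" band from winning."""
    outside = sum(1 for h, v in history if not model(h)[0] <= v <= model(h)[1])
    avg_width = mean(model(h)[1] - model(h)[0] for h in {h for h, _ in history})
    return outside * 10 + avg_width

def best_model(history):
    return min((fit_flat(history), fit_hourly(history)),
               key=lambda m: score(m, history))

# Made-up history: response times run higher at 9am than at 3am.
history = [(3, 0.6), (3, 0.7), (3, 0.5), (9, 3.1), (9, 2.9), (9, 3.3)]
model = best_model(history)
print("normal band at 9am:", model(9))   # the hourly model wins here
```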

With vCOps I can now designate specific metrics as KPIs for additional granularity. For example, the tool would learn that it is normal for response times to be in the 2-to-4-second range on Monday mornings, and if the response time falls outside that normal range, above or below, I can have a KPI Smart Alert generated.
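
Here is a minimal sketch of the kind of check this enables, assuming a learned normal range per day and hour. The ranges, function name, and sample values below are invented for illustration; in practice the ranges would come from the dynamic-threshold engine:

```python
# Illustrative sketch: alert when a KPI leaves its learned normal range,
# whether above OR below it. The ranges below are made up.
learned_range = {
    ("Mon", 9): (2.0, 4.0),   # Monday 9am: 2-4s response time is normal
    ("Mon", 3): (0.5, 1.0),   # Monday 3am: traffic is light, pages are fast
}

def kpi_smart_alert(day, hour, response_time_s):
    low, high = learned_range.get((day, hour), (0.0, float("inf")))
    if response_time_s > high:
        return f"KPI alert: {response_time_s}s is above normal ({low}-{high}s)"
    if response_time_s < low:
        # Unusually *fast* responses can also signal trouble, e.g. errors
        # being returned instead of real pages.
        return f"KPI alert: {response_time_s}s is below normal ({low}-{high}s)"
    return None

print(kpi_smart_alert("Mon", 9, 5.5))  # above the normal band -> alert
print(kpi_smart_alert("Mon", 9, 3.0))  # within the normal band -> None
```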

I can also use the Early Warning Smart Alert, which detects change in the environment when too many anomalies occur, such as when an unusual number of metrics are outside their normal operating range. And I can use the various dashboards and detail screens to view the metrics over time, so that instead of wondering whether the issue is the result of a capacity trend or of something changing or breaking, I can look and quickly see, “Oh, there’s the problem. Something happened at 1:15 on system X that caused this service to really slow down.”
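
The idea behind that kind of early warning can be sketched very simply: look at how many metrics are currently outside their normal ranges, and flag the environment when that count itself is far above what is typical. The function name and numbers below are invented for illustration:

```python
# Illustrative sketch of an early-warning check based on anomaly counts.
def early_warning(metric_states, typical_anomaly_count=3, multiplier=3):
    """metric_states maps metric name -> True if outside its normal range."""
    anomalies = [name for name, abnormal in metric_states.items() if abnormal]
    if len(anomalies) > typical_anomaly_count * multiplier:
        return (f"Early warning: {len(anomalies)} metrics abnormal, "
                f"e.g. {anomalies[:5]}")
    return None

# Made-up snapshot: 25 of 100 metrics on a host are out of range at once.
states = {f"vm42.metric{i}": (i % 4 == 0) for i in range(100)}
print(early_warning(states))
```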

Now, after more than 20 years in IT, I can finally start to use the multitude of metrics that have been there just waiting to be leveraged.

To get the most out of monitoring tools, consider using vCOps’ range of capabilities, including:

  • The ability to track KPIs within the infrastructure, such as Disk I/O or CPU Ready, or to leverage the vSphere UI, so that you know whether your infrastructure has additional capacity.
  • Various KPI Super Metrics within the application stack (e.g., cache hit rate or available memory) that alert you when things are outside of a normal range (see the sketch after this list).
  • The power to see exactly how an environment is performing on a given day, and the ability to isolate which component is the source of the issue.
  • The means to track and report the relative health of not only your components, but your services as well, without having to view everything as up or down at the component level and guess whether the application or service is OK.
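
As a rough illustration of the super-metric idea, the sketch below combines a couple of application-stack metrics into one derived health value and checks it against a normal band. The formula, weights, and thresholds are invented for illustration, not a vCOps-supplied definition:

```python
# Illustrative sketch of a derived "super metric" over application-stack metrics.
def app_health_super_metric(cache_hit_rate, free_memory_mb, total_memory_mb):
    memory_headroom = free_memory_mb / total_memory_mb
    # Weighted blend of two health signals, scaled to 0-100.
    return 100 * (0.6 * cache_hit_rate + 0.4 * memory_headroom)

NORMAL_RANGE = (60.0, 100.0)   # learned or configured normal band

score = app_health_super_metric(cache_hit_rate=0.72,
                                free_memory_mb=512, total_memory_mb=4096)
low, high = NORMAL_RANGE
if not low <= score <= high:
    print(f"Super metric alert: health score {score:.0f} is outside {low}-{high}")
```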

And it’s all possible because we can now actually use the lowly metric.

For future updates, follow @VMwareCloudOps on Twitter and join the conversation using the #CloudOps and #SDDC hashtags.