
Detecting VM Performance Anomalies With Wavefront

(Editor’s Note: Interested in trying Wavefront? Sign up here. Free trial)

Introduction

More than 80% of all workloads run on virtualized infrastructure. Monitoring the performance of the virtual machines that make up this infrastructure is important for achieving high application performance, improving cluster utilization, and lowering costs.

To aid performance monitoring, modern hypervisors such as ESX are highly instrumented and can collect hundreds of performance metrics per VM. While these metrics are extremely valuable for detecting and diagnosing performance issues, their large number makes it challenging to combine them into a coherent picture of VM performance. The sheer number of metrics also makes them hard to collect and store. As a result, administrators are often restricted to monitoring only a small fraction of the metrics, potentially missing out on insights that the complete set has to offer.

In this article, we shall see how to continuously collect, store, and analyze all the performance metrics collected from ESX. Doing so allows us to detect performance anomalies (periods when VM performance doesn’t match expected behavior), both in real-time and after-the-fact.

Monitoring VM metrics

The ESX hypervisor collects the CPU, memory, and I/O resource utilization of VMs. These metrics are available through tools such as esxtop. On a typical vSphere 6.5 ESX host, for example, the total number of metrics reported by esxtop exceeds 10,000. Due to the sheer data volume and velocity, collecting and storing these metrics continuously (say, every 30 seconds) is not trivial!

This is where Wavefront comes in. Wavefront is a time-series database that provides metric collection, storage, retrieval, and visualization. In addition to being very fast, it also supports a rich query language, allowing users to answer queries such as “which VMs are consuming more than 90% CPU over a one hour period?”.

Ingesting data into Wavefront is easy. Every metric value is represented as a tuple of (metric-name, value, timestamp, source=esxhost, VM=vm-name). This string representation is then simply sent over a socket to a Wavefront proxy, and looks something like this:

esxtop.memory.free, 345, 15000003534, source=promd.east.eng.com, VM=mysql_vm
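As a rough illustration of the ingestion step, the Python sketch below builds one data point as a text line and writes it to a proxy over a TCP socket. The helper names (format_point, send_points) are hypothetical. The space-separated layout with a quoted tag value, and the proxy port 2878, are assumptions based on a common Wavefront proxy setup rather than anything stated in this article, so check them against your own proxy configuration.

```python
import socket
import time


def format_point(metric, value, source, tags, timestamp=None):
    """Render one data point as a single text line (layout is an assumption)."""
    ts = int(timestamp if timestamp is not None else time.time())
    tag_str = " ".join(f'{key}="{val}"' for key, val in sorted(tags.items()))
    return f"{metric} {value} {ts} source={source} {tag_str}".rstrip()


def send_points(lines, host="localhost", port=2878):
    """Send newline-terminated points to a proxy over a plain TCP socket.

    host and port are placeholders; 2878 is assumed to be the proxy's
    listening port.
    """
    payload = "".join(line + "\n" for line in lines).encode("utf-8")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Build the example point from the text (values copied from the article).
line = format_point(
    "esxtop.memory.free",
    345,
    "promd.east.eng.com",
    {"VM": "mysql_vm"},
    timestamp=15000003534,
)
print(line)
```

Calling send_points([line]) would then push the point to a proxy running on the local machine, assuming one is listening.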

We collect esxtop metrics by periodically (every 30 seconds) parsing esxtop's batch-mode output (-b), in which each column corresponds to a metric name. We also “tag” metrics with their corresponding VM for easier querying. Wavefront allows the metrics to be queried either through the web UI or through the HTTP API.
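The parsing step can be sketched roughly as follows. This is a minimal Python example, not our actual collector: it assumes the batch output is CSV whose first column is a timestamp and whose remaining column headers look like \\host\Group(instance)\Counter, with VM-related groups carrying an id:name instance. The SAMPLE text is hand-written for illustration, and the naming scheme produced by slug is hypothetical.

```python
import csv
import io
import re

# Hand-written sample in the assumed batch-mode CSV layout (not real output).
SAMPLE = (
    r'"(PDH-CSV 4.0) (UTC)(0)","\\esx1\Memory\Free MBytes",'
    r'"\\esx1\Group Cpu(1234:mysql_vm)\% Used"' "\n"
    '"07/14/2017 02:45:53","345","12.5"\n'
)


def slug(text):
    """Turn a counter label such as '% Used' into a metric-name fragment."""
    text = text.replace("%", "pct").lower()
    return re.sub(r"[^a-z0-9]+", "_", text).strip("_")


def parse_header(column):
    """Split one column header into (host, metric name, VM name or None)."""
    host, group, counter = column.lstrip("\\").split("\\", 2)
    match = re.fullmatch(r"(.*?)\((.*)\)", group)
    group_name = match.group(1) if match else group
    instance = match.group(2) if match else None
    vm = instance.split(":", 1)[1] if instance and ":" in instance else None
    return host, f"esxtop.{slug(group_name)}.{slug(counter)}", vm


def parse_batch(text):
    """Yield one dict per (row, column) pair, skipping non-numeric cells."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)
    columns = [parse_header(col) for col in header[1:]]
    for row in rows:
        for (host, metric, vm), raw in zip(columns, row[1:]):
            try:
                value = float(raw)
            except ValueError:
                continue
            yield {"time": row[0], "source": host, "metric": metric,
                   "vm": vm, "value": value}


points = list(parse_batch(SAMPLE))
for point in points:
    print(point)
```

Each resulting dict carries the source host, the metric name, and (where present) the VM tag, which is the information needed to build one Wavefront data point.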

Detecting anomalies in VM performance

The continuously collected esxtop metrics allow us to either go back in time and diagnose performance issues, or do real-time anomaly detection. Our anomaly detection pipeline is unsupervised and works in a black-box manner. That is, we detect anomalies in the performance of applications running inside VMs without requiring application-level performance metrics such as response time and throughput. Instead, we use the “low-level” esxtop metrics and CPU performance counters (collected using vmkperf) to infer anomalies.

To detect anomalies, we use a sliding window of past metric values, and employ multi-dimensional clustering to identify when the metric values deviate “too much” from past behavior. We feed in the esxtop metrics and generate an anomaly score for any given VM. The anomaly score indicates the likelihood that the VM's performance deviates significantly from past behavior.
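To make the idea concrete, here is a minimal Python sketch of a sliding-window, clustering-based score. It is not our production pipeline: the article does not name a clustering algorithm, so the sketch substitutes a tiny k-means, and the class name, window size, and k are hypothetical. The score is the distance from the new sample to its nearest cluster centroid, divided by the median of that same distance over the window, so it is an unbounded ratio rather than a probability. A real pipeline would also normalize each metric first so that no single large-valued metric dominates the distance.

```python
import math
from collections import deque


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10):
    """Tiny k-means, seeded deterministically with evenly spaced samples."""
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid; keep the old one if its group is empty.
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


class AnomalyScorer:
    """Scores each new sample against clusters of the previous samples."""

    def __init__(self, window=60, k=2):
        self.window = deque(maxlen=window)
        self.k = k

    def score(self, sample):
        history = list(self.window)
        self.window.append(tuple(sample))
        if len(history) < 2 * self.k:
            return 0.0  # not enough history to judge yet
        centroids = kmeans(history, self.k)

        def nearest(p):
            return min(dist(p, c) for c in centroids)

        typical = sorted(nearest(p) for p in history)[len(history) // 2]
        return nearest(sample) / (typical + 1e-9)


# Synthetic two-metric samples alternating between two operating regimes.
scorer = AnomalyScorer(window=40, k=2)
for i in range(40):
    if i % 2 == 0:
        scorer.score((10 + i % 3, 100 + i % 5))
    else:
        scorer.score((50 + i % 3, 400 + i % 5))

normal_score = scorer.score((11, 102))  # resembles past behavior
spike_score = scorer.score((95, 900))   # far from both regimes
print(normal_score, spike_score)
```

In this toy run the sample that resembles past behavior scores close to zero, while the sample far from both regimes scores much higher.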

Putting it all together

We compute anomaly scores for each VM continuously, and store them as a time series in (you guessed it) Wavefront. Administrators and users can then, for instance, set alarms to trigger whenever their VM's anomaly score crosses a threshold.
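The thresholding logic behind such an alarm can be sketched in a few lines of Python. This is only an illustration of the idea and not Wavefront's alerting API: the function name, the threshold, and the min_points debounce are all hypothetical, and the series values are made up.

```python
def alert_intervals(series, threshold, min_points=3):
    """Return (start, end) timestamps of runs where the score stays above
    the threshold for at least min_points consecutive samples."""
    spans, run = [], []
    for ts, score in series:
        if score > threshold:
            run.append(ts)
            continue
        if len(run) >= min_points:
            spans.append((run[0], run[-1]))
        run = []
    if len(run) >= min_points:  # a run may still be open at the end
        spans.append((run[0], run[-1]))
    return spans


# (timestamp in seconds, anomaly score) pairs, sampled every 30 seconds.
series = [(0, 0.2), (30, 5.1), (60, 6.0), (90, 7.2), (120, 0.3), (150, 9.0)]
spans = alert_intervals(series, threshold=3.0)
print(spans)
```

Requiring several consecutive samples above the threshold is one simple way to avoid firing on a single noisy point.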

In addition to the anomaly scores, we can also identify the metrics that exhibit the most anomalous behavior. We can select the top “k” such metrics and highlight them on the metrics dashboard. This kind of metric ranking is important because we collect hundreds of metrics per VM, and displaying the most “interesting” ones helps in root-causing performance issues.
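One simple way to rank metrics is sketched below in Python. The article does not specify how the ranking is computed, so this example substitutes a per-metric z-score (how many standard deviations the current value sits from its recent mean). The function name, metric names, and values are hypothetical.

```python
import statistics


def rank_metrics(history, current, k=3):
    """Return the k metrics whose current value deviates most from history.

    history maps metric name -> list of past values;
    current maps metric name -> latest value.
    """
    scores = {}
    for name, past in history.items():
        mean = statistics.fmean(past)
        spread = statistics.pstdev(past) or 1e-9  # guard against flat series
        scores[name] = abs(current[name] - mean) / spread
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


history = {
    "cpu.pct_used": [10, 12, 11, 9],
    "memory.free_mbytes": [340, 345, 350, 345],
    "disk.reads": [100, 100, 101, 99],
}
current = {"cpu.pct_used": 11, "memory.free_mbytes": 200, "disk.reads": 140}

top = rank_metrics(history, current, k=2)
print([name for name, _ in top])
```

Here the CPU metric stays within its usual range, so the disk and memory metrics are the ones surfaced.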

All the esxtop raw-metrics and anomaly scores, coupled with Wavefront’s querying and visualization, make a powerful combination!