Abstract
Thanks to a code contribution from VMware, version 1.8 of the Telegraf metric collector fully supports pulling metrics from vSphere. Since Telegraf is the underlying metric collector for many of the metric sources available to Wavefront, this brings the full set of vSphere metrics to Wavefront.
In this article we will discuss why this is important and how various personas in an organization can use this to quickly troubleshoot and optimize their applications.
What is Wavefront?
Wavefront is a Software-as-a-Service (SaaS) offering from VMware for time series analysis. Thanks to its virtually infinitely scalable architecture, it can consume, analyze and correlate millions of data points per second. Wavefront supports data from virtually any source and is used by organizations for tasks ranging from advanced troubleshooting to application optimization and business process optimization. More information is available here: https://www.wavefront.com/
What is Telegraf?
Telegraf is an open source performance and health metric collector that has become something of an industry standard for metric collection. It owes much of its success to its policy of welcoming contributions from third parties and supports, at the time of writing, over 130 data sources ranging from hardware to application frameworks and middleware. Read more about it here: https://www.influxdata.com/time-series-platform/telegraf/
Doesn’t vRealize Operations already monitor vSphere?
Let’s address this one upfront: There is no doubt vRealize Operations is the gold standard for vSphere monitoring, alerting, capacity planning and cost analysis. Going after those use cases would be insane and is certainly not the reason we are doing this. Instead, it’s about augmenting the monitoring we are already doing in Wavefront with metrics from the virtualization infrastructure. One of the sweet spots of Wavefront is application monitoring. But applications don’t live in a vacuum. They all depend on some kind of infrastructure. And when we are troubleshooting an application, it’s sometimes useful to correlate application behavior against events in the infrastructure. The mantra of Wavefront is “the more data you send to it, the more useful it becomes”. The main strength of Wavefront is the ability to correlate huge sets of time series data to find patterns leading us to a root cause. Having access to virtualization infrastructure data is an important piece of that puzzle.
Another way of putting it is that Wavefront is your “first pane of glass”, i.e. a convenient tool for quick troubleshooting that spares 15 people from having to stare at vCenter during a 3 a.m. Saturday conference call.
Overview of the Solution
The basic idea is very simple: We can point Telegraf to a vCenter and collect as many or as few metrics as we like. All we need is an address and login credentials to vCenter and the address of a Wavefront proxy and we’re ready to do some monitoring!
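To make the moving parts concrete, here is a minimal sketch in Python that renders a Telegraf configuration with one vSphere input and one Wavefront output. The option names (vcenters, username, password, host, port) follow the plugin documentation as we understand it for this Telegraf generation, so check the plugin READMEs for your version. Every address and credential is a placeholder, and 2878 is assumed to be the port your Wavefront proxy listens on.

```python
# Sketch: render a minimal telegraf.conf for the vSphere input and the
# Wavefront output. All addresses and credentials are placeholders.

def render_telegraf_conf(vcenter_url, username, password, proxy_host, proxy_port=2878):
    """Return the text of a minimal Telegraf configuration."""
    return "\n".join([
        "[[inputs.vsphere]]",
        f'  vcenters = ["{vcenter_url}"]',    # SDK endpoint of the vCenter
        f'  username = "{username}"',
        f'  password = "{password}"',
        "",
        "[[outputs.wavefront]]",
        f'  host = "{proxy_host}"',           # address of the Wavefront proxy
        f"  port = {proxy_port}",
        "",
    ])


conf = render_telegraf_conf(
    vcenter_url="https://vcenter.example.com/sdk",
    username="monitoring@vsphere.local",
    password="CHANGE_ME",
    proxy_host="wavefront-proxy.example.com",
)
print(conf)
```

Saving the printed text as telegraf.conf and starting Telegraf with it should be enough for a first test, provided the account can read performance data from vCenter.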
Metric points are tagged to be mapped to the typical vSphere concepts, such as host, cluster and datacenter. Thanks to this, we can easily write queries that e.g. look at sums and averages across clusters or datacenters. If you are familiar with Wavefront, you know the power it gives you by allowing you to slice the data across different dimensions in seconds. Let’s take it for a spin!
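What the tags buy us can be illustrated outside Wavefront with a few lines of Python. The sketch below groups a handful of made-up points by a tag and computes the sum and average per group. The tag names (vmname, clustername, dcname) are assumptions for illustration, as are all the values; in Wavefront the same grouping is done in the query itself.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample points: (value, tags). Tag names and values are
# illustrative assumptions, not data from a real vCenter.
points = [
    (20.0, {"vmname": "web-01", "clustername": "prod", "dcname": "dc-east"}),
    (40.0, {"vmname": "web-02", "clustername": "prod", "dcname": "dc-east"}),
    (10.0, {"vmname": "db-01", "clustername": "test", "dcname": "dc-east"}),
]


def aggregate_by_tag(points, tag):
    """Group point values by the given tag and return sum and average per group."""
    groups = defaultdict(list)
    for value, tags in points:
        groups[tags[tag]].append(value)
    return {key: {"sum": sum(vals), "avg": mean(vals)} for key, vals in groups.items()}


by_cluster = aggregate_by_tag(points, "clustername")
by_datacenter = aggregate_by_tag(points, "dcname")
print(by_cluster)
print(by_datacenter)
```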
First, let’s just run a simple query showing the CPU run time across all of my VMs. Go ahead, move your mouse over the diagram! It’s real data. This, by the way, is one of the really cool features in Wavefront: I can take any chart and embed it as a live widget on any webpage!
That’s an awful lot of data! Upon closer inspection, we find that it’s showing us data for every virtual CPU core. Since we don’t need that, we can add a filter to the query so that it only shows the per-VM average produced by vSphere. We do this by adding “cpu=instance_total” to the query.
But what if that’s not what we want? Maybe we’re only interested in the virtual core that works the hardest. No problem, we can use the interactive Query Builder to aggregate the core metrics into a metric reflecting the busiest core.
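The two views discussed above, the vSphere-provided average and the busiest core, differ only in how the per-core series are selected and combined. Here is a rough Python sketch of that logic over made-up samples for a single timestamp. The VM names and values are hypothetical, and the tag value “instance_total” is taken from the filter mentioned earlier.

```python
# Hypothetical samples at one timestamp: (vm name, cpu tag, CPU run time in ms).
samples = [
    ("web-01", "0", 120.0),
    ("web-01", "1", 480.0),
    ("web-01", "instance_total", 300.0),
    ("db-01", "0", 900.0),
    ("db-01", "1", 100.0),
    ("db-01", "instance_total", 500.0),
]


def vm_average(samples):
    """Keep only the per-VM aggregate series (the cpu=instance_total filter)."""
    return {vm: value for vm, cpu, value in samples if cpu == "instance_total"}


def busiest_core(samples):
    """Take the maximum over the individual cores of each VM."""
    result = {}
    for vm, cpu, value in samples:
        if cpu == "instance_total":
            continue  # skip the aggregate, we only want real cores
        result[vm] = max(value, result.get(vm, value))
    return result


print(vm_average(samples))
print(busiest_core(samples))
```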
Correlating Behavior
Looking at the chart above, it’s clear that something is causing CPU spikes about once an hour. Upon further inspection, it turns out this is isolated to a VM called “freenas-01”, which happens to be running parts of the storage in my lab. Could it be that some behavior of another VM is putting load on freenas-01 once every hour, causing those spikes? The data is very noisy and the correlation is bound to be weak, but let’s give it a try. We are correlating CPU usage on freenas-01 with CPU usage of every VM in the system and picking the top one. In fact, we’re picking the top two, since freenas-01 will of course have perfect correlation with itself.
It turns out we have a fairly weak but significant correlation with vc-01, which happens to be my vCenter. This is probably due to vCenter performing some housekeeping tasks every hour. The correlation is bound to be fairly weak, since there are a lot of workloads using the FreeNAS, but by singling out the top correlation, we’ve found a candidate for further exploration.
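The ranking idea can be sketched in plain Python. This is not how Wavefront computes correlation internally; it simply uses the Pearson coefficient to rank every VM against freenas-01 and keeps the top two, the first of which is freenas-01 itself. All CPU values are invented, and “web-01” is a hypothetical VM name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a flat series carries no correlation information
    return cov / (sx * sy)


def top_correlated(series, target, k=2):
    """Return the k series names most correlated with the target series."""
    ranked = sorted(
        series,
        key=lambda name: pearson(series[target], series[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up CPU usage samples, one value per interval.
series = {
    "freenas-01": [5, 6, 30, 5, 6, 31, 5, 6],
    "vc-01": [10, 12, 25, 11, 10, 27, 12, 11],
    "web-01": [40, 38, 39, 41, 40, 39, 41, 40],
}

top = top_correlated(series, "freenas-01")
print(top)  # the target itself comes first, the best candidate second
```

Real data is far noisier than this toy example, which is why the correlation found in the lab is much weaker than anything these invented numbers would suggest.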
Correlate Applications to Infrastructure
We’ve shown how you can easily correlate various parts of the virtual infrastructure to find the root cause of some behavior. Since Wavefront stretches into the application domain as well, we can easily extend this to correlation between applications and infrastructure. Let’s look at an example:
Our database cluster is experiencing spikes in response times at random intervals. We start by correlating database engine performance to the performance of underlying data stores. As you can see, there’s a clear correlation between datastore performance and database response time spikes.
It looks as if the spikes in response time happen when the datastore is handling a high number of IOPS.
But wait a minute! Some of the databases don’t seem to experience those response time spikes at all. Look at the graph above, and you’ll see a cluster of database servers very close to zero response time. What do they have in common? Let’s break it down per host and see if there’s a difference. To do this, we use the “sum” function in Wavefront with a grouping based on host name. We’re also applying some smoothing using a moving median to remove some insignificant spikes and noise.
Here’s our smoking gun! Access from esxi-01 is six times slower. Further investigation in Wavefront shows that a network card on that host is locked to 1 Gbps. This shows the power of collecting metrics across the stack and applying advanced correlation.
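For readers who want to see the per-host breakdown spelled out, here is a small Python sketch of the two steps: summing series grouped by host, then smoothing with a moving median. It is an approximation for illustration only. Wavefront works on time windows, while this sketch uses a fixed number of points, and all database names, host assignments and response times are made up.

```python
from statistics import median


def sum_by_host(series_by_db, host_of):
    """Add up the series of all databases that run on the same host."""
    totals = {}
    for db, values in series_by_db.items():
        host = host_of[db]
        if host not in totals:
            totals[host] = list(values)
        else:
            totals[host] = [a + b for a, b in zip(totals[host], values)]
    return totals


def moving_median(values, window=3):
    """Trailing moving median; the window is shorter at the start of the series."""
    return [
        median(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


# Hypothetical response times in ms and a hypothetical database-to-host mapping.
response_ms = {
    "db-01": [2, 2, 50, 2, 2],   # contains one insignificant spike
    "db-02": [3, 3, 3, 3, 3],
    "db-03": [12, 13, 12, 14, 12],
}
host_of = {"db-01": "esxi-02", "db-02": "esxi-02", "db-03": "esxi-01"}

per_host = sum_by_host(response_ms, host_of)
smoothed = {host: moving_median(values) for host, values in per_host.items()}
print(smoothed)
```

The isolated spike on esxi-02 disappears after smoothing, while the consistently higher level on esxi-01 remains visible, which is the kind of difference the breakdown is meant to expose.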
Dashboards and Installers
The next update of Wavefront will feature a full set of vSphere dashboards, as well as a standard installer. These are some samples of the dashboards. Note that the final dashboards may be somewhat different.
What’s Next?
While the installer and dashboards are still to be released, you can already download the latest release of Telegraf here: https://github.com/influxdata/telegraf/releases
The Wavefront sales engineering and customer success teams can help with building dashboards and running queries.
Take it for a Spin?
Interested in this plugin but don’t yet have a Wavefront subscription? No problem! Sign up for a demo account here: https://www.wavefront.com/sign-up/