As noted in our Matrix-inspired first blog in this series, Wavefront by VMware (Wavefront) histograms are a new metric data type that lets you perform accurate percentile calculations across aggregate sources, including high-velocity, high-volume metrics at sub-second intervals. Latency is a great example of a high-velocity metric that’s important to measure with a deeper understanding of its distribution, as its variability will impact end-user experience and SLAs.
Measuring High-Velocity Application Latencies is Challenging
Working with high-velocity metrics like latency is a challenge for DevOps. Many engineers use tools that aggregate latency measurements from hosts and forward a predetermined percentile calculation at a preset interval. This crude approach doesn’t store the raw measurements or any deeper distribution information with the time series. If you tried this, you might have gathered a series of calculated 99th percentile values reported at one point per second (1 pps), or at another set interval. But without more context about the distribution that went into those calculations, you’d be limited. You couldn’t further aggregate this latency percentile metric across multiple sources, because a percentile of percentiles is not the percentile of the combined data, and you couldn’t calculate other percentiles afterward to get a practical understanding of how your system performs across the full spectrum of your users and events.
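To see why pre-computed percentiles can’t be combined across sources, here’s a minimal Python sketch. The two hosts and their latency values are made up for illustration, and the nearest-rank percentile used here is just one common definition, not necessarily the one any particular tool applies.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds from two hosts (illustrative only).
host_a = [10] * 1000              # busy host, uniformly fast
host_b = [10] * 90 + [500] * 10   # quiet host with a slow tail

p99_a = percentile(host_a, 99)
p99_b = percentile(host_b, 99)

# What you get if you only stored each host's p99 and average them later.
averaged_p99 = (p99_a + p99_b) / 2

# What the p99 actually is across all requests from both hosts.
true_p99 = percentile(host_a + host_b, 99)

print(p99_a, p99_b, averaged_p99, true_p99)
```

In this made-up case the averaged figure lands far from the p99 of the combined requests, and nothing in the two stored p99 values lets you recover the right answer.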
Wavefront by VMware Histogram Metric Data Type Retains Critical Application Performance Detail
Wavefront by VMware histogram functionality solves these problems by using your raw source data to calculate and store the distribution of a metric. This distribution is effectively a histogram data point that’s forwarded to and stored in the Wavefront by VMware cloud for analytics, alerting, and visualization. With metric distribution details stored as histograms, you can calculate any percentile on-demand. Moreover, Wavefront by VMware enables you to combine histogram points and calculate percentiles across multiple sources and any tagged dimension.
Here’s how it works:
First, raw metric data is sent to the Wavefront by VMware proxy. This data can come from multiple collection agents and at rates faster than 1 pps. These metrics are streamed in the same data format that Wavefront utilizes for “regular” metrics.
Within the proxy, Wavefront by VMware calculates a distribution of all the raw metric data over multiple set time intervals (minute, hour, or day). This distribution includes centroid values, which here are an array of latency ranges and then a count of measurements that fell into each range. For a latency metric histogram over a timed interval of say 1 minute, you might get something that looks like this:
This is where the histogram data type gets its name, but it’s a lot more than a visualization. In fact, the visualization isn’t really the crucial part. The crucial part is that we’re passing on an array of values for a single histogram data point – the associated distribution. To get a better idea of what this data point looks like, have a look at just the centroid pair values and counts from above:
Represented as an array of values, it’s reduced to something like this:
{1,3; 2,4; 3,35; 5,50; 10,45…}
Now you can see how this begins to look like a single data point (or record). The first number in each pair represents the centroid, which here is a latency range. The second number represents the count of raw data points (latency measurements) that fell into that range. This array of values is how a full distribution of metric data over a set interval is represented within one metric histogram point. It’s also pretty easy to see how we can then extract any percentile from this distribution information.
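As a rough illustration of both steps, here’s a minimal Python sketch that buckets raw measurements into centroid/count pairs and then reads a percentile back out. It isn’t the proxy’s actual algorithm, which this post doesn’t spell out. It assumes each centroid labels the upper bound of a latency range, uses a nearest-rank percentile, and the function names and raw measurements are hypothetical. The `post_histogram` values are only the five pairs visible in the array above; the rest of that array is elided and is not filled in here.

```python
from bisect import bisect_left

def build_distribution(measurements, upper_bounds):
    """Count measurements per range; each range is labeled by its upper bound.

    Assumption: upper_bounds is sorted ascending. Values above the last
    bound are clamped into the last range.
    """
    counts = [0] * len(upper_bounds)
    for value in measurements:
        index = min(bisect_left(upper_bounds, value), len(upper_bounds) - 1)
        counts[index] += 1
    return list(zip(upper_bounds, counts))

def percentile(histogram, pct):
    """Nearest-rank percentile from (centroid, count) pairs sorted by centroid."""
    total = sum(count for _, count in histogram)
    if total == 0:
        raise ValueError("empty histogram")
    rank = max(1, -(-pct * total // 100))  # integer ceiling of pct * total / 100
    seen = 0
    for centroid, count in histogram:
        seen += count
        if seen >= rank:
            return centroid

# Hypothetical raw latencies for one interval.
raw = [0.4, 0.9, 1.5, 2.2, 2.9, 3.0, 4.1, 7.5, 9.9]
built = build_distribution(raw, [1, 2, 3, 5, 10])
print(built)

# The five centroid/count pairs shown in the post (the array is truncated there).
post_histogram = [(1, 3), (2, 4), (3, 35), (5, 50), (10, 45)]
p50 = percentile(post_histogram, 50)
p99 = percentile(post_histogram, 99)
print(p50, p99)
```

Because the counts are kept, any other percentile can be requested later from the same stored point without going back to the raw measurements.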
Now, imagine collecting metric histogram data for multiple, 1-minute intervals. Have a closer look at the data that we’d have stored in Wavefront for the first 6 minutes:
And here’s a simple view on this data after applying a few basic percentile ts() functions:
This is a very simple, yet illustrative example. In Wavefront, you have access to the full power and flexibility of the Wavefront Query Language, so you can visualize and alert using the analytics ts() functions applied to your time-series metric histogram data.
What About Histogram Aggregation? Does it Work Across Sources, too?
Absolutely! While Wavefront constructs the metric histogram data points at the proxy, we can also combine and manipulate histogram time series after they’ve been forwarded to and stored in the Wavefront cloud. This means we can handle raw metric points that arrive at the proxy late and, using their timestamps, add them to their corresponding histogram points.
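The idea of filing a late point under the right histogram point can be sketched in a few lines of Python. This is a toy in-memory stand-in, not Wavefront’s storage or API. It assumes 1-minute intervals, timestamps in whole seconds, and measurements that have already been assigned to a centroid; `store`, `minute_start`, and `add_point` are hypothetical names.

```python
from collections import Counter, defaultdict

def minute_start(timestamp_s):
    """Start of the 1-minute interval that contains this timestamp (seconds)."""
    return timestamp_s - timestamp_s % 60

def add_point(store, timestamp_s, centroid):
    """Add one measurement to the histogram point for its own interval."""
    store[minute_start(timestamp_s)][centroid] += 1

# interval start -> {centroid: count}
store = defaultdict(Counter)

add_point(store, 120, 5)    # arrives on time, interval starting at 120
add_point(store, 185, 10)   # arrives on time, interval starting at 180
add_point(store, 130, 5)    # arrives late, but its timestamp puts it in interval 120

print(dict(store))
```

The late point ends up in the interval its timestamp belongs to, not the interval in which it happened to arrive.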
Along the same lines of flexibility, Wavefront can combine histogram points of the same metric that were sent to different proxies and combine those histograms from different sources for consolidated analytics. This is all possible because each histogram data point contains original distribution details, rather than just a simple time series of single metric values.
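Here’s a minimal Python sketch of why this works: because each histogram point carries counts, points from different sources can be merged by adding counts per centroid, and percentiles can then be taken over the merged result. The two source histograms are made up, the merge assumes both sources share the same centroid values, and the nearest-rank percentile is an illustrative choice, not a description of Wavefront’s internals.

```python
from collections import Counter

def merge(*histograms):
    """Sum counts per centroid across (centroid, count) histograms."""
    totals = Counter()
    for histogram in histograms:
        for centroid, count in histogram:
            totals[centroid] += count
    return sorted(totals.items())

def percentile(histogram, pct):
    """Nearest-rank percentile from (centroid, count) pairs sorted by centroid."""
    total = sum(count for _, count in histogram)
    if total == 0:
        raise ValueError("empty histogram")
    rank = max(1, -(-pct * total // 100))  # integer ceiling of pct * total / 100
    seen = 0
    for centroid, count in histogram:
        seen += count
        if seen >= rank:
            return centroid

# Hypothetical histogram points for the same minute from two sources.
source_a = [(5, 90), (10, 10)]
source_b = [(5, 10), (50, 90)]

merged = merge(source_a, source_b)
print(merged)
print(percentile(merged, 50), percentile(merged, 99))
```

The same merge could be grouped by any tag instead of by source, since the only requirement is that counts for matching centroids are added together.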
Customize Wavefront by VMware Histograms for Your Environment
Now that you know a bit about how the metric histogram data type works, it’s time to start thinking about how histograms can help you take your monitoring to the next level.
Of course, it starts with planning the right metrics to collect, and then deciding where you need high-velocity measurements in particular. Use these metrics to baseline system behavior before users start seeing a problem; histograms can be really helpful here for capturing the full breadth of distributions and outliers. You’ll also be ready to dig deep into troublesome scenarios when they occur, where outliers really matter. You can run and visualize various ts() percentile functions to see how your environment behaves during and after the incident.
By varying and viewing a wide range of percentiles, you’ll get a holistic picture of how your environment operates under different conditions. Then, with analytics and the Wavefront Query Language, you can start to identify the leading indicators of problems emerging across your entire environment. Any of these analytics can be easily turned into proactive alerts. And don’t forget to make use of Wavefront’s ability to back-test new alerts before moving them to production.
Remember, as detailed in the first blog of this series, when it comes to reporting high-velocity metrics about your applications and infrastructure, particularly those gathered across many distributed sources, your reported performance data may not tell you what you think it does. Sub-second web requests and API response times are just a few high-velocity metrics that come to mind here. Use the Wavefront histogram metric data type to avoid common monitoring illusions, where simplified metric data distorts your operational reality. Your users and customers will be happy you did.
Get started with Wavefront by VMware software today, and see what metric histograms can do for your cloud applications.