
What if I Told You Your Monitoring Data is Lying: Why Histograms are Critical for Accurate Reporting of Hi-Velocity Metrics (part 1 of 2)

When it comes to reporting high-velocity metrics about your applications and infrastructure – particularly those gathered across many distributed sources – your reported performance data may not tell you what you think it does. Sub-second web requests and API response times are just two examples of such metrics. Developers and DevOps engineers can settle into blissful ignorance with simplified metric data that actually distorts reality, unaware of the problems it can create for themselves and their enterprise’s business.

As in the 1999 film The Matrix, there is a blue pill and a red pill to choose from. When it comes to monitoring high-velocity metrics for modern distributed systems, some may not even realize that they’re already following a deceptive “blue pill” approach when a much more accurate “red pill” approach is available. In the era of cloud-native distributed applications, you need the awareness and tools to change your approach more than ever.

Today’s Highly Distributed Applications Require a More Accurate Monitoring Approach
Your approach needs to change because modern cloud-native distributed applications exhibit more dynamic and erratic behavior. To assure their continuous performance, it’s becoming essential to analyze high-velocity (sub-second) metrics on distributed systems (microservices) that handle thousands, tens of thousands, or hundreds of thousands of requests per second.

The ability to analyze dynamic service or application performance at any level of detail, across multiple dimensions – by source, host, API call, or other – is a powerful troubleshooting capability for ensuring service reliability and SLA adherence. But the sheer scale and performance required to process such granular, high-velocity metric data across your entire environment can easily overwhelm your infrastructure. Now there’s a way to scale this, as you’ll read later in this blog.

Another problem in working with high-velocity metrics is the accurate assessment of erratic behavior (depicted by outlier data), given the way many performance monitoring tools process averages, medians, and percentiles. The problems with monitoring this way are well known, but they are exacerbated with high-velocity metrics because such metrics are often stored as longer-term averages.

Not knowing the true typical behavior can drive up your incident resolution time. If you inaccurately report on what the majority of your customers experience while they are actually experiencing degradations such as sluggish page loads, you will be slow to respond to the incident, and you risk them abandoning your service. Once customers churn, as research confirms, it’s hard to win them back.

As Morpheus explains to Neo in The Matrix, “to change something, you must be aware of it at first.” Such is the case with the deceptions of averages in monitoring data. But that’s where you can decide to instead take the red pill, and see the performance reality that most of your customers are experiencing. Let us first review a few common blue pill traps, and then introduce you to the Wavefront solution with its new histogram capabilities, your red pill alternative.

Blue Pill: Blissful Ignorance of the Averages Illusion
The first blue pill fallacy is the overreliance on averages (a.k.a. mean) for application performance metrics. While there has been lots of discussion about the flaws of averages, some people still fall victim to their inaccuracies (move along if you’re well versed on this topic). The issue with an average over time is that it’s expressed as a single number. As such, it masks the extremes, so you can’t see them. Likewise, extremes slant the average value toward that extreme, skewing it from what’s typical.

As an example, you may be monitoring a web latency metric averaged over some timeframe, but are trying to determine from that average value the highest latency that most of the website visitors are experiencing. An average latency value can’t show you that. Nor does it show how severe things sometimes get. Of course, averages have many relevant uses, but understanding how data is distributed is definitely not one of them.
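To make the point concrete, here is a minimal sketch in Python. The latency values are invented for illustration (95 fast requests and 5 very slow ones); they are not measurements from any real system.

```python
import statistics

# Hypothetical latencies in milliseconds: 95 fast requests and 5 very slow ones.
latencies_ms = [20.0] * 95 + [2000.0] * 5

mean_ms = statistics.mean(latencies_ms)      # pulled upward by the 5 slow requests
median_ms = statistics.median(latencies_ms)  # what the middle request saw
worst_ms = max(latencies_ms)                 # how bad it actually got

# Fraction of requests that were faster than the reported average.
share_below_mean = sum(1 for v in latencies_ms if v < mean_ms) / len(latencies_ms)

print(f"mean={mean_ms} ms, median={median_ms} ms, max={worst_ms} ms")
print(f"{share_below_mean:.0%} of requests were faster than the mean")
```

With this made-up data the average is 119 ms, a latency that no request actually experienced: 95% of requests took 20 ms, and the remaining 5% took 2,000 ms. The single number hides both the typical case and the severity of the worst case.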

So why do some people continue to rely so much on averages? It’s because averages are easily accessible, the basis for how most monitoring tools report on metrics, and easy to show to management. In a seminal HBR article, Sam Savage of Stanford University pithily wrote, “Decisions based on averages are wrong on average.” If you recognize yourself in the above, take solace for now, as you’re not alone. The HBR article also describes several actual, critical system failures, ranging from major flooding all the way to billions of dollars in lost revenue, that occurred when people inappropriately used averages to describe a behavior that can only be described as a distribution.

Blue Pill: Blissful Ignorance of the Median Illusion
Another blue pill fallacy is relying on the median value (also known as the 50th percentile value) as the representative or “common case” for your application performance metrics. The median represents the midpoint of a distribution. But how relevant is it to report what half of your users are experiencing? Shouldn’t you care about what most users experience? And what about the users who are having an unusual (outlier) experience? OK, a median is one simple value (and therefore easy to process and store), but it’s also too simple.

Higher percentiles are better for understanding what most users typically experience. To better understand outlier performance, you also need to monitor edge cases. So the ability to track the 95th percentile (p95), the 99th percentile (p99), or whatever pXX you want is key for more accurate visibility and assessment. Ideally, you want to understand the full distribution of a metric, e.g., to assess what the majority of your website users are experiencing. We’ll shortly get to Wavefront’s histogram data type for scalable processing of a metric’s distribution.

Blue Pill: Blissful Ignorance of the Percentiles Illusion
Percentiles are often the proposed solution to overcome the problems of averages. The XXth percentile (pXX) is the maximum value of all the measurements in a distribution after excluding the highest (100 – XX)% of them. As an example, to get the 99th percentile, you exclude the worst 1%, and the p99 is the maximum value of what remains. But before we simply conclude that percentiles are great, note that there are more wrong ways to work with percentiles than right ones.
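The definition above corresponds to what is commonly called the nearest-rank method. The following Python sketch is one way to express it; the function name `percentile_nearest_rank` and the sample data are assumptions made for illustration, and other percentile conventions (for example, ones that interpolate between values) will give slightly different answers.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method.

    Equivalent to dropping the highest (100 - pct)% of measurements and
    taking the maximum of what remains.
    """
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based position in sorted order
    return ordered[rank - 1]


# Hypothetical sample: 200 requests with latencies of 1 ms, 2 ms, ..., 200 ms.
latencies_ms = list(range(1, 201))

p50 = percentile_nearest_rank(latencies_ms, 50)
p95 = percentile_nearest_rank(latencies_ms, 95)
p99 = percentile_nearest_rank(latencies_ms, 99)
print(p50, p95, p99)
```

For this sample the worst 1% is the two slowest requests (199 ms and 200 ms), so the p99 is 198 ms, the maximum of the remaining 198 measurements.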

In short, there are two key “gotchas” with percentiles. Let’s say you want to understand what’s the max latency that 99% of your customers experienced, and to see this you must gather and assess metrics aggregated from many different hosts.

The first gotcha is how to accurately compute percentiles aggregated from many distributed sources. The second gotcha arises if you computed 95th percentiles of your metric but later change your mind and want to examine the 99th percentiles of the same aggregated metric. That’s a problem because most time series databases, if they work with percentiles at all, store them as aggregated metrics over timeframes, without awareness of the full distribution of metric values originally gathered. Averaging percentiles (without re-summing distributions) is simply bad math and leads to wrong decisions. We saw a number of our customers struggle with these gotchas, and this further motivated us to enhance the Wavefront platform with the new histogram data type.
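The "bad math" of averaging percentiles can be demonstrated with a small Python sketch. The two hosts, their request counts, and their latencies are invented for illustration, and the nearest-rank helper is an assumption about how the percentile is defined.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the max after dropping the highest (100 - pct)%."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based position in sorted order
    return ordered[rank - 1]


# Hypothetical per-host latencies in milliseconds.
host_a = [10] * 990 + [50] * 10   # busy, healthy host: 1,000 requests
host_b = [10] * 50 + [900] * 50   # struggling host: 100 requests, half of them slow

p99_a = percentile_nearest_rank(host_a, 99)
p99_b = percentile_nearest_rank(host_b, 99)

# Wrong: average the per-host percentiles.
averaged_p99 = (p99_a + p99_b) / 2

# Right: compute the percentile over the combined distribution.
true_p99 = percentile_nearest_rank(host_a + host_b, 99)

print(f"per-host p99: A={p99_a} ms, B={p99_b} ms")
print(f"average of p99s: {averaged_p99} ms, true cluster p99: {true_p99} ms")
```

With this data the averaged value is 455 ms, while the p99 of the combined requests is 900 ms. The averaged figure matches no real request and understates the cluster-wide p99 by roughly half. Computing the correct value here required every raw measurement from both hosts, which is exactly what becomes impractical at high request rates.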

Red Pill: Knowledge and Power of Wavefront’s Histograms
Finally, we can describe the red pill: the increased monitoring accuracy you get from using our new histogram capabilities. A little refresher on the histogram data type for metrics: it represents a distribution of measured values for a given metric over a specified timeframe. OK, so what? Histograms, or distributions, convey much more information about how your application and infrastructure metrics behave. They’re also particularly useful for high-velocity latency measurements. The notion sounds intuitive, but in reality, it isn’t trivial to implement full analytics-driven visualizations and alerting on metric histograms.
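To illustrate why a stored distribution addresses both percentile gotchas, here is a deliberately simple fixed-bucket histogram in Python. This is a generic sketch, not Wavefront's implementation: the class name `LatencyHistogram`, the bucket boundaries in `BOUNDS_MS`, and the sample data are all assumptions chosen for illustration.

```python
import math
from bisect import bisect_left

# Hypothetical bucket upper bounds in milliseconds; the last bucket is a catch-all.
BOUNDS_MS = [10, 25, 50, 100, 250, 500, 1000, float("inf")]


class LatencyHistogram:
    """Counts of measurements per bucket for one source over one timeframe."""

    def __init__(self):
        self.counts = [0] * len(BOUNDS_MS)

    def record(self, value_ms):
        # bisect_left finds the first bucket whose upper bound is >= value_ms.
        self.counts[bisect_left(BOUNDS_MS, value_ms)] += 1

    def merge(self, other):
        # Merging distributions is just adding counts bucket by bucket.
        merged = LatencyHistogram()
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged

    def quantile_upper_bound(self, pct):
        # Upper bound of the bucket that contains the pct-th percentile.
        total = sum(self.counts)
        if total == 0:
            raise ValueError("histogram is empty")
        rank = math.ceil(pct * total / 100)
        seen = 0
        for bound, count in zip(BOUNDS_MS, self.counts):
            seen += count
            if seen >= rank:
                return bound


# Hypothetical hosts: one healthy and busy, one struggling.
host_a, host_b = LatencyHistogram(), LatencyHistogram()
for v in [10] * 990 + [50] * 10:
    host_a.record(v)
for v in [10] * 50 + [900] * 50:
    host_b.record(v)

cluster = host_a.merge(host_b)
print(cluster.quantile_upper_bound(95), cluster.quantile_upper_bound(99))
print(host_a.quantile_upper_bound(99), host_b.quantile_upper_bound(99))
```

Each host ships only eight counters per timeframe rather than every raw measurement, yet the merged histogram still places the cluster p99 in the 500 to 1,000 ms bucket, and the same stored data also answers a p95 query (the 25 to 50 ms bucket) that nobody planned for in advance. Drilling down is a matter of querying the per-host histograms. The trade-off in this sketch is resolution: answers are only as precise as the bucket boundaries.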

One Wavefront customer, an e-commerce leader with millions of customers, first tried to develop a home-grown metrics monitoring solution based on open-source tools. Their performance team needed to understand the reliability of their API serving up to 200,000 requests per second, and whether they were meeting their critical SLAs. When incidents emerged, they also wanted to troubleshoot starting at their top-level services (as a whole) and then be able to quickly drill down into the performance of any individual host. Since their monitoring pipeline is centered around cluster-level performance, if they received a cluster-level alert based on a percentile, they wanted to drill down into the associated host-level percentiles and understand which hosts were running hot.

Due to the dispersed and voluminous nature of the high-velocity metrics to be processed, their performance team ran into network bandwidth limitations, processing limitations, and resulting averaging inaccuracies. It very soon became clear they needed a high-scale, high-velocity commercial solution, as their home-grown solution was taking too many resources to maintain and scale. The increasing complexity that came with their modern distributed cloud applications required them to care about both aggregate (system/cluster-level) performance and the performance of any individual host. Continuous, system-wide assessment lets them know if customers are being negatively impacted. Drill-down capabilities down to individual hosts are essential to troubleshooting specific incidents.

Some of the initial tools this customer used – such as Ganglia and Telegraf – could only compute pre-defined percentiles from single sources (and we are seeing the same shortcomings with StatsD, Graphite and other basic commercial metric tools). That approach didn’t allow combining and computing percentiles across sources, which they realized only later, and acting on the resulting inaccurate data caused them a lot of headaches. They also realized that if they tried to simply combine percentiles, it resulted in false readings on key SLA and latency measurements. Again, false measurements led to bad decisions and unhappy customers.

The Wavefront platform now captures and computes distributions for all your high-velocity metrics, so you can apply our full query-driven analytics to your metrics using the histogram data type. Our software service also preserves and stores histogram data so that it can be processed later as you need. With Wavefront histogram support, you can:

  • Reliably measure and aggregate quantiles/percentiles of your high-velocity metrics such as application response times and services SLAs
  • Reliably measure and aggregate quantiles/percentiles of high-velocity metrics from multiple sources or other dimensions
  • Calculate aggregated percentiles across multiple sources

Wavefront’s histogram support is the most reliable and cost-efficient way to understand the true performance of highly distributed services. This includes the accurate assessment of high-velocity metrics. Examples include: (a) the aggregated latency of an API service that’s hit with hundreds of thousands of requests per second, or (b) reporting and alerting off of 99th or 95th percentiles of latency metrics for external web services processing 50,000 or more events per second. Don’t be misled by monitoring’s blue pill falsehoods of averages, medians, and percentiles. Choose monitoring’s red pill, the reality of truth: Wavefront’s histograms for high-velocity metrics at scale.

In part two of this blog series on histograms, Paul Clenahan, Wavefront’s VP of Product Management, will provide a deeper dive into our histogram technology and further explore how to use Wavefront histograms to improve your cloud application monitoring. Until then, try Wavefront.