The situation is tense: people yelling at one another, monitors flashing with numbers, charts going in all sorts of directions — the pressure is on to figure out what’s going on and how to fix it. It is, perhaps, just another day at the ICU of a typical hospital, where, every day, like clockwork, teams of medical professionals make split-second decisions that can determine whether a person can live to see another day or not. In many cases, tests or panels would have taken hours or days to complete, and searching on the internet for a clue or visiting the patient’s house (a la “Dr. House”) is certainly out of the question.
Remarkably, decisions are often made with a quick glance or a recollection of the patient’s medical history. That, combined with the experience and intuition of the attending physician, and their interpretation of the numbers that we all seem to be intrigued by — those blinking digits displayed on TRON-like monitors tracking the status of virtually every ICU patient — enables them to make the right call. It is these vital signs (ECG/EKG, heart-rate, blood pressure, PAP, SpO2, RR, etc.) that allow medical professionals to gauge the state of patients and, perhaps more importantly, anticipate what they are going to experience next.
In recent years, developers increasingly ask the same questions: What do the vitals of my application look like? How were they in the last hour? In the last day? In the last week? Last month? And if they aren’t looking good, what are the tools at my disposal to track down the cause of the aberration? Is there room for better performance? Where should I focus my energy?
More importantly, digital businesses and traditional industries alike are beginning to examine their applications, architectures, and infrastructure to ask similar questions: How are the vitals of my business, and is there something that doesn’t look right? How do I get to the root cause? How can we control cost by getting the most out of our current hardware? This has led to the emergence of solutions termed “Application Performance Management” (APM), which can be summed up as the “translation of IT metrics into business meaning (value)” [1].
At Wavefront, we are in the midst of a rapid shift in how we can understand and manipulate IT metrics. Gone are the days when the only thing of concern is whether a machine or an application is up and running. Gone, too, are the days when businesses can discover and react to a crisis days or weeks after an issue starts to crop up because reports were run, at most, on a daily basis. IT organizations of the future will require rapid insights into the complete lifecycle of user engagement, from how a user interacts with an application on their phones, to the back-end systems that are processing the user’s order and ensuring that all business processes are orchestrated properly. However, how can a product owner, an engineer, an ops person, a salesperson, or an analyst look at, manipulate, understand, and rightly interpret the diverse set of metrics emitted from an even more diverse set of abstractions (applications, containers, virtual machines, hardware, interconnects, etc.) when they are often siloed across specialized systems, encoded in vendor-specific formats, or sadly, not even collected in the first place?
One System to Rule All Metrics
Ian Malpass once said that at Etsy, “if it moves, we track it” [2]. He also said that metrics from every level, “network, machine, and application,” must be measured, even the most mundane one, “in case it decides to make a run for it.” In recent years, many open source efforts have made great strides in conforming the data coming from the various abstractions of a modern application into machine-understandable, annotated time series data. Never before in the history of computing has virtually every programming language had at least one library for tracking and emitting telemetry data.
At Wavefront, we have been using home-grown network/machine monitoring agents, as well as collectd and AWS CloudWatch, to gain insights into the machines and the networking fabric of our clusters. As a Java shop, we are also big fans of Dropwizard Metrics and HDRHistogram, which allow us to emit telemetry data from every measurable aspect of our application into Wavefront. We have recently published a Dropwizard to Wavefront library (https://github.com/wavefrontHQ/java/tree/master/dropwizard-metrics/3.1) that allows anyone who is using Dropwizard Metrics to have their data exported to Wavefront in a similar way (with early support for Dropwizard Metrics 4.0).
From our combined experience of running complex distributed applications across many machines in past careers, we have come to appreciate how powerful and necessary it is to collect, examine, and understand application metrics so that the vitals of an application, and of the business, can be monitored and alerted upon. Wavefront is the platform by which all telemetry data, whether machine telemetry, AWS CloudWatch data, external monitoring probes, or application metrics, is collected, stored, analyzed, visualized, manipulated, and alerted upon (we call it “Wavefront on Wavefront”).
What We Track Ourselves
At Wavefront, we track metrics that are either counters, gauges, or histograms within our code. Counters are monotonically increasing longs that are reset only when a JVM is restarted. The Wavefront language includes a reset-sensitive first-derivative function that allows a counter to be converted to a rate. Gauges are, as the name suggests, numerical measurements that can be considered a sampling of an output at a given time. Gauges are useful for tracking the number of workers active in a thread pool, the size of caches, the amount of memory used by a particular generation of the heap, or the number of open file descriptors. Histograms are used to track the distribution of values. While Dropwizard Metrics offers a single-node estimation of submitted samples and allows for the tracking of percentiles and statistical summaries, we are also experimenting with HDRHistogram, combining collected histograms across processes to obtain bucketed histograms that allow us to compute true p99 latencies for any span of code or processing time.
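To make the idea of a reset-sensitive first derivative concrete, here is a minimal Python sketch. It is not Wavefront's implementation; the function name `counter_to_rate` and the sample data are hypothetical, and the reset rule (a drop in value means the counter restarted from zero) is an assumption that reflects one common convention.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, counter value)


def counter_to_rate(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Convert monotonically increasing counter samples into per-second rates.

    Assumption: a drop in the counter value means the process restarted and
    the counter began again at zero, so the post-reset value is used as the
    increase instead of producing a negative rate.
    """
    rates = []
    for (t_prev, v_prev), (t_cur, v_cur) in zip(samples, samples[1:]):
        dt = t_cur - t_prev
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        delta = v_cur - v_prev
        if delta < 0:
            delta = v_cur  # counter reset: count only what accrued since restart
        rates.append((t_cur, delta / dt))
    return rates


# Hypothetical samples: the counter resets between t=20 and t=30.
samples = [(0, 100), (10, 160), (20, 220), (30, 30), (40, 90)]
rates = counter_to_rate(samples)
```

A naive derivative would report a large negative rate at the reset; the reset-aware version reports a small positive one instead, which keeps charts and alerts from reacting to a routine restart.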
Distilling the Vitals of Our Application and Business
Perhaps as a corollary to “if it moves, we track it”, at Wavefront, “if it has meaning, we track it.” This allows us to pick out the critical code paths that the different services in our architecture run through billions of times a day and display them on dashboards, measuring the rate, the latencies, the variance of the same measurement across machines, comparisons against data from the day before, any number of statistical moving functions, etc.
At Wavefront, the two top-level vitals are:
1. The Rate of Telemetry Ingestion
2. The Latency of Query Streams
Such vital signs are often the first telltale sign that a deployment has gone awry and should be rolled back. For instance, in our ingestion pipeline, we try our very best to never lose a point™ after our proxy agent receives it in our customers’ own data centers. With a robust pipeline of local memory buffering, on-disk buffering, queue infrastructure, etc., we delight in seeing a stable and flat rate of ingestion, which tells us that metrics are lining up, humming across the wire, and ultimately landing in durable, replicated storage (encrypted, of course).
Among the vitals of the ingestion pipeline are the mechanisms we employ to keep from overloading any component at any given point. We call this flow control, or pushback. As we have discovered, the aggregate pushback rate of our edge collectors is an important vital sign that lets us quickly gauge the health of our application. We know that nobody likes delayed metrics, gaps in live charts, or misfired alerts due to missing data. When a deployment happens, all eyes are on the vitals for that cluster, and at the first sign of trouble, the rollback portion of the deployment runbook is executed.
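As a rough illustration of pushback as a vital sign, the Python sketch below models a collector with a bounded buffer that refuses points when full and counts each refusal. This post does not describe how Wavefront's collectors actually implement flow control; the `EdgeCollector` class, its capacity, and the `aggregate_pushback_ratio` helper are all hypothetical.

```python
from collections import deque


class EdgeCollector:
    """Hypothetical collector with a bounded in-memory buffer."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.buffer = deque()
        self.accepted = 0   # counter: points taken into the buffer
        self.pushbacks = 0  # counter: points refused because the buffer was full

    def offer(self, point) -> bool:
        """Try to enqueue a point; return False to signal pushback."""
        if len(self.buffer) >= self.capacity:
            self.pushbacks += 1
            return False  # the sender is expected to hold the point and retry
        self.buffer.append(point)
        self.accepted += 1
        return True

    def drain(self, n: int) -> list:
        """Simulate downstream consumption of up to n points."""
        count = min(n, len(self.buffer))
        return [self.buffer.popleft() for _ in range(count)]


def aggregate_pushback_ratio(collectors) -> float:
    """Fraction of all offered points that were pushed back, across collectors."""
    pushed = sum(c.pushbacks for c in collectors)
    offered = pushed + sum(c.accepted for c in collectors)
    return pushed / offered if offered else 0.0


collectors = [EdgeCollector(capacity=3), EdgeCollector(capacity=3)]
for i in range(5):
    collectors[0].offer(i)  # buffer fills after three points
for i in range(2):
    collectors[1].offer(i)  # stays under capacity
ratio = aggregate_pushback_ratio(collectors)
```

In a real system, `accepted` and `pushbacks` would be emitted as counters and converted to rates on the charting side; a sustained rise in the aggregate would suggest that something downstream is not keeping up.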
By feeding data from monitoring systems such as Pingdom, ThousandEyes, Runscope, AWS Route53, and AWS CloudWatch, as well as our own deployed agents, into Wavefront, we have complete active and passive monitoring of our system. This coverage is thorough enough that machine metrics, network packet losses, or even the quintessentials such as machine-down or service-down alerts are not so much a concern as long as we have insight into the health of the application and the user experience. Given the fluid nature of modern deployments and the proliferation of IaaS, it is perhaps more important that machine failures and software crashes be put in the proper context of their impact upon application or business vitals, so that every person in the organization can understand what’s at risk and prioritize work accordingly.
Finding Root Causes and Bottlenecks
Because we collect meaningful metrics from our stack, finding root causes and locating bottlenecks is a breeze thanks to the lightning-fast query engine within Wavefront. With the power bestowed by a language that is inspired by similar systems at Google and Twitter, correlations are examined across hundreds of thousands of series in seconds, and hypotheses are posed and confirmed instantly during troubleshooting, or, as I like to call it, when one “makes the rounds” with the clusters, trying to understand why a certain vital for one of them is changing in a specific way. The social nature of charts and dashboards also allows for real-time collaboration between oncall and developers in case a particular concern cannot be adequately addressed by oncall looking at the available dashboards alone. Such information is often shared, curated, and written up, and new dashboards, runbooks, and alerts are set up so that we can better understand the context of what is going on in the future. We are especially fond of the un-paged INFO/SMOKE alerts that generate source-specific alerts on relevant charts, so that root causes that occur often enough are immediately annotated on related charts. The more the team shares that understanding by means of auto-generated annotations, the easier troubleshooting becomes.
Bonus Superpowers
Perhaps one unintended consequence of having the power to emit any and all telemetry data, and a query engine that can execute complex calculations and transformations, is the ability to actually model algorithms in Wavefront (and ask “what-ifs” without having to deploy any code). As a personal anecdote, I remember having to develop the heuristics for a particular algorithm used by a component that I shall refer to by its codename, BCS. It happens to be a piece of code that required me to tune the logic behind how fast data flows in and out of a distributed worker pool, with each worker responsible for a region of data. With a previous version of the code running in production and a naive implementation of the logic in hand, I was able to model “what-if” scenarios by combining the collected stats and generating synthetic time series data (a powerful aspect of the Wavefront query language) to arrive at an equation that would “empirically” produce the intended behavior. In the past, such endeavors might have involved many rounds of setting up test clusters, generating similar loads against such clusters, exposing all relevant knobs in configuration files, and testing and adjusting them individually (often realizing that a code change is required along the way).
At Wavefront, we believe it adds an entirely new dimension to improving the performance of applications when developers can interrogate production telemetry data and discover new ways to tackle problems, identify hotspots, and even virtually test hypotheses and algorithms, all within the most powerful and scalable analytics platform available in the market today.
Conclusion
Whether you are coming from the world of traditional monitoring and looking for something more powerful and flexible than the tools of the past, or embracing the “metrics-first” movement and looking for a home for the metrics that you always wanted to have collected and analyzed, Wavefront is the most powerful and scalable real-time unified analytics platform that can fulfill your needs today. Contact us to try it out!