This is the first post in a series about monitoring distributed systems. We introduce several important concepts for readers who may be newer to the topic.
Modern software engineering teams seem to have a language of their own. Two terms folks have been using a lot lately? Service Level Indicators (SLIs) and Error Budgets.
We use these terms a lot within Pivotal, and with our customers. Based on dozens of conversations, we thought it might be useful to put together a primer of sorts. We wanted to define what these terms really mean when we talk about platform observability and management goals.
Terminology
Let’s start with some definitions.
KPI (Key Performance Indicator). A given metric, usually in the form of a counter or gauge value. The metric helps convey the health/status, performance, or usage of a given component or a set of related components.
SLI (Service Level Indicator). A “derived result measurement” from a purposeful validation test. An SLI has the goal of confirming that a specific, high-value user workflow is both available and acceptably performant for your end-users. You can think of an SLI as a measurement of your users’ expectations. For example, the end-users of my API would expect it to be available and return the requested response within 10 seconds. If my API fails to respond (or responds more slowly than that), my end-users will be unhappy with my API service.
Measurements of user expectations should be written in plain, easily-understood language. SLIs should be agreed upon by the entire team. Further, each SLI should be backed by code that can measure it programmatically.
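To make that concrete, here is a minimal sketch of what coding an SLI check might look like in Python, using the API example above. The endpoint URL and the 10-second threshold are illustrative assumptions, not a prescribed implementation.

```python
import time
import urllib.request

# Hypothetical SLI probe. The endpoint URL and the 10-second latency
# threshold are illustrative assumptions.
SLI_ENDPOINT = "https://api.example.com/orders"
LATENCY_THRESHOLD_SECONDS = 10


def run_sli_check(url: str = SLI_ENDPOINT) -> bool:
    """Return True if the API responds successfully within the latency threshold."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=LATENCY_THRESHOLD_SECONDS) as response:
            ok = 200 <= response.status < 300
    except Exception:
        # Timeouts, connection errors, and non-2xx responses all count as failures.
        ok = False
    elapsed = time.monotonic() - start
    return ok and elapsed <= LATENCY_THRESHOLD_SECONDS


if __name__ == "__main__":
    print("SLI check passed" if run_sli_check() else "SLI check failed")
```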
SLO (Service Level Objective). A threshold you establish for your defined SLI, i.e. the percentage of your SLI testing that must pass for your users to be generally satisfied with your service. When defining your SLO target, it helps to think about what happens when your service doesn’t meet its defined SLI. If your SLI measures an internal-facing business enablement service, where brief outages may be more acceptable, you can potentially choose a lower SLO target than you would for a service used in a popular customer-facing application.
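As a sketch, evaluating an SLO can be as simple as comparing the pass rate of your SLI checks over a window against the target. The 99.9% target and the sample data below are assumptions for illustration.

```python
# Sketch: decide whether a window of SLI check results meets an SLO target.
# The 99.9% target and the sample results are illustrative assumptions.
def slo_met(check_results: list[bool], slo_target: float = 0.999) -> bool:
    """Return True if the fraction of passing SLI checks meets or exceeds the target."""
    if not check_results:
        return True  # No data yet; treating this as "met" is a policy choice.
    pass_rate = sum(check_results) / len(check_results)
    return pass_rate >= slo_target


# Example: 9,990 passes out of 10,000 checks is exactly 99.9%.
results = [True] * 9_990 + [False] * 10
print(slo_met(results))  # True
```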
Error Budget. Directly related to your target SLO percentage. Your Error Budget represents the quantified amount of downtime, or lowered performance levels, that you are willing to allow within a rolling 30-day window. Each time your SLI check fails, you are consuming some of your allowed error budget. (See below for a reference chart showing how SLO Target Percentage maps to an Error Budget value.)
By now, you might wonder how all these terms relate to each other. Let’s examine how a platform engineer may use these terms in the real world.
Example: An SLI for Pivotal Application Service (PAS), part of PCF
This is a Service Level Indicator used by an internal PCF Ops team.
SLI: As an App Developer, I expect to be able to successfully CF Push my app within 2 minutes.
| Availability Measure | SLO Target | Error Budget per 30 days | Unit of Measure |
|---|---|---|---|
| CF Push Availability | 99.9% | 43.2 minutes | Success/Fail metric from PCF Healthwatch |
- If availability is *above* our SLO target – yay! We release new features.
- If availability is *below* our SLO target – we halt releases and focus on reliability.
Deploying new code, or an environment configuration change, into production always carries some inherent risk, no matter how well the changes have been tested. As long as we are within our SLO target, we have created the space to accept some level of downtime risk in exchange for the benefits of deploying desired features.
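For illustration only, a hypothetical version of the CF Push check might time a `cf push` against the 2-minute expectation, along the lines of the sketch below. This is not how PCF Healthwatch implements its metric; the app name, and the assumption of an already-authenticated cf CLI targeting the right org and space, are ours.

```python
import subprocess
import time

# Hypothetical illustration of the CF Push SLI check: time a `cf push` and
# mark pass/fail against the 2-minute expectation. This is NOT how PCF
# Healthwatch produces its metric; the app name is an assumption, and a
# logged-in cf CLI is assumed.
PUSH_DEADLINE_SECONDS = 120


def cf_push_sli_check(app_name: str = "sli-canary-app") -> bool:
    start = time.monotonic()
    result = subprocess.run(["cf", "push", app_name], capture_output=True)
    elapsed = time.monotonic() - start
    return result.returncode == 0 and elapsed <= PUSH_DEADLINE_SECONDS


if __name__ == "__main__":
    print("pass" if cf_push_sli_check() else "fail")
```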
SLIs or KPIs? It’s Really SLIs *AND* KPIs
SLI monitoring is a more meaningful measure of user impact. That does not mean, however, that you should stop monitoring the critical metrics your system emits about its own performance. Key performance indicators (KPIs) are important for deeper troubleshooting. They are also useful indicators for things like the need to increase resources for a given component.
Think about the difference between KPIs and SLIs this way:
KPIs often change as the system changes. If your underlying system architecture changes, expect the KPIs of high operational value to change along with those components.
To wit: Pivotal Cloud Foundry. We modify and improve the platform every quarter. As a result, the KPIs that our customers need to care about are slightly different in each version. So we update the PCF Healthwatch product on the same cadence. The most relevant KPIs are always visible to platform engineers in the dashboard.
SLIs should not change if user needs are the same. As a representation of user value, and not the underlying technology, SLIs should be architecture agnostic. Assuming your system purpose stays the same, an existing SLI should remain valid through any underlying re-architecture.
Why SLI is the Preferred First-Level Monitoring Option
Focusing on SLI monitoring allows you to reduce the overall amount of monitoring work for your system.
Let’s consider another example. If our Pivotal Ops team is paged because the CF Push SLI test is repeatedly failing, our SLO goals are at risk. A page with that level of detail is instantly more meaningful: the on-call engineer immediately understands the end-user impact. If I’m on call, I don’t want to be paged in the middle of the night for some noticeable latency spikes in component metrics. But I *do* want to be paged if my Application Uptime SLI (via Canary testing) starts failing.
One continuously-run, user-functionality measurement test can tell you much more about the underlying performance of your system than monitoring and alerting on dozens of metrics in isolation.
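As a sketch of that idea, a single canary loop can wrap any user-facing check and page only on sustained failure. The check interval and the consecutive-failure threshold below are illustrative assumptions, not recommendations.

```python
import time

# Illustrative canary loop: wraps any user-facing SLI check (a callable that
# returns True/False) and pages only on sustained failure. The interval and
# consecutive-failure threshold are assumptions, not recommendations.
def page_on_call(failure_count: int) -> None:
    # Placeholder for a real paging integration.
    print(f"PAGE: SLI check has failed {failure_count} times in a row")


def canary_loop(check, interval_seconds: int = 60, failures_before_page: int = 3) -> None:
    consecutive_failures = 0
    while True:
        if check():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failures_before_page:
                page_on_call(consecutive_failures)
        time.sleep(interval_seconds)
```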
Choosing an Error Budget: It’s a Balance
What should your Error Budget be? To figure that out, we need a Target SLO first. A Target SLO, often referred to in terms like “three nines” or “five nines”, maps directly to an error budget in a given time window (see the reference chart below).
At Pivotal, we measured our own systems for SLO adherence for many months. We have found it significantly more meaningful to talk in terms of the Error Budget left/spent. In our experience, humans can more quickly understand the impact of “time consumed” and “time remaining” instead of adherence to a displayed percentage number. We have subsequently shifted our own internal communications to talk more about how much error budget consumption is acceptable to “trade” for deploying a big or risky change.
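Here is a sketch of that framing: translating a count of failed checks into budget spent and budget remaining, assuming SLI checks run once per minute so each failure consumes roughly one minute of budget. Both the check cadence and the 99.9% target are assumptions for illustration.

```python
# Sketch: express an error budget as minutes spent and minutes remaining over
# a rolling 30-day window. Assumes SLI checks run once per minute, so each
# failed check consumes roughly one minute of budget.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day window


def budget_report(failed_checks: int, slo_target: float = 0.999) -> str:
    budget_minutes = (1 - slo_target) * WINDOW_MINUTES
    remaining = budget_minutes - failed_checks
    return (f"budget: {budget_minutes:.1f} min, "
            f"spent: {failed_checks} min, remaining: {remaining:.1f} min")


print(budget_report(failed_checks=10))
# budget: 43.2 min, spent: 10 min, remaining: 33.2 min
```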
It may be tempting to slow the pace of change: if I don’t deploy, I don’t risk my error budget. This must be avoided, as it quickly leads to stagnation of the platform.
Teams need to be able to deploy new features into the system; security patches are critically necessary. Any sort of update introduces the possibility of instability. Use error budget conversations to set realistic expectations. You should publish these measures and targets freely within your organization. For public apps, consider posting your goals externally for your customers. Without these, users often default to an unrealistic expectation of 100% system reliability.
By establishing shared SLO Targets & Error Budgets as an organization, you put the focus on the right balance between innovation and reliability. And you create a shared language for prioritized investments.
This shared language helps in another scenario. If you’re frequently violating your SLO and consuming your agreed error budget, your teams can have a meaningful discussion about the need for additional investments. A common outcome is to expand efforts that increase resiliency and performance. Or perhaps your team will consider the feasibility of lowering the agreed-upon reliability objectives, which would make more space for riskier, but necessary, innovation work.
Selecting a Target SLO
| Target SLO | Allowable Downtime (per 30 days) | Likely Requires |
|---|---|---|
| 99.999% (5 nines) | 0.43 minutes | Automated Failover |
| 99.99% (4 nines) | 4.32 minutes | Automated Rollback |
| 99.95% (3.5 nines) | 21.6 minutes | |
| 99.9% (3 nines) | 43.2 minutes | Comprehensive monitoring and on-call system in place |
| 99.5% (2.5 nines) | 216 minutes (~3.5 hours) | |
| 99% (2 nines) | 432 minutes (~7 hours) | |
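The downtime values in the chart follow directly from the length of the window (a 30-day window has 43,200 minutes), as this quick sketch shows:

```python
# Reproduce the "Allowable Downtime" column: (1 - SLO) x 43,200 minutes.
WINDOW_MINUTES = 30 * 24 * 60  # minutes in a 30-day window

for slo in (0.99999, 0.9999, 0.9995, 0.999, 0.995, 0.99):
    downtime = (1 - slo) * WINDOW_MINUTES
    print(f"{slo:.3%} -> {downtime:.2f} minutes of allowable downtime")
```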