For any given service, whether provided by software (e.g., application, platform), hardware (e.g., infrastructure), or humans (e.g., delivery, support, documentation), there is a level of reliability required to achieve user satisfaction.
While users, from end users of web or mobile applications to developers who use a platform, want to use a service's features, they care more about the service's reliability. After all, if the service is not working, they cannot make use of those features. Indeed, companies with unreliable systems suffer the consequences. For that reason, we consider reliability to be a core product feature.
The journey toward embracing reliability as a feature and achieving reliability targets requires more than simply scaling aspects of a service up or out. It begins by establishing meaningful and reasonable objectives, then adopting the tools and techniques to achieve them. It begins, in other words, with service-level objectives and service-level indicators.
Which service level?
Service-level objectives (SLOs) and service-level indicators (SLIs), a set of practices for applying a product mindset and an economic model to service operations, are the foundation of reliability engineering.
An SLO is a threshold: a quantifiable target for a system’s behavior and the answer to the question, “How reliable do I want my service to be?” We can consider SLOs to be representative of user expectations.
An SLI is a metric: a measure of a system’s existing behavior and the answer to the question, “How is my service performing at this point in time?” We can consider SLIs to be representative of user experience.
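To make the distinction concrete, here is a minimal sketch in Python. It assumes an availability SLI defined as the fraction of successful requests, which is only one of many possible indicators; the names RequestCounts, availability_sli, and meets_slo, the 99.9 percent target, and the sample counts are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical request totals observed over some measurement window."""
    good: int   # requests that succeeded
    total: int  # all requests received


def availability_sli(counts: RequestCounts) -> float:
    """The SLI: how the service is actually performing (fraction of good requests)."""
    if counts.total == 0:
        # Assumption: a window with no traffic is treated as fully available.
        return 1.0
    return counts.good / counts.total


# The SLO: how reliable we want the service to be (illustrative target).
SLO_TARGET = 0.999


def meets_slo(sli: float, target: float = SLO_TARGET) -> bool:
    """Compare the measured SLI against the SLO threshold."""
    return sli >= target


if __name__ == "__main__":
    counts = RequestCounts(good=99_950, total=100_000)
    sli = availability_sli(counts)
    print(f"SLI = {sli:.4%}, meets SLO: {meets_slo(sli)}")
```

The SLI is computed from observations, while the SLO is a constant the team chooses; keeping them separate in code mirrors the experience-versus-expectation split described above.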
So what about SLAs?
Service-level agreements (SLAs) are agreements between a service provider and a consumer that are contractual and binding. When SLAs are not met, financial or legal consequences apply. The key distinction between SLAs and SLOs is that agreements suggest retroactive penalties whereas objectives suggest proactive behavior (e.g., “Let’s fix this before our users get upset”).
SLAs often get confused with SLOs and SLIs. There are some great resources to help disambiguate the terms, but here’s another attempt, courtesy of my colleague Aram Price:
Suppose you’ve been on holiday and are heading to the rental agency to return the car you’ve been using on the trip. You remember that you signed a rental contract stipulating that you’ll return the car with the fuel tank at least three-quarters full or be charged a penalty.
Think of that contract as the SLA you have with the agency and the needle on the fuel gauge the SLI. Finally—and this illustrates something important—your SLO is that three-quarters of a tank. Any less, and the penalty kicks in; any more, and you’re simply wasting your money.
Why a service level?
We’ve established that reliability must be a core feature of our service. That means any change to the service introduces the possibility that reliability will be adversely impacted. We need a way to balance change with reliability. SLOs provide that way.
The cost of achieving greater reliability grows exponentially. Additionally, for almost any given service, as reliability approaches 100 percent, the likelihood that anybody notices the improvement drops substantially. The likelihood that anybody cares drops even faster.
Like the three-quarters-of-a-tank fuel target, there is a point at which any additional investment on our part does not result in user benefit or business value. Meanwhile, the investment we’ve made in order to achieve greater (yet unnoticed) reliability is no longer available for other work. So there’s an opportunity cost. It’s a bad investment.
An SLO should be considered as both an upper and a lower bound.
How to apply a service level
We tend to refer to SLOs in terms of “number of nines.” A target of 99.9 percent uptime can be referred to as “three nines.” When applied to a window of time, a reliability SLO translates to what's known as an “error budget,” that is, an allowable amount of downtime or rate of failure. For example, an SLO of “three nines over a rolling 28-day window” means we have an error budget of approximately 40 minutes (0.1 percent of the number of minutes in 28 days).
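The arithmetic behind that figure can be sketched in a few lines of Python. This is an illustration only: the function name error_budget is hypothetical, and it assumes the budget is expressed purely as allowable downtime over the window.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowable downtime for an SLO target (e.g., 0.999) over a time window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    # The budget is the fraction of the window we are allowed to be unavailable.
    return window * (1.0 - slo_target)


if __name__ == "__main__":
    window = timedelta(days=28)
    for label, target in [("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)]:
        minutes = error_budget(target, window).total_seconds() / 60
        print(f"{label} ({target:.2%}): about {minutes:.1f} minutes per 28 days")
```

For three nines over 28 days, this works out to 0.1 percent of 40,320 minutes, or roughly 40.3 minutes.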
As long as we’re achieving our objective, we have remaining budget, so we move fast and invest in new product/service capabilities. It’s only when we’re approaching the SLO threshold, and our error budget approaches zero, that we slow down and redirect resources toward greater reliability.
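One way to picture this trade-off is as a simple decision rule. The sketch below is an assumption-laden illustration, not a prescribed policy: the function name release_posture, the 20 percent caution threshold, and the wording of each posture are all hypothetical, and a real team would agree on its own thresholds and actions.

```python
def release_posture(
    budget_minutes: float,
    consumed_minutes: float,
    caution_threshold: float = 0.2,  # assumption: slow down below 20% remaining
) -> str:
    """Suggest how to balance feature work and reliability work from the error budget."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    remaining_fraction = (budget_minutes - consumed_minutes) / budget_minutes
    if remaining_fraction <= 0:
        return "budget exhausted: redirect effort toward reliability"
    if remaining_fraction < caution_threshold:
        return "budget nearly spent: slow down changes"
    return "budget available: keep shipping"


if __name__ == "__main__":
    budget = 40.32  # minutes, three nines over a rolling 28-day window
    for consumed in (10.0, 35.0, 45.0):
        print(f"{consumed:>5.1f} min consumed -> {release_posture(budget, consumed)}")
```

The value of writing the rule down, in code or in a document, is that the decision to slow down is agreed on in advance rather than argued about during an incident.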
Now that we understand some of the key reliability engineering vernacular and why we should care, how do we begin to put this knowledge to use? A great way to get started is with an SLI/SLO workshop with your team and stakeholders. A workshop can be used to introduce these topics in detail, generate your own SLI and SLO definitions, and establish ways in which the team can use error budgets and SLO miss policies to balance development and operational velocity with reliability.
Interested? VMware facilitates SLI/SLO workshops with our customers to help them get started. If you’re ready to get started on your journey with the help of experts who are always within reach, read more about the offerings in our Tanzu portfolio and reach out to a sales representative today.