
Thinking in Error Budgets: How Pivotal’s Cloud Ops Team Used Service Level Objectives and Other Modern SRE Practices to Improve Outcomes

This is the second post in a series about monitoring distributed systems. We discuss a pragmatic example of how Pivotal Cloud Ops leverages the concepts of Service Level Objectives and Error Budgets to make management decisions about PCF. These concepts were introduced in a prior post.

This post was co-written by Amber Alston and Deborah Wood.

Pivotal Cloud Foundry (PCF) is a proven solution for companies that desire improved developer productivity and operational efficiency. Pivotal provides the platform and also partners with our customers to develop and refine best practices for running cloud platforms at scale. 

These practices are battle-tested by Pivotal’s own internal PCF operators. These engineers monitor and manage the large-scale PCF deployments that run the applications and services our globally-distributed employees rely on.

Operationalizing PCF at Pivotal

If you had paired with our Pivotal Cloud Operations (“Cloud Ops”) teams a few years ago, the scene would have looked familiar.

You’d find a monitoring dashboard with a high volume of graphs, populated by the metrics emitted from the platform. While useful, the amount of data could be overwhelming. Our monitoring tooling was littered with dozens of abandoned dashboards.

Over time, the responsibilities of the Cloud Ops team continued to grow. Consequently, we had to make operational decision-making more efficient. “Is this spike on the graph important enough for me to page someone? Should I hold this release today, since we had some downtime for end-users yesterday?” An enterprise operations team faces dozens of questions like these every day.

The Pivotal Cloud Ops team turned to the core principles of Site Reliability Engineering (SRE) to help modernize its thinking. The results so far are great for users, who include customers pushing their applications to Pivotal Web Services and Pivotal employees running code on internally hosted PCF foundations. It’s also helped to decrease the team’s stress during reactive incidents. These simple principles have made the most impact on the team:

  • Prioritize Service Level Indicators over lower-value metrics. When paged for a significant event, the team has a clearer understanding of the end-user impact and its severity.

  • Establish, measure and publish Service Level Objectives. Visibility into performance against these objectives facilitates shared conversations within Pivotal around balancing investments across innovation and reliability.

  • Inform deployment decisions using Error Budgets. The team is more comfortable deploying change, knowing that any downtime risk remains within an acceptable level of downtime at which end-users will still be satisfied with the overall platform performance.

  • Agree to a clear “contract” with platform tenants. App developers who deploy code to the platform (i.e. tenants) have a clear understanding of availability guarantees. They always know when, and for how long, the platform has been unavailable. This enables tenants to make appropriate architecture decisions for their apps, given their respective reliability targets.

All of these principles sound attractive. But how do they work in practice? Let’s take a closer look at how these principles are applied at Pivotal.

What follows is a case study from Cloud Ops, a team that manages one of our most important tools.

Let’s Set Some Context: Pivotal Tracker on PCF

Pivotal Tracker is a Software-as-a-Service project management application. Modern software teams use it to collaborate and deliver continuously. Customers create an average of 70,000 stories and comments a day, and the product serves an average of 400 requests per second. Pivotal Tracker is also the go-to tool for Pivotal’s globally-distributed employees. We depend on this tool too!

A year ago, Pivotal Tracker was migrated from AWS to PCF. This shift moved the operational burden off the Tracker Application Development Team.

Today, Pivotal Tracker’s production environments are distributed across two PCF installations on Google Cloud Platform (GCP). This production infrastructure is operated by two Pivotal teams, one in Dublin, Ireland, and the other in Denver, Colorado.

The Dublin team, Cloud Ops, ensures the health and availability of the PCF installations that host Pivotal Tracker. They also look after the health of supporting backing services, such as Redis for PCF.

This team is composed of a product manager and 4 to 6 platform engineers. Responsibilities include daily back-ups, stemcell patching, and performing upgrades. Of course, everyone on the team “holds the pager” to triage incidents.

The Denver-based team, typically a pair, has operational responsibility for managing the BOSH releases of the specialized services required by Pivotal Tracker, such as memcached, SOLR and tiny-proxy. The pair is co-located with the Pivotal Tracker development team, so they also handle support activities for application issues.

That’s the background. We’ve got a business-critical system operated by two teams, separated by several time zones. What does it all mean? It means it’s imperative that each team – and the larger business – agree on a shared definition of reliability for Pivotal Tracker, and the underlying PCF platform.

Next, let’s talk about our definition of “reliable.”

Service Level Objectives & Error Budgets for Tracker on PCF

Pivotal Tracker does not have an explicit Service Level Agreement. But as part of our Cloud Ops principles for operationalizing PCF, the teams came together to define shared Service Level Objectives for the product’s most important workflows.

The Cloud Ops team wanted the PCF platform to appropriately support two constituencies: the developers at Pivotal who build new features for Tracker, and Tracker’s worldwide user base.

We have the following Service Level Indicators for the Tracker production environment:

  • Tracker App Availability SLI: Whether or not a known production URL of Pivotal Tracker is serving the expected content. The content should be returned within an acceptable, pre-defined time (a minimal probe sketch for this SLI follows the list).

  • Canary App SLI: Whether or not the Apps Manager application is being successfully served by the PCF Platform in production. Responses should be returned within an acceptable, pre-defined time. Because Apps Manager is bundled in the PCF install, a failure here makes it highly likely that other applications are impaired.

  • CF Push SLI: Whether or not it is possible to successfully log in to the CF API, and execute a CF Push command, within the expected 2-minute window. The SLI measures whether app developers, such as those developing new Pivotal Tracker features, can push updates to their applications in the production environment.
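
The post does not include the probe implementations themselves. As a rough illustration of the first SLI, here is a minimal sketch of what such a check might look like; the URL, expected content, and latency limit below are placeholders, not the values Cloud Ops actually uses.

    import time
    import urllib.request

    # Placeholder endpoint and thresholds; the real probe targets are not published.
    TRACKER_URL = "https://tracker.example.com/health"
    EXPECTED_CONTENT = b"Pivotal Tracker"
    LATENCY_LIMIT_SECONDS = 5.0

    def tracker_availability_sli() -> bool:
        """Return True if the probe URL serves the expected content in time."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(TRACKER_URL, timeout=LATENCY_LIMIT_SECONDS) as response:
                body = response.read()
        except Exception:
            return False  # Any error or timeout counts as a failed measurement.
        elapsed = time.monotonic() - start
        return EXPECTED_CONTENT in body and elapsed <= LATENCY_LIMIT_SECONDS

Each run of a probe like this contributes one pass/fail data point to the rolling window described below.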

James, a CloudOps engineer, wearing the train driver hat. Note PCF Healthwatch on wall.

The Cloud Ops team relies on Google Cloud Load Balancer logs and PCF Healthwatch to execute these measurements and emit the results. The team currently aggregates this data externally, using a script to calculate the rolling 30-day performance value, and displays that value via an application UI visible within their work area.
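
The aggregation script itself isn’t shown in the post, but the calculation it performs is straightforward. A minimal sketch, assuming each probe result is stored as a (timestamp, pass/fail) pair:

    from datetime import datetime, timedelta
    from typing import Iterable, Tuple

    WINDOW = timedelta(days=30)

    def rolling_sli_percentage(results: Iterable[Tuple[datetime, bool]],
                               now: datetime) -> float:
        """Percentage of probe results in the trailing 30-day window that passed."""
        recent = [ok for ts, ok in results if now - ts <= WINDOW]
        if not recent:
            return 100.0  # No data yet: treat the SLI as meeting its objective.
        return 100.0 * sum(recent) / len(recent)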

The internal Service Level Objective (SLO) for the first two SLIs is currently targeted at “Three Nines”, or 99.9%. That means the SLI must be true 99.9% of the time over a rolling 30-day window. The last SLI has an SLO of “Two Nines”, meaning the SLI must be true 99% of the time over a rolling 30-day window.

In SRE methodology, an SLO directly translates into an Error Budget. By defining a 99.9% target, we have determined that it is acceptable to not meet the SLI 0.1% of the time, or approximately 43.2 minutes over a 30-day rolling window, and our users will still be satisfied.
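
The 43.2-minute figure falls out of simple arithmetic: 0.1% of 30 days. A small sketch, not taken from the post, that converts an SLO target into allowed downtime:

    def error_budget_minutes(slo: float, window_days: int = 30) -> float:
        """Allowed downtime, in minutes, for a given SLO over the rolling window."""
        return (1.0 - slo) * window_days * 24 * 60

    print(error_budget_minutes(0.999))  # ~43.2 minutes for a "Three Nines" SLO
    print(error_budget_minutes(0.99))   # ~432 minutes for a "Two Nines" SLO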

A good real-world analogy for error budgets and user satisfaction impacts is to think about a well-run train system. Trains are expected to be “available”, or on-time, all of the time, but it is unrealistic that they can be 100% reliable at all times. There are other factors – some unpredictable (weather, congestion, staff issues) and some predictable (purposeful track or train maintenance) – that may cause delays, leading to an overall on-time percentage calculation. By understanding the needs of the ridership, operations understands that the trains can be less than 100% on-time. If they are on-time 99% of the time, train riders will generally be satisfied with the train service provided. If some of that 1% was spent on purposeful maintenance, like cleaning and updating the trains, riders will be even more satisfied!

The SRE principles also state that, like a fiscal budget, an operational error budget is intended to be spent. When the Cloud Ops team knows they have consumed 20 minutes of outage so far in this window, they also recognize that they still have 23 minutes of downtime left. This time can be spent deploying new features and/or desired security patches, while still being able to deliver on the promised reliability of the overall platform service.

Balancing Speed and Stability: Gating on Error Budget

By leveraging Service Level Indicators, and co-defining target Service Level Objectives, the Pivotal Tracker and Cloud Ops teams have created a shared language to evaluate and discuss the stability and performance of their applications on PCF.

In this section, we’ll continue to explore how the Cloud Ops team pragmatically applies these concepts to manage how change is rolled out to the platform.

On Mondays, the Cloud Ops team meets to plan their work for the week. Engineers also review recent performance data, and the status of the three Pivotal Tracker SLOs.

A typical workday may include planning the deployment of new products into production, deploying upgrades to existing products, and making necessary configuration changes to suit evolving use cases of the platform. All three types of updates are highly automated. (If you’d like to learn how Cloud Ops mitigated the risk from Spectre and Meltdown, check out this blog post.)

Team standup board, viewed every morning, showing the Canary App and CF CLI SLIs for the last 30 days.

The possibility of change in production systems, even when automated, is always gated by the amount of error budget remaining (over the rolling 30-day window) for each established SLI.

That’s why the team spends part of the weekly meeting looking at any spikes in the error budget burndown, ensuring the root cause of any outage is understood and, if it isn’t, identifying the follow-up action items that are necessary.

Tracker SLI error budget burn down in minutes – SLO is red line above

The team then considers how much error budget is currently left, and decides whether or not to defer some, or all, deployments until the error budget has sufficiently recovered.
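
The post doesn’t spell out the gating logic in code, but the decision reduces to comparing the remaining budget against the risk of the planned change. A minimal sketch with hypothetical inputs:

    def deployments_allowed(budget_minutes: float,
                            downtime_minutes_spent: float,
                            planned_downtime_minutes: float = 0.0) -> bool:
        """Unpause pipelines only if enough error budget remains to absorb
        the planned change, including any downtime it is known to introduce."""
        remaining = budget_minutes - downtime_minutes_spent
        return remaining - planned_downtime_minutes > 0

    # Using the numbers from earlier in the post: 20 of ~43.2 minutes spent
    # leaves ~23 minutes, so a change with no expected downtime can proceed.
    print(deployments_allowed(43.2, 20.0))  # True
    print(deployments_allowed(43.2, 45.0))  # False: the budget is overspent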

If Error Budget is on Target…

If sufficient error budget remains, the team usually determines it is safe to unpause the automated deployment pipelines and/or plan a manual deployment. The team then considers two additional factors in its deployment planning for the upcoming week:

  • Does this product result in any known downtime?

  • How long does this product traditionally take to deploy in this environment?

As an example, in this particular environment, the Pivotal Application Service (PAS) product and the Redis for PCF product both take several hours to successfully deploy and validate. Because the Cloud Ops team has a work-life-balance goal of keeping deployments and pages within normal working hours where possible, a methodology they call “DayOps”, they schedule these product deployments to occur on different work days. This type of context has come to the team organically through experience, and engineers keep easily referenceable deployment notes on all upgrades.

If Error Budget is Not on Target…

If insufficient error budget remains, the team makes a shared choice to prioritize stability over innovation, and the deployment pipelines are paused for the week.

As an example of a recent purposeful deployment pause, Cloud Ops was asked by its business stakeholders to upgrade to a newer PAS patch version. However, the error budget display showed that the team had actually overspent its error budget slightly, due to a recent outage.

These kinds of discussions can get heated – but they didn’t in this case. With an established agreement, and a shared language for talking about the impact on user satisfaction when the error budget is exceeded, the team could easily explain why the upgrade could not occur at that time. It also allowed the business and the Cloud Ops team to collectively make a decision that everyone felt comfortable with. In this case, the deployments were paused until 10 minutes – approximately a week’s worth of error budget – had been recovered.
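
To put that figure in context: a 99.9% SLO yields roughly 43.2 minutes of budget per 30 days, or about 1.4 minutes per day, so ten minutes corresponds to roughly a week of budget accrual.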

Stand on the Shoulders of Giants, and Embrace Modern Operational Thinking

The era of distributed systems requires a reset across the board in enterprise IT: developers, InfoSec teams, architects, CIOs, and yes, IT operations. The good news is that you have a playbook to follow: Site Reliability Engineering. We’ve embraced these ideas across Pivotal, and we’re helping our customers adopt them as well. We hope this case study of our internal efforts can inform how you improve business outcomes.

Working in IT operations has always been a tough job. But with the right approach and a shared understanding, life gets that much easier.

Want to learn more about modern monitoring practices? Join us at SpringOne Platform in Washington, D.C., September 24 to 27, 2018. Register now!


Special thanks to the Pivotal teams that contributed to this article, Pivotal Tracker and Cloud Ops!