Every engineering team must manage some level of operational load. But too much of it can get in the way of doing the important and engaging work that will make your organization—and your team—thrive.
VMware Customer Reliability Engineering (CRE) is no different. We are a team of site reliability engineers and program managers who work together with Tanzu customers and partner teams to learn and apply reliability engineering practices using our Tanzu portfolio of services. As part of our product engineering organization, VMware CRE is responsible for some reliability engineering-related features for Tanzu. We are also in the escalation path for our technical support teams to help our customers meet their reliability goals.
We recently completed an assessment of operational load on our VMware CRE team. As a result, we significantly reduced that load, improved our team well-being, and increased the amount of spare time and energy we have to invest in reliability engineering projects to improve Tanzu.
Operational load
You might be wondering what we mean by “operational load” since there are many perfectly reasonable yet subtly different definitions out there. For example, some companies prefer the term “toil.” We define operational load as follows:
Operational load is composed of a subset of unplanned tasks requiring manual intervention. The litmus test: performing such a task resolves the problem at hand and possibly sharpens our skills, but it won’t help address a whole set of similar issues in the future.
Motivation
Operational load doesn’t scale well. The team performing such work may have to grow as a function of the volume of work, number of customers, number of services, and so on. While this is acceptable for many companies and roles, it is generally not a desirable way to scale site reliability engineering (SRE) teams. The SRE role, as the name implies, calls for some of its time to be dedicated to engineering reliability, usually into products and services or together with customers. A team called SRE that only operates, or that treats mostly operating as its goal, is better described as an operations team. SREs are also hard to find, recruit, and train, given that the role is relatively new.
While it is part of the role for SRE teams to perform some operational work, any team attempting to specialize in engineering reliability into products, production systems, or with their customers should avoid investing too much time on operational work for too long.
It may seem counterintuitive that an SRE team experiencing operational overload is likely to fail if the service it operates is a success. Once it is understood and accepted that operational load doesn’t scale well, and that one cannot, and should not, attempt to quickly hire many SREs, it becomes clear that an SRE team, like any other software engineering team, can be overwhelmed with “tech debt.” As the service grows, so does the operational load, which quickly consumes the team’s time and energy until it can no longer perform both operations and engineering duties. In such cases, engineers interested in a healthy mix of both types of work become increasingly frustrated.
After deciding on our definitions for “operational load,” as well as “too much time” and “too long,” the CRE team concluded that we had been spending roughly 65 percent of our time on operational work, above our target ceiling of 50 percent, and that this had gone on for too long (at least a quarter).
Defining workload categories
To categorize your workload, it is a good idea to be as consistent as possible within the team and across your measurements.
Our categories were defined as follows:
- Operational: reactive work, whether human-initiated (e.g., tickets) or system-initiated (e.g., alerts).
- Project: proactive work, e.g., automating manual processes, building team infrastructure, and other enduring reliability-related work.
The team converged on categorizing all of our time investments between “operational” and “project.” As a result, our categories are intentionally broad, and they may or may not work for you. For example, in an initial measurement, we separated “meetings” and assessed time invested in different kinds of meetings. In follow-up measurements, we accounted for meeting times as either a side effect of operational or project work.
We also counted operational work performed as a side effect of an engineering project in the “operational” category. This may bias the numbers toward operational load, but that was intentional: we were looking for the highest amount of operational load we could possibly have.
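The categorization rules above can be sketched in code. This is a minimal illustration for a team that keeps a time log, which is an assumption: our own numbers came from surveys. The field names (`hours`, `driver`, `during_project`) and the driver labels are hypothetical.

```python
# Hypothetical labels for what triggered a block of work.
REACTIVE_DRIVERS = {"ticket", "alert"}


def categorize(entry):
    """Map one logged block of time to 'operational' or 'project'.

    Meetings are logged with the driver of the work that caused them,
    so they need no category of their own. The during_project flag is
    deliberately ignored: reactive work done inside a project still
    counts as operational, which biases the result upward on purpose.
    """
    return "operational" if entry["driver"] in REACTIVE_DRIVERS else "project"


def operational_share(entries):
    """Return the percentage of logged hours that were operational."""
    total = sum(e["hours"] for e in entries)
    if total <= 0:
        raise ValueError("need a positive number of logged hours")
    ops = sum(e["hours"] for e in entries if categorize(e) == "operational")
    return 100 * ops / total


# Made-up week of work, for illustration only.
week = [
    {"hours": 10, "driver": "ticket", "during_project": False},
    {"hours": 6, "driver": "alert", "during_project": True},    # ops work inside a project
    {"hours": 4, "driver": "ticket", "during_project": False},  # meeting about a ticket
    {"hours": 20, "driver": "planned", "during_project": True},
]
share = operational_share(week)
```

Because the alert handled during project time lands in the operational bucket, the computed share is an upper bound rather than a best guess.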
Assessing a team’s current workload
Once you have your categories defined, it is time to assess the current workload. We did so via team surveys, and we opted not to share individual responses with the entire team in order to preserve and support psychological safety.
The team received a report with aggregated results and an action plan since it was clear we were in operational overload. Our goal was to have less than 50 percent of our time invested in operational work and we were at 65 percent.
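As a rough sketch of how such an aggregated report could be produced, the snippet below takes each respondent's estimated operational percentage and returns only team-level figures. The function name, the returned fields, and the sample answers are illustrative assumptions, not our actual tooling or data; only the 50 percent ceiling comes from our target.

```python
from statistics import median

OPERATIONAL_CEILING_PCT = 50  # target ceiling for operational work


def summarize_operational_load(responses, ceiling=OPERATIONAL_CEILING_PCT):
    """Aggregate per-person operational percentages into a team summary.

    Only aggregates are returned, so individual answers are not exposed.
    """
    if not responses:
        raise ValueError("need at least one survey response")
    if any(not 0 <= r <= 100 for r in responses):
        raise ValueError("each response must be a percentage between 0 and 100")
    team_mean = sum(responses) / len(responses)
    return {
        "respondents": len(responses),
        "mean_pct": round(team_mean, 1),
        "median_pct": median(responses),
        "overloaded": team_mean > ceiling,
    }


# Made-up sample answers, for illustration only.
sample = [70, 60, 65, 55, 75]
summary = summarize_operational_load(sample)
```

With very small teams, even an aggregate can hint at individual answers, so it may be worth reporting only the mean in that case.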
Establishing an action plan
Establishing a proper action plan is heavily dependent on the kind of work your team has been performing. We had the opportunity to roll out two significant changes that reduced our operational load from approximately 65 percent to 35 percent in less than a quarter, exceeding our expectations.
Align team’s work with a charter approved by your stakeholders – We started with a full assessment of what was in scope for our team and ceased activities that were no longer a fit for our team’s charter. If you don’t have a team charter approved by your stakeholders, you may need one before attempting this. It may seem obvious, but the work isn’t as trivial as ceasing to engage with projects that are no longer a match. It may require an exit plan, including convincing peer teams of alternatives that do not involve your team performing certain types of work.
Defragment the team’s schedule to minimize workflow interruptions – Some of the most detrimental things for effective engineering work (or any work, really) are fragmented schedules and, generally speaking, interruptions. Context switching is expensive. As part of this project, we reviewed our team meetings to defragment our schedule. Early in our assessment, we realized our team meetings had grown organically. The result was too many short meetings rather than just enough meetings, each covering several topics of discussion with a recurring agenda and room for new items.
We also established an “interrupts rotation.” You can think of it as a rotation for operational work and other kinds of interruptions. We already had an on-call rotation for production outages, and the interrupts rotation is separate from it.
Our interrupts rotation started as a week-long shift where only one engineer was interruptible at a time. Everybody else was not reachable by customers or partner teams, and was free to perform proactive project work. In practice, our interruptible engineer acted as our first line of defense for team interruptions. We still had to bring in some other team members to assist with customer engagements on occasion. For example, one of our secret weapons is the fact that we have amazing program managers who help us with continuous improvement of existing workflows, establish new processes where there were none, and identify and eliminate unnecessary work altogether.
These improvements were sufficient to get us out of operational overload, and we commonly see similar automation work and process improvements playing a key transformation role for other teams. Our team had a further improvement in mind…
Keep iterating
While most of our engineers started to dedicate more time to project engineering work and proactive customer-facing activities, we realized that they were still working on closing some lingering issues after a week-long interrupts shift. The team then suggested formalizing that.
We now have a primary and secondary interrupts schedule where, by default, the secondary on a given week is the primary from the previous week. Our interrupts rotation is still intentionally separate from our on-call rotation dedicated to production outages, but we will keep on iterating and are set to run our survey for a third time soon.
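The default rule for this schedule can be sketched as a simple round-robin. This is an illustration under assumptions: the function name and roster are hypothetical, and the choice to give the first week's secondary slot to the last person in the roster is ours for the sketch, since there is no previous primary at that point.

```python
def interrupts_schedule(engineers, weeks):
    """Return one (primary, secondary) pair per week.

    The secondary for a week is the primary from the previous week, so
    that person can close out lingering issues from their shift.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # Week 0 has no previous primary; wrapping to the end of the
        # roster is an assumption made for this sketch.
        secondary = engineers[(week - 1) % len(engineers)]
        schedule.append((primary, secondary))
    return schedule


# Hypothetical roster, for illustration only.
roster = ["ana", "bo", "cy"]
plan = interrupts_schedule(roster, 4)
```

Real schedules also need overrides for vacations and swaps, which is why the rule above is only a default.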
We recommend running your own time investment survey (see example below) at least once a quarter, but that’s arbitrary and there is probably low value in running it a second time if you haven’t executed on an action plan after the first run.
Time investment survey
Here is a summarized version of our time investment survey:
- If you were to categorize all of your work between operational and projects since the last survey (or in the past three months if this is your first survey), what percentage would you assign to operational work?
- What’s your top source of operational work?
  - System-initiated (e.g., alerts)
  - Human (e.g., human-generated tickets)
- Do you have any other sources of operational work besides alerts and tickets? If yes, which one(s)?
- From 0 to 10 (with 10 being the happiest), how satisfied are you with the quantity of time you spend on tickets?
- Project work
  - What are the top 3 things that could improve your workflow while performing project work?
  - Do you have a long-term engineering project in your quarterly objectives?
Now that you understand a bit more about operational load and why we should care, how does one begin to put this knowledge to use? VMware facilitates SRE-related workshops with our customers to help them get started adopting reliability engineering practices; read more about the offerings in our Tanzu portfolio and reach out to a sales representative today.
Alexandra McCoy, Corey Innis, Kalai Wei, Kam Kyrala, and Kimberly Embry contributed to this post.