By Arun Thundyill Saseendran
Enterprise platform software systems require reliability, robustness, high resiliency, and high availability. The platform teams in VMware's Office of the CTO organization build and manage platform services that other software systems use as building blocks, and hence have stringent requirements for high availability and resiliency.
99.99% availability (4 9’s) is a common standard, which means a software system can be unavailable for at most about 4 minutes and 23 seconds in a month. And we strive for 99.999% (5 9’s), which allows roughly 26 seconds of downtime in a month. DevOps is often suggested as the solution for such scenarios. Does embracing DevOps alone create the magic? Possibly not!
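These downtime budgets fall out of simple arithmetic. The sketch below assumes an average month of 30.44 days; any consistent month length works as long as you state it:

```python
# Back-of-the-envelope downtime budget for an availability target.
# Assumption: an average month of 30.44 days.
SECONDS_PER_MONTH = 30.44 * 24 * 60 * 60

def downtime_budget_seconds(availability: float) -> float:
    """Maximum allowed downtime per month, in seconds."""
    return SECONDS_PER_MONTH * (1 - availability)

print(f"99.99%  -> {downtime_budget_seconds(0.9999):.0f} s/month")   # ~263 s (~4 min 23 s)
print(f"99.999% -> {downtime_budget_seconds(0.99999):.0f} s/month")  # ~26 s
```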
VMware Platform Services embraces DevOps to its fullest. We have automated pipelines for almost everything. We refine the pipelines and make them robust with multiple gates of scrutiny (static code analysis, unit testing, integration testing, and end-to-end testing) so that the team has the confidence to release microservices multiple times a day. We configure the deployed systems with active log aggregation and monitoring: alerts are triggered proactively so engineers can act on a problem before it turns into downtime.
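In practice, a proactive alert is just a rule evaluated against aggregated metrics. The sketch below is purely illustrative; the metric names, thresholds, and paging hook are hypothetical:

```python
# Illustrative alert rules evaluated against aggregated metrics.
# Metric names, thresholds, and the paging hook are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str
    threshold: float
    description: str

RULES = [
    AlertRule("http_5xx_error_rate", 0.01, "Error rate above 1% of requests"),
    AlertRule("p99_latency_seconds", 2.0, "p99 latency above 2 seconds"),
]

def page_on_call(message: str) -> None:
    # In a real setup this would call the incident management system.
    print(f"ALERT: {message}")

def evaluate(metrics: dict) -> None:
    for rule in RULES:
        value = metrics.get(rule.metric, 0.0)
        if value > rule.threshold:
            page_on_call(f"{rule.description} (current value: {value})")

evaluate({"http_5xx_error_rate": 0.03, "p99_latency_seconds": 1.2})
```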
One other philosophy we embrace is the mantra of full-spec software development, where we, as one team, take care of all aspects of the software development lifecycle.
Full-spec software is the practice of designing and writing code with all the attributes of enterprise-grade software in mind, such as reliability, resiliency, security, testability, operability, accessibility, compliance, privacy, and more. In other words, we ensure that all these attributes are built in, from requirements to specification to coding.
In an ideal scenario, teams that practice and embrace the philosophies of full-spec software and DevOps should make the software systems highly resilient and highly available. However, there are some cases where an operational practice (a semi-automated process with human oversight) must be run to keep the systems and processes going.
Is that enough?
One of the questions often raised to teams is, “Is that enough?” True, teams follow the best practices of full-spec software development, embrace DevOps, and truly value automation, but does that mean no incidents will happen? Does that mean no manual intervention will be required? Automating everything is the ideal end state; however, there will surely be things that require executing operational practices one way or the other, and it is vital to be prepared.
The hidden potholes with the potential for catastrophe
In cases where DevOps is embraced and most things happen with the touch of a button, engineers have little to no practice with operational processes because they are rarely needed in a fully automated setup. I am sure many DevOps teams have come across scenarios where an operational practice is required, and the whole team relies on one person, or maybe no one is equipped with the knowledge to perform it. That’s a very bad situation to be in.
Also, in the rare cases when an operational practice must be executed, a poorly executed procedure could result in customer downtime; an entire fleet of dependent software could even go down. It is like driving a highly reliable car for ages and forgetting to maintain it: when the car finally breaks down, you have forgotten even how to open the hood, so you cannot top up the coolant, because no one driving the car knows how.
How to be prepared for potholes?
So how do we prepare for such break-down scenarios? How do we ensure that engineers are ready to face and overcome these scenarios that require the knowledge of operational practices?
The solution can be inspired by other fields that face analogous scenarios, for example, the military. The security and safety of a country are its responsibility. A threat that requires on-field intervention happens rarely, yet every soldier must be ready to face it. And how do they do it? The tried and tested solution is to perform drills regularly. Drills keep soldiers active and prepared for an on-field intervention at any time. The same is true of martial arts. Practitioners don’t always have threats to fight against; however, a system of individual training exercises, the kata, keeps them prepared for a threat.
“We at VMware have drawn inspiration from the drills and katas, added some fun to it (engineers love fun!), and made GameDay a framework for getting the teams prepared for unexpected potholes down the road.”
What is a GameDay?
GameDay is a planned team event that happens regularly, where the team gets together and performs drills on operational practices (often captured as runbooks from past experience), asks questions, refines the runbook, and rectifies any mistakes. It helps the team be prepared for rare operational events and gives every team member the confidence and capability to keep the car running even on the most challenging roads.
The GameDay Framework
The framework describes two flows: what happens when an incident occurs in production, and what happens during a GameDay.
When an incident happens in production:
- An incident is ideally triggered by a monitoring system, or is created when a user faces an issue and files an incident report.
- Once the incident is in the queue, it is first analyzed to gauge the customer impact. If the issue is with configuration, the user is advised to make changes, with a reference to the user manual. Otherwise, if an automated fix is available, it is either auto-triggered by the incident management system or triggered by an engineer.
- If the assessment leads to manual intervention, the first step is to look for a well-defined, tested (already put through a GameDay) runbook for the scenario and execute it. For a mature team, most incidents are resolved at this stage with little or no customer impact.
- In cases where no runbook is available, the person on call contacts the team for help. The first action is to mitigate customer impact. We design systems for deployment in multiple zones across multiple regions, so if a problem is identified in one region, traffic is diverted to the other available regions to minimize customer impact, and then the team works on a solution (a simplified triage flow is sketched after this list).
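The triage flow above could look roughly like the following. This is a minimal, self-contained sketch; the Incident fields, runbook registry, and region names are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    scenario: str
    region: str
    is_configuration_issue: bool = False
    has_automated_fix: bool = False

# Runbooks that have already been tested through a GameDay (hypothetical entries).
RUNBOOKS = {"certificate-rotation": ["Step 1: ...", "Step 2: ..."]}
REGIONS = ["us-east", "us-west", "eu-central"]

def handle_incident(incident: Incident) -> None:
    if incident.is_configuration_issue:
        print("Advise the user to change the configuration; point to the user manual.")
    elif incident.has_automated_fix:
        print("Trigger the automated fix (or let the incident system auto-trigger it).")
    elif incident.scenario in RUNBOOKS:
        print(f"Execute the tested runbook for '{incident.scenario}'.")
    else:
        # No runbook: mitigate customer impact first, then work on a solution.
        healthy = [r for r in REGIONS if r != incident.region]
        print(f"Divert traffic from {incident.region} to {healthy}, then engage the team.")

handle_incident(Incident(scenario="zone-outage", region="us-east"))
```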
The timeline and resolution of an incident are always documented. Within 48 hours of the incident, a post-incident review is conducted. All the stakeholders (including the affected customers) are invited, and the timeline of events, the solution implemented, the preventive measures planned so that the incident does not repeat (where possible), and the long-term actions to proactively fix issues are explained and documented. The long-term fixes, if any, are added to the team scrum backlog to prioritize and work on. Meanwhile, a runbook with step-by-step instructions to resolve the issue if it recurs is added to the GameDay backlog, which is a list of runbooks ordered by probability of occurrence and severity, highest first.
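The backlog ordering can be as simple as sorting by estimated likelihood and severity. The data model below is a hypothetical illustration:

```python
# Hypothetical data model for the GameDay backlog, ordered by likelihood and severity.
from dataclasses import dataclass

@dataclass
class Runbook:
    scenario: str
    probability: float   # estimated likelihood of recurrence, 0..1
    severity: int        # 1 (minor) .. 5 (customer-facing outage)

def prioritized_backlog(runbooks):
    """Highest probability and severity first, as described above."""
    return sorted(runbooks, key=lambda r: (r.probability, r.severity), reverse=True)

backlog = prioritized_backlog([
    Runbook("database-failover", probability=0.3, severity=5),
    Runbook("certificate-rotation", probability=0.7, severity=3),
])
print([r.scenario for r in backlog])  # ['certificate-rotation', 'database-failover']
```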
During a GameDay:
- An important aspect for a GameDay to be successful is to have a good cadence. The team should decide on a cadence, just like it decides on the duration of a sprint. Our team follows a cadence of 1 month.
- For each month’s GameDay, we choose a scenario/runbook with a high probability of occurrence and high severity that has not yet been used in a GameDay. If all scenarios have already been used, we pick the least recently practiced one.
- For each GameDay, a driver and an executor are chosen, and the whole team attends the event. The driver is a team member who has experience with the scenario and who will lead the solution to the problem. The executor can be anyone, ideally chosen in a round-robin fashion (a simple rotation is sketched after this list). The goal is to practice in such a way that any member of the team can handle the scenario.
- During the event, the executor performs the steps in the runbook as-is, and the driver guides where necessary. The driver or another team member also notes the time taken for each step, corrections to steps in the runbook, and additional things to take care of. At the end of the event, the team discusses the improvements and the scope for permanent, long-term fixes; if there are any, they are added to the team scrum backlog. The runbook is updated with the corrections and improvements, a recording of the event is attached to it, and it is added back to the bottom of the GameDay backlog.
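Picking the executor can be done with a plain rotation over the team roster. The names below are placeholders:

```python
# Hypothetical round-robin rotation for choosing the executor of each GameDay.
from itertools import cycle

TEAM = ["Asha", "Bhavin", "Carlos", "Devi"]
next_executor = cycle(TEAM)

# One pick per monthly GameDay; over four months everyone has executed once.
for month in ["January", "February", "March", "April", "May"]:
    print(month, "->", next(next_executor))
```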
Are GameDays suitable for all teams?
“Yes” is the one-word answer. GameDay is a must-have event for every software development team that takes care of its software from inception to maintenance, and it ideally involves the full team. However, the same framework can be adapted for organizations that have separate functions for software development, deployment, and maintenance. The framework presented in this blog has been tested and refined over time, and we can say with confidence that it works well.
What are the benefits of a GameDay?
- The team is prepared, trained, and equipped to face non-routine events in production systems.
- The runbooks are fine-tuned, cleared of mistakes, and provide step-by-step instructions that any team member can execute with confidence when an actual incident happens.
- With repeated practice runs, every member of the team becomes equipped to execute operational practices when the need arises; there is no single point of failure or dependency on a single person.
- The team is aware of scenarios that require the execution of operational practices and looks for ways to tackle them proactively or automate them.
- GameDay provides a good learning platform.
GameDay Best Practices
- Cadence – It is important to have a well-defined cadence. Once a month has worked well for our teams.
- Zero downtime is the mantra – Always keep in mind that the first steps in a runbook should ensure zero downtime of the system. Whether downtime is avoided by diverting traffic or by triggering a DR solution, it should be a foolproof, well-tested procedure.
- Good topic/scenario selection – Choosing a good topic results in an effective GameDay. Ideally, the topic should be one with a high probability of occurrence and high severity. Another approach is to select a topic based on recent incidents, to verify that the fixes/updates are working as expected and that the team is prepared to deal with any issues that may come up.
- Test out RPO and RTO times – GameDays are a great way to simulate an incident and check whether the procedure to mitigate it meets the defined recovery point objective (RPO) and recovery time objective (RTO) for the business unit (a simple check is sketched after this list).
- Round-robin executors – Choosing executors in a round-robin fashion lets different team members practice handling catastrophic situations. It also keeps the on-call engineer up to date and able to handle the situation if it comes up.
- Diligently record every detail – It is important that the runbooks contain all the necessary information. During an actual incident, the engineer handling it will already be under pressure, so do not assume certain procedures are known.
- A GameDay should be fun – Mistakes and fumbles can happen. In fact, the aim is to identify and rectify common errors when performing a procedure. However, be sure to play it with spirits high!
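As an example of the RPO/RTO check mentioned above, timing the drill and comparing the results against the targets can be as simple as the following. The targets and measured values are hypothetical:

```python
# Hypothetical check of measured recovery times against RPO/RTO targets during a drill.
from datetime import timedelta

RTO_TARGET = timedelta(minutes=15)   # assumed recovery time objective
RPO_TARGET = timedelta(minutes=5)    # assumed recovery point objective

def check_objectives(measured_recovery: timedelta, measured_data_loss: timedelta) -> None:
    rto_ok = measured_recovery <= RTO_TARGET
    rpo_ok = measured_data_loss <= RPO_TARGET
    print(f"RTO {'met' if rto_ok else 'MISSED'}: {measured_recovery} (target {RTO_TARGET})")
    print(f"RPO {'met' if rpo_ok else 'MISSED'}: {measured_data_loss} (target {RPO_TARGET})")

# Values measured while timing the runbook steps during the GameDay (hypothetical).
check_objectives(timedelta(minutes=12), timedelta(minutes=7))
```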