

Guidance for Major Incident Management Decisions

By Brian Florence

If you’re an IT director or CIO of a corporation with large, business-critical environments, you know that every minute those environments are unavailable costs your company money, potentially millions of dollars.

Most of my IT clients manage multiple environments, many of which fall into the business-critical category. One proactive step is to define “key” or “critical” environments, each of which can be assigned to a specific individual who is accountable for restoring service to that environment.

The Information Technology Infrastructure Library (ITIL) defines a typical incident management process as one designed to restore services as quickly as possible, while a “major incident” management process focuses specifically on restoring business-critical services. When incidents cause major business impact beyond what typical major incident management functions cover, it’s important to pinpoint accountability for the business-critical environments where your company would experience a significant loss of capital or critical functionality. Those environments deserve special attention, even beyond the regular major incident process.

The First Responder Role

When a company has multiple business-critical environments, each major environment is assigned a first responder who assumes the major incident lead role and provides accountability and leadership. The first responder’s accountabilities typically go over and above the normal incident management processes for which an incident manager and/or major incident manager may be responsible. The first responder’s accountabilities are to:

  • Restore service for incidents that fall into the agreed-upon top priority assignment (P0/P1 or S0/S1, depending on whether priority or severity is the chosen terminology). This includes handling all technical support team escalations and communicating incident status to management, with follow-up once the incident is resolved.
  • Create documentation to guide the service restoration process (often called a playbook, or another name recognized for each major environment). It specifies contacts for technical teams, major incident management procedures for that specific environment, the critical infrastructure components that make up the environment, and any other environment-specific details needed to understand the environment and restore service promptly.
  • Develop the post-incident review process and communications, including the follow-up problem management process (in coordination with any existing problem management team) to ensure its successful completion and documentation.
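
For teams that keep their playbooks in a machine-readable form, the accountabilities above could be sketched roughly as follows. This is only an illustration, not a prescribed format: the `Playbook` fields, the `route_incident` function, the environment name, and the contact values are all hypothetical, and the set of top priority labels simply mirrors the P0/P1 and S0/S1 terminology mentioned above.

```python
from dataclasses import dataclass, field

# Assumption: the "agreed-upon top priority assignment" is one of these labels.
TOP_PRIORITIES = {"P0", "P1", "S0", "S1"}


@dataclass
class Playbook:
    """Hypothetical per-environment playbook record."""
    environment: str
    first_responder: str
    # Technical team name -> contact details for escalations.
    technical_contacts: dict = field(default_factory=dict)
    # Critical infrastructure components that make up the environment.
    critical_components: list = field(default_factory=list)
    # Environment-specific major incident procedures, in order.
    restoration_steps: list = field(default_factory=list)


def route_incident(priority: str, playbook: Playbook) -> str:
    """Return who leads an incident of the given priority."""
    if priority.upper() in TOP_PRIORITIES:
        # Top priority incidents go to the environment's first responder.
        return playbook.first_responder
    # Everything else stays with the normal incident management process.
    return "standard incident management"


# Illustrative values only.
playbook = Playbook(
    environment="order-processing",
    first_responder="A. Example",
    technical_contacts={"database": "db-oncall@example.com"},
    critical_components=["database cluster", "load balancer"],
    restoration_steps=["Confirm impact", "Engage technical teams", "Notify management"],
)

print(route_incident("P1", playbook))
print(route_incident("P3", playbook))
```

Keeping the first responder, contacts, and components together in one record per environment reflects the idea that each critical environment has a single accountable owner and a single place to look during an outage.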

I also recommend assigning this role to someone familiar with all of the components and processes of the environment they are responsible for, so that the management process runs as smoothly as possible during business-critical incidents.

Reducing the Business Impact of Major Incidents

With a first responder in place, the procedure for resolving major incidents is more prescribed. With each major incident, your company learns what is causing incidents and, most importantly, has a documented process in place for resolution. Ultimately, incidents are resolved faster and more efficiently, your company avoids costly loss of critical functionality or capital due to downtime, and it is better able to avoid similar incidents in the future.

The business increasingly looks to IT to drive innovation. By keeping business-critical environments available, you can deliver on business goals that contribute to the bottom line.

—–
Brian Florence is a transformation consultant with VMware Accelerate Advisory Services and is based in Michigan.