
Tag Archives: incident management

How to Take Charge of Incident Ticket Ping Pong

By Pierre Moncassin

When incident tickets are repeatedly passed from one support team to another, I like to describe it as a “ping pong” situation. Most often this is not due to a lack of accountability or skills within individual teams. Each team genuinely fails to see the incident as relevant to its technical silo. Each feels perfectly justified in either assigning the ticket to another team, or even assigning it back to the team it came from.

And the ping pong game continues.

Unfortunately for the end user, the incident is not resolved whilst the reassignments continue. The situation can easily escalate into SLA breaches, financial penalties, and certainly disgruntled end users.

How can you prevent such situations? IT service management (ITSM) has been around for a long while, and there are known mitigations for these situations. Good ITSM practice calls for built-in mechanisms that prevent incidents from being passed back and forth. For example:

  • Define end-to-end SLAs for incident resolution (not just KPIs for each resolution team), and make each team aware of these SLAs.
  • Configure the service desk tool to escalate automatically (and issue alerts) after a number of reassignments, so that management becomes quickly aware of the situation.
  • Include cross-functional resolution teams as part of the resolution process (as is often done for major incident situations).
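The second mitigation, automatic escalation after repeated reassignments, can be sketched in a few lines. This is only an illustration, not the configuration of any particular service desk tool: the `Ticket` class, its `reassign` and `is_ping_pong` methods, and the threshold of three reassignments are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: escalate once a ticket has been reassigned this many times.
MAX_REASSIGNMENTS = 3


@dataclass
class Ticket:
    """Illustrative incident ticket; not modeled on any specific ITSM product."""
    ticket_id: str
    assigned_team: str
    history: list = field(default_factory=list)  # teams that previously held the ticket
    escalated: bool = False

    def reassign(self, new_team: str) -> None:
        """Move the ticket to another team and escalate if it keeps bouncing."""
        self.history.append(self.assigned_team)
        self.assigned_team = new_team
        if len(self.history) >= MAX_REASSIGNMENTS:
            # A real tool would also alert management at this point.
            self.escalated = True

    def is_ping_pong(self) -> bool:
        """True if the ticket is back with a team that already held it."""
        return self.assigned_team in self.history
```

In this sketch, a ticket that moves from one team to a second, back to the first, and then to the second again is flagged as ping pong on the second move and escalated on the third.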

In my opinion, these approaches have a drawback: they take time and effort to put in place, and incidents may still fall through the cracks. But with a cloud management platform like VMware vRealize Suite, you can take prevention to another level.

A core reason for ping pong situations often lies in the teams’ inability to pinpoint the root cause of the incident. VMware vRealize Operations Manager (formerly known as vCenter Operations Manager) provides increased visibility into the root cause through its root cause analysis capabilities. Going one step further, its analytical capabilities give advance warning of impending incidents. In the most efficient scenario, support teams are warned of the impending incident and its cause well before an incident is raised. Most of the time, the incident ping pong game should never start.
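To make the idea of advance warning concrete, here is a toy sketch. It does not represent how vRealize Operations Manager works internally; it simply flags a metric reading that sits well above its recent baseline. The function name `early_warning`, the sample values, and the default of three standard deviations are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def early_warning(baseline, latest, k=3.0):
    """Return True if `latest` is more than k standard deviations above the baseline mean."""
    if len(baseline) < 2:
        return False  # not enough history to estimate variability
    return latest > mean(baseline) + k * stdev(baseline)


# Hypothetical CPU utilization samples (percent) from a healthy period.
cpu_baseline = [50, 52, 49, 51, 50]
should_warn = early_warning(cpu_baseline, 60)
```

A support team alerted by a check like this can investigate the affected component before any user raises a ticket, which is the point at which the ping pong game would otherwise begin.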

Takeaways:

  • Build a solid foundation with the classic ITSM approaches based on SLAs and assignment rules.
  • Resolve incidents proactively, using the automated root cause analysis that vRealize Operations Manager offers to reduce time wasted on incident resolution.


Pierre Moncassin is an operations architect with the VMware Operations Transformation global practice and is based in Taipei. Follow @VMwareCloudOps on Twitter for future updates.

 

Guidance for Major Incident Management Decisions

By Brian Florence

If you’re an IT director or CIO of a corporation that has large, business-critical environments, you know that whenever those environments are unavailable, your company loses money for every minute of downtime, potentially millions of dollars.

Most of my IT clients manage multiple environments, many of which fall into the business-critical category. One proactive step is to define “key” or “critical” environments, which can be assigned to a specific individual accountable for the restoration of service for that environment.

The Information Technology Infrastructure Library (ITIL) defines a typical incident management process as one designed to restore services as quickly as possible, and a “major incident” management process as one focused specifically on business-critical service restoration. When incidents cause major business impact beyond what the regular major incident process is designed to handle, it’s important to pinpoint accountability for the business-critical environments where your company would experience a significant loss of capital or critical functionality.

The First Responder Role

Where there are multiple business-critical environments, each major environment is assigned a first responder who takes the major incident lead role, with the accountability and leadership that go with it. The first responder’s accountabilities typically go over and above the normal incident management processes for which an incident manager and/or major incident manager may be responsible. The first responder’s accountabilities are to:

  • Restore service for incidents that fall into the agreed-upon top priority assignment (P0/P1, or S0/S1, depending upon whether priority or severity is the chosen terminology), and handle all technical support team escalations and all communications to management regarding incident status and follow-up once the incident is resolved.
  • Create documentation to guide the service restoration process (often referred to as a playbook or other unique name recognized for each major environment), which specifies contacts for technical teams, major incident management procedures for that specific environment, identification of the critical infrastructure components that make up the environment, or other environment-specific details that would be needed for prompt service restoration and understanding of the environment.
  • Develop the post-incident review process and communications, including the follow-up problem management process (in coordination with any existing problem management team) to ensure its successful completion and documentation.
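The first responder arrangement can be pictured as a lookup from each major environment to its accountable person and playbook. The following is a minimal sketch under stated assumptions: the `Playbook` fields, the `route_incident` function, and the use of P0/P1 labels as the top priority tier are hypothetical choices for illustration, not a prescribed implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Priorities treated as "major" in this sketch; an organization using
# severity terminology would substitute S0/S1.
TOP_PRIORITIES = {"P0", "P1"}


@dataclass(frozen=True)
class Playbook:
    """Hypothetical per-environment restoration guide."""
    environment: str
    first_responder: str
    technical_contacts: tuple
    critical_components: tuple


def route_incident(environment, priority, playbooks) -> Optional[Playbook]:
    """Return the playbook whose first responder should lead the incident.

    Returns None when the incident is not top priority, or when the
    environment has no playbook, so the normal incident process applies.
    """
    if priority not in TOP_PRIORITIES:
        return None
    return playbooks.get(environment)
```

Keeping the contacts and critical components alongside the first responder's name reflects the playbook contents listed above: whoever picks up the incident has the environment-specific details in one place.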

I also recommend that this primary process management role of accountability be assigned to someone familiar with all of the components and processes of the specific environment they are responsible for, so the management process can run as smoothly as possible for business-critical incidents.

Reducing the Business Impact of Major Incidents

With a first responder in place, the procedure for resolving major incidents is more prescribed. With each major incident, your company learns what is causing incidents and, most importantly, has a documented process in place for resolution. Ultimately, incidents are resolved faster and more efficiently, and your company avoids the costly loss of critical functionality or capital due to downtime and is better able to avoid similar incidents in the future.

The business increasingly looks to IT to drive innovation. By keeping business-critical environments available, you can deliver on business goals that contribute to the bottom line.

Brian Florence is a transformation consultant with VMware Accelerate Advisory Services and is based in Michigan.