
Tag Archives: event incident problem management

How to Take Charge of Incident Ticket Ping Pong

By Pierre Moncassin

When incident tickets are repeatedly passed from one support team to another, I like to describe it as a “ping pong” situation. Most often this is not due to a lack of accountability or skills within individual teams. Each team genuinely fails to see the incident as relevant to its technical silo. Each feels perfectly justified in assigning the ticket to another team, or even in assigning it back to the team it came from.

And the ping pong game continues.

Unfortunately for the end user, the incident is not resolved whilst the reassignments continue. The situation can easily escalate into SLA breaches, financial penalties, and certainly disgruntled end users.

How can you prevent such situations? IT service management (ITSM) has been around for a long while, and there are known mitigations to handle these situations. Good ITSM practice dictates built-in mechanisms to prevent incidents from being passed back and forth. For example:

  • Define end-to-end SLAs for incident resolution (not just KPIs for each resolution team), and make each team aware of these SLAs.
  • Configure the service desk tool to escalate automatically (and issue alerts) after a number of reassignments, so that management becomes quickly aware of the situation.
  • Include cross-functional resolution teams as part of the resolution process (as is often done for major incident situations).
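
The automatic escalation rule in the second bullet can be sketched in a few lines. This is a minimal illustration only, not the configuration of any particular service desk tool: the `Ticket` class, the `reassign` and `is_ping_pong` helpers, and the threshold of three reassignments are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: escalate once a ticket has been reassigned this often.
MAX_REASSIGNMENTS = 3


@dataclass
class Ticket:
    ticket_id: str
    assigned_team: str
    history: list = field(default_factory=list)  # teams that previously held it
    escalated: bool = False


def reassign(ticket: Ticket, new_team: str) -> Ticket:
    """Move the ticket to new_team and flag it for escalation if it bounces too often."""
    ticket.history.append(ticket.assigned_team)
    ticket.assigned_team = new_team
    if len(ticket.history) >= MAX_REASSIGNMENTS:
        ticket.escalated = True  # a real tool would also alert management here
    return ticket


def is_ping_pong(ticket: Ticket) -> bool:
    """True if the ticket is back with a team that has already held it."""
    return ticket.assigned_team in ticket.history
```

In this sketch a ticket that goes from network to storage and back to network is reported as ping pong on the second move, and is flagged for escalation on the third.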

In my opinion these approaches have drawbacks: they take time and effort to put in place, and incidents may still fall through the cracks. But with a cloud management platform like VMware vRealize Suite, you can take prevention to another level.

A core reason for ping pong situations often lies in the team’s inability to pinpoint the root cause of the incident. VMware vRealize Operations Manager (formerly known as vCenter Operations Manager) provides increased visibility into the root cause, through root cause analysis capabilities. Going one step further, vRealize Operations Manager gives advance warning on impending incidents—thanks to its analytical capabilities. In the most efficient scenario, support teams are warned of the impending incident and its cause, well ahead of the incident being raised. Most of the time, the incident ping pong game should never start.

Takeaways:

  • Build a solid foundation with the classic ITSM approaches based on SLAs and assignment rules.
  • Leverage proactive resolution, and take advantage of enhanced root cause analysis that vRealize Operations Manager offers via automation to reduce time wasted on incident resolution.


Pierre Moncassin is an operations architect with the VMware Operations Transformation global practice and is based in Taipei. Follow @VMwareCloudOps on Twitter for future updates.

 

Leveraging Proactive Analytics to Optimize IT Response

By Rich Benoit

While ushering in the cloud era means a lot of different things to a lot of different people, one thing is for sure: operations can’t stay the same. To leverage the value and power of the cloud, IT organizations need to:

  1. Solve the challenge of too many alerts with dynamic thresholds
  2. Collect the right information
  3. Understand how to best use the new alerts
  4. Improve the use of dynamic thresholds
  5. Ensure the team has the right roles to support the changing environment

These steps can often be addressed by using the functionality within VMware vRealize Operations Manager, as described below.

1) Solve the challenge of too many alerts with dynamic thresholds
In the past, when we tried to alert based on the value of a particular metric, we found that it tended to generate too many false positives. Since false positives lead to alerts being ignored, we raised the hard threshold for the alert until we no longer got false positives. The problem was that users were then calling in before the alert actually triggered, defeating the purpose of the alert in the first place. As a result, we tended to monitor very few metrics because of the difficulty in finding a satisfactory threshold.

However, now we can leverage dynamic thresholds generated by analytics. These dynamic thresholds identify the normal range for a wide range of metrics according to the results of competing algorithms that best try to model the behavior for each metric over time. Some algorithms are based on time such as day of the week, while others are based on mathematical formulas. The result is a range of expected behavior for each metric for a particular time period.
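
To make the idea of a per-period normal range concrete, here is a deliberately simple sketch. It is not the algorithm vRealize Operations Manager uses (the product selects among competing algorithms, as described above). The example assumes a single time-based model that groups history by day of the week and treats the mean plus or minus two standard deviations as normal. The function name, the input shape, and the `width` parameter are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def dynamic_thresholds(samples, width=2.0):
    """Compute a normal range per weekday.

    samples: iterable of (weekday, value) pairs from the metric's history.
    Returns {weekday: (low, high)}, using mean +/- width * standard deviation.
    """
    by_day = defaultdict(list)
    for weekday, value in samples:
        by_day[weekday].append(value)

    bands = {}
    for weekday, values in by_day.items():
        if len(values) < 2:
            continue  # not enough history to estimate a spread
        centre = mean(values)
        spread = stdev(values)
        bands[weekday] = (centre - width * spread, centre + width * spread)
    return bands


history = [("Mon", 80), ("Mon", 90), ("Mon", 100),
           ("Tue", 20), ("Tue", 30), ("Tue", 40)]
bands = dynamic_thresholds(history)
```

With this history, a reading of 95 falls inside Monday's range but well outside Tuesday's, which is the behavior a single hard threshold cannot express.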

One of the great use cases for dynamic thresholds is that they identify the signature of applications. For example, they can show that the application always runs slow on Monday mornings or during month-end processing. Each metric outside of the normal signature constitutes an anomaly. If enough anomalies occur, an early warning smart alert can be generated within vRealize Operations Manager that indicates that something has changed significantly within the application and someone should investigate to see if there’s a problem.
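
The "enough anomalies" rule can be illustrated as follows. This is a sketch under stated assumptions rather than the product's actual logic: the normal ranges are passed in as a plain dictionary, and the `min_anomalies` cutoff of three is an arbitrary value chosen for the example.

```python
def find_anomalies(current, bands):
    """Return the names of metrics whose current value is outside its normal range.

    current: {metric_name: value}
    bands:   {metric_name: (low, high)} normal range for the current time period
    """
    return [
        name
        for name, value in current.items()
        if name in bands and not (bands[name][0] <= value <= bands[name][1])
    ]


def early_warning(current, bands, min_anomalies=3):
    """Raise an early warning only when several metrics are anomalous at once."""
    anomalies = find_anomalies(current, bands)
    return len(anomalies) >= min_anomalies, anomalies


bands = {"cpu": (10, 60), "latency_ms": (50, 200), "queue": (0, 20), "errors": (0, 5)}
quiet = {"cpu": 35, "latency_ms": 240, "queue": 4, "errors": 1}
noisy = {"cpu": 95, "latency_ms": 480, "queue": 55, "errors": 2}
```

Here the `quiet` reading has a single anomaly and does not trigger a warning, while the `noisy` reading has three and does. Requiring several simultaneous anomalies is what keeps one unusual metric from paging anyone.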

2) Collect the right information
As we move from more traditional, client-server era environments to cloud era environments, many teams still use monitoring that has been optimized for the previous era (and tends to be siloed and component-based, too).

It’s not enough to just look at what’s happening with a particular domain or what’s going on with up-down indicators. In the cloud era, you need to look at performance that’s more aligned with the business and the user experience, and move away from a view focused on a particular functional silo or resource.

By putting those metrics into a form that an end user can relate to, you can give your audience better visibility and improve their experience. For example, if you were to measure the response time of a particular transaction, when a user calls in and says, “It’s slow today,” you can check the dynamic thresholds generated by the analytics that show the normal behavior for that transaction and time period. If indeed the response times are within the normal range, you can show the user that although the system may seem slow, it’s the expected behavior. If on the other hand the response times are higher than normal, a ticket could be generated for the appropriate support team to investigate. Ideally, the system would have already generated an alert that was being researched if a KPI Smart Alert had been set up within vRealize Operations Manager for that transaction response time.
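
The decision flow for an "it's slow today" call could look something like the sketch below. The function name, its parameters, and the returned messages are all hypothetical. The normal range is assumed to come from whatever analytics you have in place for that transaction and time period.

```python
def triage_slow_report(response_ms, normal_range, open_alert_id=None):
    """Decide how to handle a user's report that a transaction is slow.

    response_ms:   measured response time of the transaction
    normal_range:  (low, high) expected range for this time period
    open_alert_id: identifier of an alert already being researched, if any
    """
    _low, high = normal_range
    if response_ms <= high:
        # Slow-feeling but expected: show the user the normal range.
        return "expected: within normal range for this time period"
    if open_alert_id is not None:
        # The system already noticed; attach the call to the existing alert.
        return f"linked to existing alert {open_alert_id}"
    return "ticket raised for support team"
```

For example, a 900 ms response against a normal range of 400 to 1200 ms is reported as expected behavior, while 2000 ms with no open alert results in a ticket.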

3) Understand how to best use the new alerts

You may be wondering: Now that I have these great new alerts enabled by dynamic thresholds, how can I best leverage them? Although they are far more actionable than previous metric-based alerts, the new alerts may still need some form of human interaction to make sure that the proper action is taken. For example, it is often suggested that when a particular cluster in a virtualized environment starts having performance issues, an alert should be generated that automatically bursts its capacity. The problem with this approach is that although performance issues can indicate a capacity issue, they can also indicate a break in the environment.

The idea is to give the user as much info as they need when an alert is generated to make a quick, well-informed decision and then have automations available to quickly and accurately carry out their decision. Over time, automations can include more and more intelligence, but it’s still hard to replace the human touch when it comes to decision making.

4) Improve the use of dynamic thresholds
A lot of monitoring tools are used after an issue materializes. But implementing proactive processes gives you the opportunity to identify or fix an issue before it impacts users. It’s essential that the link to problem management be very strong so processes can be tightly integrated, as shown in figure 1.

Figure 1: Event incident problem cycle

During the Problem Management Root Cause Analysis process, behaviors or metrics are often identified that are leading indicators of imminent impacts to the user experience. As mentioned earlier, vRealize Operations Manager, as the analytics engine, can create both KPI and Early Warning smart alerts at the infrastructure, application, and end-user levels to alert on these behaviors or metrics. By instrumenting these key metrics within the tool, you can create actionable alerts in the environment.

5) Ensure the team has the right roles to support the changing environment
With the newfound abilities enabled by an analytics engine like vRealize Operations Manager, the team's roles and structure become more critical. As shown in figure 2 below, the analyst role should be there to identify and document opportunities for improvement, as well as report on the KPIs that indicate the effectiveness of the alerts already in place. In addition, developers are needed to develop the new alerts and other content within vRealize Operations Manager.

Figure 2: New roles to support the changing environment

In a small organization, one person may be performing all of these functions, while in a larger organization, an entire team may perform a single role. This structure can be flexible depending on the size of the organization, but these roles are all critical to leveraging the capabilities of vRealize Operations Manager.

By implementing the right metrics, right KPIs, right level of automation, and putting the right team in place, you’ll be primed for success in the cloud era.

----
Richard Benoit is an Operations Architect with the VMware Operations Transformation global practice.

3 Steps to Get Started with Cloud Event, Incident, and Problem Management

By Rich Benoit

We are now well entrenched in the Age of Software. Regardless of the industry, there is someone right now trying to develop software that will turn that industry on its head. Previously, companies worked with one app that had the infrastructure along with it. It was all one technology, and one vendor’s solutions. Now there are tiers all over the place, and the final solution uses multiple components and technologies, as well as virtualization. This app is a shape shifter, one that changes based on the needs of the business. When application topology is changing like this over time, it creates a major challenge for event, incident, and problem management.

Addressing that challenge involves three major steps that will affect the people, processes, and technologies involved in managing your app.

1. Visualize with a unified view
The standard approach to monitoring is often component- or silo-focused. This worked well when apps were vertical, with an entire application on one server; but with a new, more horizontal app that spans multiple devices and technologies – physical, virtual, web – you need a unified view that shows all tiers and technologies of an application. That view has to aggregate a wide range of data sources in a meaningful way, and then identify new metrics and metric sources. The rule of thumb should be that each app gets its own set of dashboards: “big screen” dashboards for the operations center that show actionable information for event and incident management; detailed interactive dashboards that allow the application support team to drill down into their app; and management-level dashboards that show a summary business view of application health and KPIs.

By leveraging these dashboards, event and incident management teams can pull up a real-time view to diagnose any issues that arise (see example below). Visualization is key in this approach, because it lets you organize the data in a way that actually allows events, incidents, and problems to be identified.

VMware® vCenter™ Operations Manager™ “big screen” dashboard

2. Aggregate
When you’re coordinating a number of distributed apps, establishing timelines and impact becomes a much more complicated process. Here’s where your unified view can start to help identify problems before they occur. Track any changes in behavior that occur, and then map them back to any changes that have been made in the environment. When I’m working with clients, I demonstrate the ability of VMware® vCenter™ Operations Manager™ to establish dynamic thresholds. The dynamic thresholds track what constitutes common fluctuations and leverage those analytics to establish baselines around what constitutes “normal.” By looking at the overall data as a big picture, the software can avoid false triggering around normal events.

3. Leverage problem management
Ideally, you will be catching events and incidents before they result in downtime. However, that requires constantly looking for new metrics and metric sources to create a wider view of the app. Problem management teams should be trained to identify opportunities for new metrics and new metric sources. From there, the development team should take those new metrics and incorporate them into the unified view. When an issue occurs and you look for the root cause, also stop to see if any specific metrics changed directly before the problem occurred. Tracking those metrics could alert you to a possible outage before it occurs the next time. Problem management then becomes a feedback loop where you identify the root cause, look at the surrounding metrics, and then update the workflows to identify precursors to problems.
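
The "see if any specific metrics changed directly before the problem occurred" step lends itself to a small sketch. The approach shown is an assumption for illustration, not a documented product feature: it compares the last few samples before the incident with the earlier baseline and flags any metric whose average shifted by at least `min_sigma` baseline standard deviations. All names and default values are hypothetical.

```python
from statistics import mean, stdev


def precursor_candidates(series, incident_index, window=3, min_sigma=3.0):
    """Find metrics that shifted in the samples directly before an incident.

    series:         {metric_name: [values in time order]}
    incident_index: position in the series at which the incident occurred
    window:         number of samples before the incident to examine
    """
    start = max(0, incident_index - window)
    candidates = []
    for name, values in series.items():
        baseline = values[:start]
        recent = values[start:incident_index]
        if len(baseline) < 2 or not recent:
            continue  # not enough history to judge this metric
        shift = abs(mean(recent) - mean(baseline))
        spread = stdev(baseline)
        if spread == 0:
            changed = shift > 0  # flat baseline: any movement is notable
        else:
            changed = shift / spread >= min_sigma
        if changed:
            candidates.append(name)
    return candidates


series = {
    "queue_depth": [5, 6, 5, 6, 5, 6, 20, 25, 30],
    "cpu_percent": [40, 42, 41, 40, 42, 41, 41, 40, 42],
}
leading = precursor_candidates(series, incident_index=9)
```

In this made-up data, only `queue_depth` is returned. It would be the candidate to add to the unified view and alert on next time.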

This doesn’t require you to drastically change how you are managing problems. Instead, it just involves adding an extra analytics step that will help with prevention. The metrics you’re tracking through the dashboard will generally fall into three basic buckets:

  • Leading indicators for critical infrastructure
  • Leading indicators for critical applications, and
  • Metrics that reflect end-user experiences

Once you have established the value of finding and visualizing those metrics, the task of problem management becomes proactive, rather than reactive, and the added level of complexity becomes far more manageable.

----
Richard Benoit is an Operations Architect with the VMware Operations Transformation global practice and is based in Michigan.