Tag Archives: vRealize Operations Manager

How to Take Charge of Incident Ticket Ping Pong

By Pierre Moncassin

When incident tickets are repeatedly passed from one support team to another, I like to describe it as a “ping pong” situation. Most often this is not a lack of accountability or skills within individual teams. Each team genuinely fails to see the incident as relevant to their technical silo. They each feel perfectly legitimate in either assigning the ticket to another team, or even assigning it back to the team they took it from.

And the ping pong game continues.

Unfortunately for the end user, the incident is not resolved whilst the reassignments continue. The situation can easily escalate into SLA breaches, financial penalties, and certainly disgruntled end users.

How can you prevent such situations? IT service management (ITSM) has been around for a long while, and there are known mitigations to handle these situations. Good ITSM practice would dictate some type of built-in mechanisms to prevent incidents being passed back and forth. For example:

  • Define end-to-end SLAs for incident resolution (not just KPIs for each resolution team), and make each team aware of these SLAs.
  • Configure the service desk tool to escalate automatically (and issue alerts) after a number of reassignments, so that management becomes quickly aware of the situation.
  • Include cross-functional resolution teams as part of the resolution process (as is often done for major incident situations).
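
The second mitigation above, automatic escalation after repeated reassignments, can be sketched in a few lines of Python. This is only an illustration, not the configuration of any particular service desk tool: the `Ticket` class, the `reassign` and `is_ping_pong` helpers, and the limit of three reassignments are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical policy value: escalate once a ticket has been reassigned this many times.
MAX_REASSIGNMENTS = 3


@dataclass
class Ticket:
    ticket_id: str
    assigned_team: str
    history: list = field(default_factory=list)  # teams that previously held the ticket
    escalated: bool = False


def reassign(ticket, new_team, max_reassignments=MAX_REASSIGNMENTS):
    """Move the ticket to new_team and escalate when the limit is reached.

    Returns True only on the call that triggers the escalation, so the
    caller can alert management exactly once.
    """
    ticket.history.append(ticket.assigned_team)
    ticket.assigned_team = new_team
    if not ticket.escalated and len(ticket.history) >= max_reassignments:
        ticket.escalated = True
        return True
    return False


def is_ping_pong(ticket):
    """True if the ticket is back with a team that already held it."""
    return ticket.assigned_team in ticket.history


ticket = Ticket("INC-1", "network")
reassign(ticket, "storage")
reassign(ticket, "network")          # back where it started
print(is_ping_pong(ticket))          # the ticket has bounced
print(reassign(ticket, "storage"))   # third reassignment triggers escalation
```

Counting reassignments per ticket, rather than per team, matches the end-to-end view of the first mitigation: what matters is how long the incident has been bouncing, not which team holds it at the moment.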

In my opinion, these approaches share a drawback: they take time and effort to put in place, and incidents may still fall through the cracks. But with a cloud management platform like VMware vRealize Suite, you can take prevention to another level.

A core reason for ping pong situations often lies in the teams’ inability to pinpoint the root cause of the incident. VMware vRealize Operations Manager (formerly known as vCenter Operations Manager) provides increased visibility into the root cause through its root cause analysis capabilities. Going one step further, vRealize Operations Manager uses its analytical capabilities to give advance warning of impending incidents. In the most efficient scenario, support teams are warned of the impending incident and its cause well ahead of the incident being raised. Most of the time, the incident ping pong game should never start.

Takeaways:

  • Build a solid foundation with the classic ITSM approaches based on SLAs and assignment rules.
  • Leverage proactive resolution, and take advantage of enhanced root cause analysis that vRealize Operations Manager offers via automation to reduce time wasted on incident resolution.


Pierre Moncassin is an operations architect with the VMware Operations Transformation global practice and is based in Taipei. Follow @VMwareCloudOps on Twitter for future updates.

 

Leveraging Proactive Analytics to Optimize IT Response

By Rich Benoit

While ushering in the cloud era means a lot of different things to a lot of different people, one thing is for sure: operations can’t stay the same. To leverage the value and power of the cloud, IT organizations need to:

  1. Solve the challenge of too many alerts with dynamic thresholds
  2. Collect the right information
  3. Understand how to best use the new alerts
  4. Improve the use of dynamic thresholds
  5. Ensure the team has the right roles to support the changing environment

These steps can often be addressed by using the functionality within VMware vRealize Operations Manager, as described below.

1) Solve the challenge of too many alerts with dynamic thresholds
In the past, when we tried to alert based on the value of a particular metric, we found that it tended to generate too many false positives. Since false positives tend to lead to alerts being ignored, we raised the hard threshold for the alert until we no longer got false positives. The problem was that users were then calling in before the alert actually triggered, defeating the purpose of the alert in the first place. As a result, we tended to monitor very few metrics because of the difficulty in finding a satisfactory threshold.

However, now we can leverage dynamic thresholds generated by analytics. These dynamic thresholds identify the normal range for a wide range of metrics according to the results of competing algorithms, each of which tries to best model the behavior of a metric over time. Some algorithms are based on time, such as day of the week, while others are based on mathematical formulas. The result is a range of expected behavior for each metric for a particular time period.

One of the great use cases for dynamic thresholds is that they identify the signature of applications. For example, they can show that the application always runs slow on Monday mornings or during month-end processing. Each metric outside of the normal signature constitutes an anomaly. If enough anomalies occur, an early warning smart alert can be generated within vRealize Operations Manager that indicates that something has changed significantly within the application and someone should investigate to see if there’s a problem.
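
To make the idea concrete, here is a minimal Python sketch of a time-based dynamic threshold. It is not the algorithm vRealize Operations Manager uses, which this article does not describe; it is a simple stand-in that learns a normal range per (weekday, hour) bucket as the mean plus or minus `k` standard deviations. The function names, the value of `k`, and the anomaly count needed for an early warning are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def learn_ranges(samples, k=3.0):
    """Learn a normal range for each (weekday, hour) bucket.

    samples: iterable of (weekday, hour, value) tuples from past observations.
    Returns {(weekday, hour): (low, high)} using mean +/- k standard deviations.
    """
    buckets = defaultdict(list)
    for weekday, hour, value in samples:
        buckets[(weekday, hour)].append(value)

    ranges = {}
    for key, values in buckets.items():
        if len(values) < 2:
            continue  # not enough history to estimate a spread
        centre, spread = mean(values), stdev(values)
        ranges[key] = (centre - k * spread, centre + k * spread)
    return ranges


def is_anomaly(ranges, weekday, hour, value):
    """True if value falls outside the learned range for that time bucket."""
    bounds = ranges.get((weekday, hour))
    if bounds is None:
        return False  # no baseline yet, so nothing to compare against
    low, high = bounds
    return value < low or value > high


def early_warning(anomaly_flags, min_anomalies=3):
    """Raise an early warning once enough metrics are anomalous at the same time."""
    return sum(anomaly_flags) >= min_anomalies


# Hypothetical response times in ms: slow on Monday (0) at 9:00, fast on Tuesday (1).
history = [(0, 9, v) for v in (900, 950, 1000, 1050, 1100)]
history += [(1, 9, v) for v in (200, 210, 190, 205, 195)]
ranges = learn_ranges(history)

print(is_anomaly(ranges, 0, 9, 1000))  # normal for a Monday morning
print(is_anomaly(ranges, 1, 9, 1000))  # far outside the Tuesday range
```

The same 1000 ms reading is expected on Monday morning and anomalous on Tuesday, which is exactly what a single hard threshold cannot express.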

2) Collect the right information
As we move from more traditional, client-server era environments to cloud era environments, many teams still use monitoring that has been optimized for the previous era (and tends to be siloed and component-based, too).

It’s not enough to just look at what’s happening with a particular domain or what’s going on with up-down indicators. In the cloud era, you need to look at performance that’s more aligned with the business and the user experience, and move away from a view focused on a particular functional silo or resource.

By putting those metrics into a form that an end user can relate to, you can give your audience better visibility and improve their experience. For example, if you were to measure the response time of a particular transaction, when a user calls in and says, “It’s slow today,” you can check the dynamic thresholds generated by the analytics that show the normal behavior for that transaction and time period. If indeed the response times are within the normal range, you can show the user that although the system may seem slow, it’s the expected behavior. If on the other hand the response times are higher than normal, a ticket could be generated for the appropriate support team to investigate. Ideally, the system would have already generated an alert that was being researched if a KPI Smart Alert had been set up within vRealize Operations Manager for that transaction response time.

3) Understand how to best use the new alerts

You may be wondering: Now that I have these great new alerts enabled by dynamic thresholds, how can I best leverage them? Although they are far more actionable than previous metric-based alerts, the new alerts may still need some form of human interaction to make sure that the proper action is taken. For example, it is often suggested that when a particular cluster in a virtualized environment starts having performance issues, an alert should be generated that automatically bursts its capacity. The problem with this approach is that although performance issues can indicate a capacity issue, they can also indicate a break in the environment.

The idea is to give the user as much info as they need when an alert is generated to make a quick, well-informed decision and then have automations available to quickly and accurately carry out their decision. Over time, automations can include more and more intelligence, but it’s still hard to replace the human touch when it comes to decision making.

4) Improve the use of dynamic thresholds
A lot of monitoring tools are used after an issue materializes. But implementing proactive processes gives you the opportunity to identify or fix an issue before it impacts users. It’s essential that the link to problem management be very strong so processes can be tightly integrated, as shown in figure 1.

Figure 1: Event incident problem cycle

During the Problem Management Root Cause Analysis process, behaviors or metrics are often identified that are leading indicators of imminent impacts to the user experience. As mentioned earlier, vRealize Operations Manager, as the analytics engine, can create both KPI and Early Warning smart alerts at the infrastructure, application, and end-user levels to alert on these behaviors or metrics. By instrumenting these key metrics within the tool, you can create actionable alerts in the environment.

5) Ensure the team has the right roles to support the changing environment
With the newfound abilities enabled by an analytics engine like vRealize Operations Manager, the roles and their structure become more critical. As shown in figure 2 below, the analyst role should be there to identify and document opportunities for improvement, as well as to report on the KPIs that indicate the effectiveness of the alerts already in place. In addition, developers are needed to develop the new alerts and other content within vRealize Operations Manager.

Figure 2: New roles to support the changing environment

In a small organization, one person may be performing all of these functions, while in a larger organization, an entire team may perform a single role. This structure can be flexible depending on the size of the organization, but these roles are all critical to leveraging the capabilities of vRealize Operations Manager.

By implementing the right metrics, the right KPIs, and the right level of automation, and by putting the right team in place, you’ll be primed for success in the cloud era.

Richard Benoit is an Operations Architect with the VMware Operations Transformation global practice.