By Rich Benoit
While ushering in the cloud era means a lot of different things to a lot of different people, one thing is for sure: operations can’t stay the same. To leverage the value and power of the cloud, IT organizations need to:
- Solve the challenge of too many alerts with dynamic thresholds
- Collect the right information
- Understand how to best use the new alerts
- Improve the use of dynamic thresholds
- Ensure the team has the right roles to support the changing environment
These steps can often be addressed by using the functionality within VMware vRealize Operations Manager, as described below.
1) Solve the challenge of too many alerts with dynamic thresholds
In the past, when we tried to alert on the value of a particular metric, the alert tended to generate too many false positives. Since false positives lead to alerts being ignored, we raised the hard threshold for the alert until the false positives stopped. The problem is that users then started calling in before the alert actually triggered, defeating the purpose of the alert in the first place. As a result, we tended to monitor very few metrics, because finding a satisfactory threshold was so difficult.
However, now we can leverage dynamic thresholds generated by analytics. These dynamic thresholds identify the normal range for a wide range of metrics by running competing algorithms that each try to model a metric's behavior over time, and keeping whichever fits best. Some algorithms are time-based, keyed to factors such as day of the week, while others are based on mathematical formulas. The result is a range of expected behavior for each metric for a particular time period.
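The product's actual algorithms are proprietary, but the basic idea of a time-based dynamic threshold can be sketched conceptually. The function below is a hypothetical stand-in: it buckets historical samples by (weekday, hour) and derives a normal range per bucket from the mean and standard deviation.

```python
from collections import defaultdict
from statistics import mean, stdev

def dynamic_thresholds(samples, k=3.0):
    """Compute a normal range per (weekday, hour) time bucket.

    samples: iterable of (weekday, hour, value) observations.
    Returns {(weekday, hour): (low, high)} using mean +/- k * stdev.
    Illustrative only -- a real analytics engine evaluates several
    competing models per metric and keeps the best fit.
    """
    buckets = defaultdict(list)
    for weekday, hour, value in samples:
        buckets[(weekday, hour)].append(value)

    ranges = {}
    for key, values in buckets.items():
        m = mean(values)
        s = stdev(values) if len(values) > 1 else 0.0
        ranges[key] = (m - k * s, m + k * s)
    return ranges
```

A value that falls outside its bucket's range is then an anomaly for that time period, rather than a breach of one static threshold that must fit every hour of every day.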
One of the great uses of dynamic thresholds is identifying an application's signature. For example, they can show that the application always runs slow on Monday mornings or during month-end processing. Each metric outside of the normal signature constitutes an anomaly. If enough anomalies occur, an Early Warning smart alert can be generated within vRealize Operations Manager, indicating that something has changed significantly within the application and that someone should investigate whether there's a problem.
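The early-warning idea, firing only when enough metrics leave their normal ranges at once, can be sketched as a simple counter. This is a hypothetical helper, not the product's API:

```python
def early_warning(observations, ranges, max_anomalies=5):
    """Count metrics outside their normal range and flag if too many.

    observations: {metric_name: (bucket_key, value)} -- current readings.
    ranges: {metric_name: {bucket_key: (low, high)}} -- dynamic thresholds.
    max_anomalies: how many simultaneous anomalies warrant an alert
                   (an assumed tunable, not a product default).
    Returns (anomaly_count, alert_fired).
    """
    anomalies = 0
    for metric, (bucket, value) in observations.items():
        low, high = ranges[metric][bucket]
        if not (low <= value <= high):
            anomalies += 1
    return anomalies, anomalies >= max_anomalies
```

Counting anomalies across many metrics is what keeps this kind of alert from inheriting the false-positive problem of single-metric hard thresholds.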
2) Collect the right information
As we move from more traditional, client-server era environments to cloud era environments, many teams still use monitoring that has been optimized for the previous era (and tends to be siloed and component-based, too).
It’s not enough to just look at what’s happening with a particular domain or what’s going on with up-down indicators. In the cloud era, you need to look at performance that’s more aligned with the business and the user experience, and move away from a view focused on a particular functional silo or resource.
By putting those metrics into a form an end user can relate to, you give your audience better visibility and improve their experience. For example, suppose you measure the response time of a particular transaction. When a user calls in and says, “It’s slow today,” you can check the dynamic thresholds generated by the analytics, which show the normal behavior for that transaction and time period. If the response times are within the normal range, you can show the user that although the system may seem slow, this is the expected behavior. If, on the other hand, the response times are higher than normal, a ticket can be generated for the appropriate support team to investigate. Ideally, if a KPI Smart Alert had been set up within vRealize Operations Manager for that transaction’s response time, the system would already have generated an alert that was being investigated before the user called.
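That triage decision reduces to comparing the reported response time against the normal range for the current time period. A minimal sketch, with hypothetical names and a (low, high) range as produced by a dynamic-threshold model:

```python
def triage(response_time, normal_range):
    """Decide the response to a slowness report using a dynamic threshold.

    response_time: the measured transaction response time.
    normal_range: (low, high) expected for this transaction and period.
    Returns an action label. In practice, a KPI Smart Alert on the same
    metric would usually have fired before the user even calls in.
    """
    low, high = normal_range
    if response_time <= high:
        return "within-normal"   # slow, perhaps, but expected for this period
    return "open-ticket"         # above normal: route to the support team
```

The point is that “slow” is judged against the transaction’s own signature for that time period, not against a gut feeling or a one-size-fits-all limit.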
3) Understand how to best use the new alerts
You may be wondering: now that I have these great new alerts enabled by dynamic thresholds, how can I best leverage them? Although they are far more actionable than previous metric-based alerts, they may still need some form of human interaction to make sure the proper action is taken. For example, it is often suggested that when a particular cluster in a virtualized environment starts having performance issues, an alert should be generated that automatically bursts its capacity. The problem with this approach is that although performance issues can indicate a capacity shortfall, they can also indicate a break in the environment.
The idea is to give the user as much info as they need when an alert is generated to make a quick, well-informed decision and then have automations available to quickly and accurately carry out their decision. Over time, automations can include more and more intelligence, but it’s still hard to replace the human touch when it comes to decision making.
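One way to picture this human-in-the-loop pattern: the alert carries its context, pre-built automations stand ready, and a decision function, here a stand-in for the human operator, chooses which one to run. All names below are hypothetical:

```python
def handle_alert(alert, automations, decide):
    """Run the automation a decision-maker selects for an alert.

    alert: dict of alert context (metric, cluster, symptoms, ...).
    automations: {name: callable} of pre-built remediations,
                 e.g. adding capacity vs. opening a fault investigation.
    decide: callable(alert) -> automation name; stands in for the
            human reviewing the alert's context before acting.
    """
    choice = decide(alert)
    return automations[choice](alert)
```

The automation library can grow smarter over time, but the `decide` step stays explicit: the person sees the context and picks the action, and the machinery merely executes that choice quickly and consistently.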
4) Improve the use of dynamic thresholds
A lot of monitoring tools are used after an issue materializes. But implementing proactive processes gives you the opportunity to identify or fix an issue before it impacts users. It’s essential that the link to problem management be very strong so processes can be tightly integrated, as shown in figure 1.
Figure 1: Event incident problem cycle
During the Problem Management Root Cause Analysis process, behaviors or metrics are often identified that are leading indicators for imminent impacts to the user experience. As mentioned earlier, vRealize Operations Manager, as the analytics engine, can create both KPI and Early Warning smart alerts, at the infrastructure, application, and end-user level to alert on these behaviors or metrics. By instrumenting these key metrics within the tool you can create actionable alerts in the environment.
5) Ensure the team has the right roles to support the changing environment
With the newfound abilities enabled by an analytics engine like vRealize Operations Manager, roles and team structure become more critical. As shown in figure 2 below, the analyst role identifies and documents opportunities for improvement, and reports on the KPIs that indicate the effectiveness of the alerts already in place. In addition, developers are needed to build the new alerts and other content within vRealize Operations Manager.
Figure 2: New roles to support the changing environment
In a small organization, one person may be performing all of these functions, while in a larger organization, an entire team may perform a single role. This structure can be flexible depending on the size of the organization, but these roles are all critical to leveraging the capabilities of vRealize Operations Manager.
By implementing the right metrics, right KPIs, right level of automation, and putting the right team in place, you’ll be primed for success in the cloud era.
Richard Benoit is an Operations Architect with the VMware Operations Transformation global practice.