Home > Blogs > VMware CloudOps

3 Steps to Get Started with Cloud Event, Incident, and Problem Management

By Rich Benoit

Benoit-cropWe are now well entrenched in the Age of Software. Regardless of the industry, there is someone right now trying to develop software that will turn that industry on its head. Previously, companies worked with one app that had the infrastructure along with it. It was all one technology, and one vendor’s solutions. Now there are tiers all over the place, and the final solution uses multiple components and technologies, as well as virtualization. This app is a shape shifter, one that changes based on the needs of the business. When application topology is changing like this over time, it creates a major challenge for event, incident, and problem management.

Addressing that challenge involves three major steps that will affect the people, processes, and technologies involved in managing your app.

1. Visualize with unified view
The standard approach to monitoring is often component- or silo-focused. This worked well when apps were vertical where an entire application was on one server; but with a new, more horizontal app that spans multiple devices and technologies – physical, virtual, web – you need a unified view that shows all tiers and technologies of an application. That view has to aggregate a wide range of data sources in a meaningful way, and then identify new metrics and metric sources. The rule of thumb should be that each app gets its own set of dashboards: “big screen” dashboards for the operations center that shows actionable information for event and incident management; detailed interactive dashboards that allow the application support team to drill down into their app; and management level dashboards that show a summary business view of application health and KPIs.

By leveraging these dashboards, event and incident management teams can pull up in real time to diagnose any issues that arise (see example below). Visualization is key in this approach, because it allows you to coordinate the data in a way that will actually allow for identification of events, incidents, and problems.

big screen dbVMware® vCenter™ Operations Manager™ “big screen” dashboard

2. Aggregate
When you’re coordinating a number of distributed apps, establishing timelines and impact becomes a much more complicated process. Here’s where your unified view can start to help identify problems before they occur. Track any changes that occur, and then map them back to any changes that have happened. When I’m working with clients, I demonstrate the VMware® vCenter™ Operations Manager™ ability to establish dynamic thresholds. The dynamic thresholds track back what constitutes common fluctuations, and leverages those analytics to establish baselines around what constitutes “normal.” By looking at the overall data in a big picture, the software can avoid false triggering around normal events.

3. Leveraging Problem Management
Ideally, you will be catching events and incidents before they result in downtime. However, that requires constantly looking for new metrics and metrics sources to create a wider view of the app. Problem management teams should be trained to identify opportunities for new metrics and new metrics sources. From there, the development team should take those new metrics and incorporate them into the unified view. When an issue occurs, and you look for the root cause, also stop to see if any specific metrics changed directly before the problem occurred. Tracking those metrics could alert you to a possible outage before it occurs the next time. Problem management then becomes a feedback loop where you identify the root cause, look at the surrounding metric, and then update the workflows to identify precursors to problems.

This doesn’t require you to drastically change how you are managing problems. Instead, it just involves adding an extra analytics step that will help with prevention. The metrics you’re tracking through the dashboard will generally fall into three basic buckets:

  • Leading indicators for critical infrastructure
  • Leading indicators for critical application, and
  • Metrics that reflect end-user experiences

Once you have established the value of finding and visualizing those metrics, the task of problem management becomes proactive, rather than reactive, and the added level of complexity becomes far more manageable.

Richard Benoit is an Operations Architect with the VMware Operations Transformation global practice and is based in Michigan.