
Problem Management with vCenter Operations: Dealing with Events and Incidents Before They Impact Users

By: Pierre Moncassin

In some more traditional IT environments, if you have “problem manager” anywhere near your job title, you are probably faced with formidable challenges.

Let me guess… your mission is to steer the IT infrastructure clear of forthcoming issues – sometimes referred to as root causes – that will lead to incidents. Most of the time, though, you can only see what occurred in the past. To take a page from the famous TV series, an incident has occurred and detective Columbo is called to the scene. "What has occurred?" he asks. "Is there a pattern? Did anyone notice other incidents occurring around the same time?"

That kind of thing you can probably do in your sleep. But however talented a detective you may be, this fact remains: You likely have little visibility into future incidents. You see some clues scattered around (also known as alerts), but these alerts cannot be readily interpreted without hours of manual work.

Fortunately, a tool like vCenter Operations Manager allows you to accelerate the scenario for Problem Management. Think of it as an assistant that can connect all the clues together and link them to potential suspects (root causes). The groundwork is done for you so that you can focus on the truly proactive work.

But vCenter Operations Manager pushes the envelope even further. Proactive analytics can detect impending outages before users are impacted. In detective terms, not only can you identify the suspects faster, you get an advance notice on their next move.

Now enough theory – let’s see how that works in practice.

Fig 1.
First off, let us look at the Health Badge (Fig. 1), which is built in as standard with vCenter Operations Manager. It is a dashboard that can provide you with instant visibility into the current state of the infrastructure. You can not only identify immediate issues but also use proactive capabilities like the Risk Badge to detect which areas of the infrastructure might fail in the future. In a nutshell: You don’t need to wait for an outage before responding.

Fig 2.
Another way to identify potential issues is by setting up Early Warning Smart Alerts in vCenter Operations Manager. These are alerts designed to tell you that some infrastructure components underpinning your cloud services are not operating “normally”. Unless it’s a traditional incident/response scenario, your overall service may well be operating perfectly fine – but the alert tells you that an issue will soon need attention and gives you a chance to be proactive about it.

vCenter Operations Manager (vCOps) uses advanced analytics to determine whether a component is operating within a “norm.” For now, it’s enough to say that once vCOps detects that the number of “abnormal” components has gone beyond a certain threshold, an Early Warning Smart Alert is issued. It is the signal for the detective (a.k.a. the Problem Manager) to start investigating.
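To make the idea of a “norm” a little more concrete, here is a minimal sketch in Python. It is not the analytics that vCenter Operations Manager actually runs; it simply assumes that a metric is “abnormal” when its current value sits more than k standard deviations from its recent average, and that a warning fires when more than a chosen fraction of metrics are abnormal. The function names, the value of k and the 25% fraction are all illustrative assumptions.

```python
from statistics import mean, stdev


def is_abnormal(history, current, k=3.0):
    """Return True when `current` lies outside mean +/- k * stdev of `history`.

    Assumption: a simple static band stands in for the product's analytics.
    """
    if len(history) < 2:
        return False  # not enough data to establish a norm
    mu = mean(history)
    sigma = stdev(history)
    return abs(current - mu) > k * sigma


def early_warning(metrics, abnormal_fraction=0.25, k=3.0):
    """Decide whether to raise a (hypothetical) early warning.

    `metrics` maps a metric name to a (history, current_value) pair.
    Returns (triggered, sorted list of abnormal metric names).
    """
    abnormal = sorted(
        name
        for name, (history, current) in metrics.items()
        if is_abnormal(history, current, k)
    )
    triggered = bool(metrics) and len(abnormal) / len(metrics) > abnormal_fraction
    return triggered, abnormal


if __name__ == "__main__":
    sample = {
        "cpu": ([50, 52, 48, 51, 49], 95),
        "mem": ([60, 61, 59, 60, 62], 61),
        "disk": ([10, 11, 9, 10, 10], 40),
        "net": ([5, 6, 5, 4, 5], 5),
    }
    print(early_warning(sample))
```

In this sample, the CPU and disk readings fall well outside their recent range while memory and network do not, so half of the metrics are abnormal and the warning is raised.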

As soon as a potential issue is identified, you can drill into potential root causes (as shown in Fig. 2, right-hand side). It is only a short step then from detection to active prevention and remediation. If the vCenter Configuration Manager (vCM) toolset is also deployed, you can directly access the virtual infrastructure configuration and review what recent change events have occurred. If the issue is related to a known change event within vCM, you may be able to roll back the change with a single command.
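The detective work of matching an alert to recent changes can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how vCM exposes its data: the event records, their field names and the 24-hour lookback window are assumptions made for the example.

```python
from datetime import datetime, timedelta


def recent_changes(alert_time, change_events, lookback=timedelta(hours=24)):
    """Return change events recorded within `lookback` before the alert.

    Each event is a dict with (hypothetical) keys "time" and "description".
    Results are ordered newest first, since the latest change is often
    the first suspect worth questioning.
    """
    window_start = alert_time - lookback
    candidates = [
        event
        for event in change_events
        if window_start <= event["time"] <= alert_time
    ]
    return sorted(candidates, key=lambda event: event["time"], reverse=True)


if __name__ == "__main__":
    alert = datetime(2013, 6, 10, 14, 0)
    events = [
        {"time": datetime(2013, 6, 8, 9, 0), "description": "firmware update"},
        {"time": datetime(2013, 6, 10, 11, 30), "description": "memory limit changed"},
        {"time": datetime(2013, 6, 9, 20, 0), "description": "vSwitch reconfigured"},
        {"time": datetime(2013, 6, 10, 15, 0), "description": "change made after alert"},
    ]
    for event in recent_changes(alert, events):
        print(event["time"], event["description"])
```

Only the two changes made in the 24 hours leading up to the alert are returned; the older firmware update and the change made after the alert are left out of the list of suspects.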

In summary, the toolsets not only accelerate detection, they also allow you to take appropriate preventative actions.

Right, but is it always that easy? Not always, of course. Sometimes so many alerts are triggered at once (so-called “Alert Storms”) that the root cause becomes harder to identify. But again, the good news is that there are known ways to cut down the noise – see our earlier blog, “Tips for Using KPIs to Filter Noise with vCenter Operations Manager” for more details.

The bottom line is that if you are a Problem Manager using vCenter Operations Manager, you will see your work increasingly shifting from reactive to proactive tasks. This is because you can let automation do the groundwork. (I digress a little here, but you will find that the same happens across many traditional IT roles when moving to a vCloud Infrastructure. Less time spent on physical-world “nuts and bolts” frees more time for proactive planning. By the way, if you are curious to see how the roles evolve, check out our “Organizing for the Cloud” white paper.)

In conclusion, here are three technical reasons why VMware vCenter Operations Manager will be a game-changer for you:

  • You will accelerate root cause analysis with instant drill-down access into infrastructure issues that may impact your overall services.
  • You get a comprehensive view of the infrastructure situation via visual summaries, like the Health dashboards.
  • Last but not least, you leverage proactive analytics to get an early notice of impending incidents. Now that is something that even detective Columbo did not have.

Follow @VMwareCloudOps on Twitter for future updates, and join the conversation by using the #CloudOps and #SDDC hashtags on Twitter.