
Tag Archives: Rich Benoit

4 Ways to Maximize the Value of VMware vRealize Operations Manager

By Rich Benoit

When installing an enterprise IT solution like VMware vRealize Operations Manager (formerly vCenter Operations Manager), supporting the technology implementation with people and process changes is paramount to your organization’s success.

We all have to think about impacts beyond the technology any time we make a change to our systems, but enterprise products require more planning than most. Take, for example, the difference between installing VMware vSphere and installing an enterprise product. The users affected by vSphere generally sit in one organization, the toolset is fairly simple, little to no training is required, and the time from installation to extracting value is a matter of days. Extend this thinking to enterprise products and you have many more users and groups affected, a much more complex toolset, training required for most users, and weeks or months from deployment to extracting real value. Breaking it down like this, it’s easy to see why you need to address supporting teams and processes to maximize value.

Here’s a recent example from a technology client I worked with, one that is very typical of the customers I talk to. Management felt they were getting very little value from vRealize Operations Manager. Here’s what I learned:

  • Application dashboards in vRealize Operations Manager were not being used (despite extensive custom development).
  • The only team using the tool was the virtual infrastructure team (very typical).
  • They had not defined roles or processes to enable the technology to succeed outside of the virtual infrastructure team.
  • There was no training or documentation for ongoing operations.
  • The customer was not enabled to maintain or expand the tool or its content.

My recommendations were as follows, and this goes for anyone implementing vRealize Operations Manager:

  1. Establish ongoing training and documentation for all users.
  2. Establish an analyst role to define, measure, and report on processes and effectiveness related to vRealize Operations Manager, and to build relationships with potential users and the process areas that vRealize Operations Manager content can serve.
  3. Establish a developer role to create and modify content based on the analyst’s collected requirements and fully leverage the extensive functionality vRealize Operations Manager provides.
  4. Establish an architecture board to coordinate an overall enterprise management approach, including vRealize Operations Manager.

The key takeaway here: IT transformation isn’t a plug-and-play proposition, and technology alone isn’t enough to make it happen. This applies especially to a potentially enterprise-level tool like vRealize Operations Manager. In order to maximize value and avoid it becoming just another silo-based tool, think about the human and process factors. This way you’ll be well on the way towards true transformational success for your enterprise.

----
Rich Benoit is an Operations Architect with the VMware Operations Transformation global practice.

3 Steps to Get Started with Cloud Event, Incident, and Problem Management

By Rich Benoit

We are now well entrenched in the Age of Software. Regardless of the industry, there is someone right now trying to develop software that will turn that industry on its head. Previously, companies worked with one app that had the infrastructure along with it. It was all one technology and one vendor’s solution. Now there are tiers all over the place, and the final solution uses multiple components and technologies, as well as virtualization. The app is a shape shifter, one that changes based on the needs of the business. When application topology changes like this over time, it creates a major challenge for event, incident, and problem management.

Addressing that challenge involves three major steps that will affect the people, processes, and technologies involved in managing your app.

1. Visualize with a unified view
The standard approach to monitoring is often component- or silo-focused. This worked well when apps were vertical and an entire application lived on one server; but with a new, more horizontal app that spans multiple devices and technologies – physical, virtual, web – you need a unified view that shows all tiers and technologies of an application. That view has to aggregate a wide range of data sources in a meaningful way, and then identify new metrics and metric sources. The rule of thumb should be that each app gets its own set of dashboards: “big screen” dashboards for the operations center that show actionable information for event and incident management; detailed interactive dashboards that allow the application support team to drill down into their app; and management-level dashboards that show a summary business view of application health and KPIs.

By leveraging these dashboards, event and incident management teams can pull up real-time views to diagnose any issues that arise (see example below). Visualization is key in this approach, because it brings the data together in a way that actually allows you to identify events, incidents, and problems.

VMware® vCenter™ Operations Manager™ “big screen” dashboard
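
To make the idea of a unified view concrete, here is a minimal sketch of pulling metric samples from several sources into a single per-application, per-tier view. The collector functions, application name, and metric names are invented for illustration; a real implementation would pull from vRealize Operations adapters or whatever monitoring APIs you already have in place.

```python
# Minimal sketch: merge metric samples from several hypothetical collectors
# into one per-application, per-tier view. Names and values are invented.
from collections import defaultdict

def collect_hypervisor_metrics():
    # Stand-in for a virtual-infrastructure data source
    return [("orders-app", "web", "cpu_ready_pct", 3.2),
            ("orders-app", "db", "cpu_ready_pct", 11.5)]

def collect_web_metrics():
    # Stand-in for a web/end-user data source
    return [("orders-app", "web", "response_time_ms", 640)]

def unified_view(sources):
    """Group every metric sample by application and tier."""
    view = defaultdict(lambda: defaultdict(dict))
    for source in sources:
        for app, tier, metric, value in source():
            view[app][tier][metric] = value
    return view

if __name__ == "__main__":
    view = unified_view([collect_hypervisor_metrics, collect_web_metrics])
    for app, tiers in view.items():
        print(app)
        for tier, metrics in tiers.items():
            print(f"  {tier}: {metrics}")
```

However it is assembled, the point is the same: every tier of the application lands in one view instead of living in a separate silo tool.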

2. Aggregate
When you’re coordinating a number of distributed apps, establishing timelines and impact becomes a much more complicated process. Here’s where your unified view can start to help identify problems before they occur. Track any issues that arise, and then map them back to any changes that have happened in the environment. When I’m working with clients, I demonstrate the ability of VMware® vCenter™ Operations Manager™ to establish dynamic thresholds. Dynamic thresholds track what constitutes common fluctuation in each metric and leverage those analytics to establish baselines around what constitutes “normal.” By looking at the data as a whole, the software can avoid false alerts on normal events.
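
VMware doesn’t publish the details of those analytics, so as a rough illustration only, here is a minimal sketch of the general idea behind a dynamic threshold: learn a “normal” band for each hour of the week from history, then flag samples that fall outside it. The bucket size, the three-sigma band, and the data layout are all assumptions for the example, not the product’s actual algorithm.

```python
# Simplified stand-in for a dynamic threshold, NOT the product's actual
# algorithm: learn a "normal" band per (weekday, hour) bucket from history,
# then flag new samples that fall outside their bucket's band.
from collections import defaultdict
from statistics import mean, stdev

def build_baseline(samples, k=3.0):
    """samples: list of (datetime, value). Returns {(weekday, hour): (low, high)}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    bands = {}
    for bucket, values in buckets.items():
        if len(values) >= 2:                      # need at least two points for stdev
            m, s = mean(values), stdev(values)
            bands[bucket] = (m - k * s, m + k * s)
    return bands

def is_anomalous(ts, value, bands):
    """True if the sample sits outside the learned band for its time bucket."""
    band = bands.get((ts.weekday(), ts.hour))
    if band is None:
        return False                              # no history yet; assume normal
    low, high = band
    return not (low <= value <= high)
```

The real analytics are far more sophisticated than a three-sigma band, but the principle holds: the threshold comes from observed behavior, not from a number someone guessed.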

3. Leverage problem management
Ideally, you will be catching events and incidents before they result in downtime. However, that requires constantly looking for new metrics and metric sources to create a wider view of the app. Problem management teams should be trained to identify opportunities for new metrics and new metric sources. From there, the development team should take those new metrics and incorporate them into the unified view. When an issue occurs and you look for the root cause, also check whether any specific metrics changed directly before the problem occurred. Tracking those metrics could alert you to a possible outage before it occurs the next time. Problem management then becomes a feedback loop where you identify the root cause, look at the surrounding metrics, and then update the workflows to identify precursors to problems.
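
As a sketch of that extra analytics step, the fragment below compares each metric’s behavior in the window just before an incident against its earlier history and reports the ones that shifted sharply. The data layout, window size, and ratio test are assumptions made for the example.

```python
# Hypothetical example of hunting for precursor metrics: compare each metric's
# average in the window just before an incident with its earlier average and
# report the ones that moved sharply. Data layout and thresholds are invented.
from datetime import timedelta
from statistics import mean

def precursor_metrics(history, incident_time, window=timedelta(minutes=15), factor=2.0):
    """history: {metric_name: [(timestamp, value), ...]}.
    Returns [(metric, earlier_avg, pre_incident_avg)] for metrics that shifted."""
    suspects = []
    for name, samples in history.items():
        earlier = [v for t, v in samples if t < incident_time - window]
        just_before = [v for t, v in samples if incident_time - window <= t < incident_time]
        if not earlier or not just_before:
            continue
        base, recent = mean(earlier), mean(just_before)
        if base == 0:
            continue                                # avoid division by zero
        ratio = recent / base
        if ratio > factor or ratio < 1.0 / factor:  # sharp rise or sharp drop
            suspects.append((name, base, recent))
    return suspects
```

Metrics this surfaces are candidates to promote into the unified view, so the same pattern raises an alert before the next outage rather than after it.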

This doesn’t require you to drastically change how you are managing problems. Instead, it just involves adding an extra analytics step that will help with prevention. The metrics you’re tracking through the dashboard will generally fall into three basic buckets:

  • Leading indicators for critical infrastructure
  • Leading indicators for critical applications, and
  • Metrics that reflect end-user experiences

Once you have established the value of finding and visualizing those metrics, the task of problem management becomes proactive, rather than reactive, and the added level of complexity becomes far more manageable.

----
Richard Benoit is an Operations Architect with the VMware Operations Transformation global practice and is based in Michigan.

The Lowly Metric Has Its Day in the Sun

By Rich Benoit

Back in the day, I would have killed for a tool like vCOps, an analytics tool that uses dynamic thresholds to make sense of the myriad activity metrics that exist in an IT environment. Without dynamic thresholds that identify normal behavior, admins like me were forced to use static thresholds that never seemed to work quite right. Static thresholds tended either to be set too low, resulting in false positives, or too high, so that by the time they were tripped, the support desk had already started receiving calls from disgruntled users.

Tried, but Failed

  • One approach I tried in order to make sense of the cloud of data coming from multiple monitoring tools was to combine several metrics to get a more holistic view. Combined metrics also rely on static thresholds and are similarly plagued with false positives. But they introduce the additional problem of having to figure out which of the underlying metrics actually caused the alarm to trip.
  • Another approach I tried was end-user experience monitoring, or end-to-end application monitoring. Instead of trying to estimate the performance of an application by looking at the sum of all of its components, I could look at the simulated response time for a typical user and transaction. Another end-to-end monitoring tactic was to employ passive application sniffers that would record the response time of transactions. But with both approaches, I was still dependent on static hard thresholds that were invariably exceeded on a regular basis. For example, it wouldn’t be unusual for an application to exceed its 2-second response time goal during regular periods of peak usage. So I had to know when it was normal to exceed the allowed threshold. In other words, I had to know when to ignore the alarms (see the sketch after this list).
  • Static thresholds also impacted performance monitoring. Other admins would ask, “Did this just start?” or “Is the performance issue the result of a change in the environment?” The monitoring tools wouldn’t provide the needed data. So we would have to roll up our sleeves and try to figure out what happened. Meanwhile the system would be down or just struggling along. Many times the problem would go away after a certain amount of time or after a reboot, only to resurface another day.
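
To make that false-positive problem concrete, here is a tiny illustration, with an invented traffic pattern, of how a fixed 2-second response-time threshold pages someone during every routine peak even though nothing is wrong:

```python
# Illustration only: a hard 2-second response-time limit fires during every
# normal peak period. The hourly response times below are invented.
ALERT_THRESHOLD_SECONDS = 2.0

# (hour of day, typical response time in seconds) for a hypothetical application
typical_day = [(9, 1.2), (12, 2.6), (13, 2.8), (15, 1.5), (20, 0.9)]

for hour, response_time in typical_day:
    if response_time > ALERT_THRESHOLD_SECONDS:
        # A static threshold pages someone for routine lunchtime peaks.
        print(f"{hour:02d}:00  ALERT: {response_time:.1f}s exceeds {ALERT_THRESHOLD_SECONDS:.1f}s")
    else:
        print(f"{hour:02d}:00  ok ({response_time:.1f}s)")
```

A threshold that knew what “normal” looked like at lunchtime would stay quiet here, yet still fire if the same response time showed up at 3 a.m.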

In the end, except for a few cases, we just turned off the monitors and alarms.

A Better Approach

That is why I would have killed for vCOps. vCenter Operations Management Suite is built on an open and extensible platform that works with physical and virtual machines. It is a single solution that works with a variety of hypervisors and fits either on-premises or public cloud environments.

It collects and stores metrics over time and works behind the scenes to establish dynamic thresholds. It employs around 18 different algorithms that compete to best fit any one of the millions of metrics it can track. Some algorithms are based on time intervals and others on mathematical models.
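
I won’t enumerate those algorithms here, but the “compete to best fit” idea is easy to illustrate: score a few candidate baseline models against a metric’s history and keep whichever predicts it best. The three candidates below are deliberately simple stand-ins, not what the product actually uses.

```python
# Toy illustration of competing models: fit a few simple candidates to a
# metric's history and keep the one with the lowest mean absolute error.
# These candidates are stand-ins, not the product's actual algorithms.
from statistics import mean, median

def fit_constant_mean(history):
    m = mean(history)
    return lambda i: m

def fit_constant_median(history):
    md = median(history)
    return lambda i: md

def fit_linear_trend(history):
    # Least-squares line through (index, value) pairs.
    xs = range(len(history))
    x_bar, y_bar = mean(xs), mean(history)
    denom = sum((x - x_bar) ** 2 for x in xs) or 1
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history)) / denom
    return lambda i: y_bar + slope * (i - x_bar)

def best_model(history):
    """Return (name, model) for the candidate with the lowest error on the history."""
    candidates = [fit_constant_mean, fit_constant_median, fit_linear_trend]
    def error(model):
        return mean(abs(model(i) - v) for i, v in enumerate(history))
    fitted = [(f.__name__, f(history)) for f in candidates]
    return min(fitted, key=lambda pair: error(pair[1]))

if __name__ == "__main__":
    name, model = best_model([10, 11, 12, 14, 15, 17, 18])
    print(name, round(model(7), 2))   # next-step estimate from the winning model
```

For a steadily climbing metric the trend model wins; for a noisy but flat one, a constant baseline does. Letting the data pick the model is what frees you from hand-tuned static thresholds.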

With vCOps I can now designate specific metrics as KPIs for additional granularity. For example, the tool would learn that it is normal for response times to be in the 2 to 4 second range on Monday mornings, but if response time falls outside that normal range, above or below, I can now have a KPI Smart Alert generated.

Another thing I can use is the Early Warning Smart Alert, which detects change in the environment when too many anomalies occur, such as when too many metrics are outside their normal operating range. I can use the various dashboards and detail screens to view the metrics over time, so that instead of wondering whether the issue is the result of a capacity trend or something changing or breaking, I can look and quickly see, “Oh, there’s the problem. Something happened at 1:15 on system X that caused this service to really slow down.”
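
The idea behind that kind of early warning is simple enough to sketch: count how many of a system’s metrics currently sit outside their learned normal ranges and raise a warning when the share gets unusually high. The ranges, readings, and 30 percent trigger below are hypothetical.

```python
# Hypothetical sketch of an anomaly-count early warning: flag the system when
# an unusually large share of its metrics are outside their normal ranges.
def count_anomalies(current_values, normal_ranges):
    """current_values: {metric: value}; normal_ranges: {metric: (low, high)}."""
    out_of_range = 0
    for metric, value in current_values.items():
        low, high = normal_ranges.get(metric, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            out_of_range += 1
    return out_of_range

def early_warning(current_values, normal_ranges, anomaly_fraction=0.3):
    """Warn when more than `anomaly_fraction` of the metrics are abnormal."""
    abnormal = count_anomalies(current_values, normal_ranges)
    return abnormal / max(len(current_values), 1) > anomaly_fraction

if __name__ == "__main__":
    ranges = {"cpu_ready_pct": (0, 5), "disk_latency_ms": (0, 20), "resp_time_s": (1, 3)}
    now = {"cpu_ready_pct": 9.0, "disk_latency_ms": 45.0, "resp_time_s": 2.4}
    print("early warning" if early_warning(now, ranges) else "normal")
```

No single metric has to cross a hard limit; it is the pile-up of small abnormalities that signals something has changed.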

Now, after more than 20 years in IT, I can finally start to use the multitude of metrics that have been there just waiting to be leveraged.

To get the most out of monitoring tools, consider using vCOps’ range of capabilities, including:

  • The ability to track KPIs within the infrastructure, such as Disk I/O or CPU Ready, or to leverage the vSphere UI, so that you know whether your infrastructure has additional capacity.
  • Various KPI Super Metrics within the application stack (e.g. cache hit rate or available memory) that alert you when things are outside of a normal range.
  • The power to see exactly how an environment is performing on a given day, and the ability to isolate which component is the source of the issue.
  • The means to track and report the relative health of not only your components, but your services as well, without having to view everything as up or down at the component level and guess if the application or service is OK.

And it’s all possible because we can now actually use the lowly metric.

For future updates, follow @VMwareCloudOps on Twitter and join the conversation using the #CloudOps and #SDDC hashtags.