vRealize Operations

Deep Dive into the vRealize Operations Capacity Management 


To serve your customers' virtual machine needs reliably, it is vital to ensure the continued usability and optimal performance of your infrastructure. With the release of vRealize Operations 7.0, capacity management and planning have been improved to help you proactively address resource shortfalls. In this deep dive into vRealize Operations Capacity Management, we will examine the key areas to help you better understand how the capacity assessment process works.

Datacenter Capacity Overview

 

To reach the Capacity Overview shown above, select Home and then select Assess Capacity.

The datacenters with the least available resources will be displayed at the top by default.  You can scroll through the datacenters to find the specific datacenter you want to examine or use the filter to search for it.

Selecting ALL DATACENTERS will display all the datacenters and sort them by time remaining.  The sorting can be changed from Time Remaining to Cost Savings or Optimized.  Additionally, the datacenters can be grouped by Criticality or vCenter.

In a perfect environment, there are several things we would see on the Overview dashboard:

  1. All datacenters will be optimized and colored green.

Perfect Datacenter

  2. All clusters will have sufficient capacity, as defined by the thresholds you set in the policies, and will show a status of ‘Normal’.

All clusters have sufficient capacity

  3. There will be zero cost savings and no resources to reclaim.

No Optimization Recommendations

A perfect environment is very difficult to maintain, so in most cases we will have datacenters that are not optimized, colored red, or both.

The “optimized” status and the coloring are independent. Coloring is based on the available Time Remaining, with the color thresholds defined in the policy. By default, a datacenter is colored green if the Time Remaining is greater than 120 days. Optimized is a separate measurement, indicating that the workloads are optimized according to the settings in the policy. A datacenter can be labeled “not optimized” and colored green at the same time, or be labeled “optimized” but colored red. A datacenter may also be colored gray, which indicates “unknown” because capacity data is not available.
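To make the independence of the two indicators concrete, here is a minimal Python sketch. It is only an illustration, not how vRealize Operations is implemented: the function name and parameters are hypothetical, the 120-day default is the only threshold taken from the text, and everything at or below it is reported simply as "not green" because the remaining thresholds come from your policy.

```python
from typing import Optional, Tuple


def describe_datacenter(time_remaining_days: Optional[float],
                        is_optimized: bool,
                        green_above_days: float = 120.0) -> Tuple[str, str]:
    """Return (color, label) for a datacenter tile.

    Hypothetical helper: color depends only on Time Remaining,
    the label depends only on the optimization state.
    """
    if time_remaining_days is None:
        # No capacity data available -> "unknown"
        color = "gray"
    elif time_remaining_days > green_above_days:
        color = "green"
    else:
        # Exact color below the green threshold is policy-defined
        color = "not green"

    label = "optimized" if is_optimized else "not optimized"
    return color, label


# A datacenter can be green and still not optimized.
print(describe_datacenter(200, is_optimized=False))
# Or optimized while having little time remaining.
print(describe_datacenter(0, is_optimized=True))
```

Because the two values are computed from unrelated inputs, every combination of color and label is possible.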

To start, we click on a datacenter that needs to be addressed.

Datacenter Not Optimized

In this example, the datacenter is not optimized and has no time remaining, so we must first evaluate it to see what actions to take. Under Optimization Recommendations, we click View Optimization.

View Optimization

This will open the Workload Optimization dashboard where you can edit the Utilization Objective and apply the Optimization Recommendation.

The next thing we must understand is why the datacenter is indicating low capacity and for which resource (for example CPU, Memory, or Storage). The color red identifies clusters that need critical attention, meaning utilization has reached or is nearing maximum capacity and the time remaining is at a critical level, as defined by the thresholds you set in the policies. We need to examine which clusters are impacted and which resources are constrained.

Under Time Remaining, we see the number of clusters in the datacenter and their status: critical, medium, normal, or unknown. Critical, medium, and normal are based on thresholds you set in the policies, while an unknown status means information cannot be obtained, for example because the cluster is offline.

Cluster Status

The Critical level can indicate a resource contention, imbalance, or other alarming condition. Thresholds you set in the policies define what is critical.

Next, we look at the Cluster Utilization.  We need to analyze which resources are being constrained and causing the cluster to be at the Critical level.

The resources considered for Cluster Utilization are CPU, Memory, and Disk Space.  You can click the information icon in Cluster Utilization  to view a brief description of these resources and what they are measuring:

Cluster Utilization Info

Going back to the ideal environment, if all the clusters have sufficient capacity, you will see the Cluster Utilization with green check marks across these different resources.

Capacity Metrics

This indicates that all the resources have sufficient capacity beyond the defined Critical Threshold.

If the cluster has reached the Critical level, one or more of these resources will indicate a red dot.

Red Capacity Metrics

Whichever resource shows as critical, the Time Remaining graph behaves the same way; only the resource being plotted differs. By default, the Cluster Utilization is sorted by the most constrained resource.

Sort by Most Constrained

Let’s examine an example where the Cluster Utilization shows that the most constrained resource is Memory.

Memory Constraint Graph 1 Month

The above graph shows that Memory will run out in 2 days. Let’s examine how that is determined.

We need to understand what is defined as the Critical Threshold.  We select Link to Cluster Time Remaining Settings to view the Cluster Time Remaining Settings.

Cluster Time Remaining Settings

Now we know the time remaining for the Critical and Warning Thresholds.

Going back to the above graph, we see that Usable Memory is approximately 100M. We also see that memory utilization has some peaks and valleys, which is expected. If we hover over a point on the graph, we obtain more details about that specific data point.

Memory Hover Point

The Memory Usable Capacity, measured in KB, is the amount of memory usable by Virtual Machines after accounting for reservations for vSphere High Availability (HA) and other vSphere services. The Memory Usable Capacity is calculated as:

If (HA not enabled) => Memory|Total Capacity ELSE => Sum([HostSystem]Memory|Capacity Available to VMs) * (100 – Cluster Configuration|DAS Configuration|HA Memory Failover Percent) / 100 + Sum([HostSystem]Memory|ESX System Usage)

The Memory Total Capacity is the amount of physical memory configured on descendant ESXi hosts and is calculated as the Sum([HostSystem]Memory|Total Capacity).

The Memory Utilization, measured in KB, is the memory utilization level based on the descendant Virtual Machines’ utilization. This includes the reservations, limits, and overhead needed to run the Virtual Machines. The Memory Utilization is calculated as the Sum([HostSystem]Memory|Utilization).
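The three formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the `Host` class, its field names, and the sample numbers are hypothetical stand-ins for the per-host metrics named in the formulas, and "HA not enabled" is modeled here by passing `None` for the failover percent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Host:
    # Hypothetical container for the per-host metrics, all in KB
    total_capacity_kb: float     # Memory|Total Capacity
    available_to_vms_kb: float   # Memory|Capacity Available to VMs
    esx_system_usage_kb: float   # Memory|ESX System Usage
    utilization_kb: float        # Memory|Utilization


def total_capacity(hosts: List[Host]) -> float:
    """Sum of physical memory across the cluster's hosts."""
    return sum(h.total_capacity_kb for h in hosts)


def utilization(hosts: List[Host]) -> float:
    """Sum of per-host memory utilization."""
    return sum(h.utilization_kb for h in hosts)


def usable_capacity(hosts: List[Host],
                    ha_failover_percent: Optional[float]) -> float:
    """Usable capacity, following the formula in the text literally.

    ha_failover_percent=None means HA is not enabled (an assumption
    of this sketch, not a product convention).
    """
    if ha_failover_percent is None:
        return total_capacity(hosts)
    available = sum(h.available_to_vms_kb for h in hosts)
    system_usage = sum(h.esx_system_usage_kb for h in hosts)
    return available * (100 - ha_failover_percent) / 100 + system_usage


# Two identical hosts with made-up values
hosts = [Host(1000, 900, 100, 500), Host(1000, 900, 100, 500)]
print(usable_capacity(hosts, None))   # HA disabled -> 2000
print(usable_capacity(hosts, 25))     # 1800 * 75 / 100 + 200 -> 1550.0
print(utilization(hosts))             # 1000
```

With these sample values, reserving 25% for HA failover lowers the usable capacity from 2000 KB to 1550 KB, which is why the Usable Capacity line on the graph can sit well below the physical total.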

The projection is calculated from the historical data, where we see the memory utilization varying through a wide range of values. We even see some extreme data points where memory usage has dropped and memory is not heavily used. With vRealize Operations 7.0, we continue to build on the revamped capacity engine from the previous release and now use an exponential decay function to give more relevance to changing patterns and to react better to recent spikes without losing periodicity. This improvement includes giving more weight to recent data points, which improves accuracy and permits real-time self-learning with instant calculations.

If we zoom in on the Time Remaining graph, we see closer details to the date when the memory will run out.

Memory Constraint Graph Zoomed

The vertical line labeled Now, which represents the current day, is also more prominent. It provides a quick reference point and a clear view of when the memory will run out. If we hover over a data point that falls between Now and the point where Memory Runs Out, we see more details.

Memory Hover between now and runs out

We continue to see the Usable Capacity, but now we also see the projected information.

Projections are based on calculations from the historical data but, as mentioned, vRealize Operations 7.0 applies a higher weight to more recent data, which makes the engine more forward looking and better able to predict incoming data. The capacity engine used until vRealize Operations 6.6.x (inclusive) was backward looking.

The capacity model now uses an “exponential decay” model, also referred to as ‘online’, meaning that once new data comes in, it is processed immediately and the capacity output metrics such as Time Remaining, Capacity Remaining, and Recommended Size are updated. Previously, the capacity model was an ‘open window’, which meant it used the past one month of data for calculating the capacity output metrics.

The main difference between vRealize Operations 6.7 and 7.0 is as follows:

  • vRealize Operations 6.7 uses an “open-window” model, meaning that in capacity computations all processed historical data points have equal weight.
  • vRealize Operations 7.0 uses an “exponential decay” model, meaning each newer data point has a higher weight than the one before it. This makes the forecast adapt to utilization changes much faster.
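The effect of the two weighting schemes can be illustrated with a small Python sketch. This is not the actual vRealize Operations algorithm, which is not described here in detail; the function names, the decay factor of 0.5, and the sample values are all assumptions chosen to show why decayed weights follow a recent change faster than equal weights.

```python
from typing import Sequence


def equal_weight_mean(values: Sequence[float]) -> float:
    """Every data point counts the same ("open window" style)."""
    return sum(values) / len(values)


def exponential_decay_mean(values: Sequence[float],
                           decay: float = 0.5) -> float:
    """Weighted mean where each older point counts `decay` times less.

    `values` is ordered oldest first; the newest point has weight 1,
    the one before it `decay`, then `decay**2`, and so on.
    The decay factor is an illustrative assumption.
    """
    n = len(values)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    weighted_sum = sum(w * v for w, v in zip(weights, values))
    return weighted_sum / sum(weights)


# Utilization was steady at 10, then jumped to 40 for the last two samples.
samples = [10, 10, 10, 10, 40, 40]
print(equal_weight_mean(samples))        # 20.0
print(exponential_decay_mean(samples))   # about 32.9
```

With equal weights the estimate is still dominated by the old level, while the decayed estimate has already moved most of the way toward the new level of 40.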

The Projected Utilization Range is based on the mean value of the latest 6 months of historical data and uses a radius of 90% to derive the upper and lower limits.

The Projected Utilization is defined as the upper limit of the Projected Utilization Range if the risk level is set to conservative in the policy. With a mean value derived from the historical data, it is better to use the maximum to account for the worst possible case and allow for better planning. Taking the maximum projected value helps prevent a shortage of required resources.
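As a rough illustration of how a conservative projection turns into a Time Remaining value, the following Python sketch finds the first day on which the upper limit of the projected range reaches the usable capacity. The function name and the sample numbers are hypothetical, and the sketch takes the upper-limit series as given rather than deriving it, since the exact range calculation is internal to the capacity engine.

```python
from typing import Optional, Sequence


def days_until_exhaustion(projected_upper: Sequence[float],
                          usable_capacity: float) -> Optional[int]:
    """Return the first day index (0 = Now) on which the conservative
    projection reaches usable capacity, or None if it never does
    within the projected horizon.
    """
    for day, projected in enumerate(projected_upper):
        if projected >= usable_capacity:
            return day
    return None


# Made-up daily upper-limit projections starting at Now
upper_limit = [90, 95, 101, 110]
print(days_until_exhaustion(upper_limit, usable_capacity=100))  # 2
```

Using the upper limit rather than the mean makes the crossing point arrive earlier, which is the conservative behavior described above.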

Hovering over a point beyond when the Memory Runs Out, the same metrics will be displayed.

Memory Hover when memory runs out

Utilization shows how many resources the workload requires to run well and depends on the workload only. The capacity engine uses this data.

You must act on the recommendations to increase capacity, reclaim resources, or right-size workloads.

The analytics engine examines all the data points and disregards anomalous usage.

When viewing the forecast, it may be necessary to change the length of the history shown in the Time Remaining graph to see all the historical data being used for projection calculations.

Show History for 6 months

In some cases, a drastic change in the earlier history may make the projected values look wrong if you are not viewing all of the actual data.

Memory Constraint Graph 6 Months

In the above graph, we now see another vertical line, which shows the projection calculation start point. Display enough months of historical data to see all the data used in the projection calculation, so that you can better understand the projected values.

 

Summary:

As you can see, there are many components and options available in vRealize Operations Capacity Management. A better understanding of the capacity metrics and how they are displayed can help you proactively maintain the resources needed to keep your environment healthy.