A couple of years ago, I wrote a blog post about rightsizing VMs with VMware Aria Operations, formerly known as vRealize Operations. That blog post has been very popular, but new features have been released since then, improving rightsizing even more. This blog post is a refresh of that blog post to include all the awesome capabilities related to rightsizing that are available as of January 2023 with vRealize Operations Cloud and vRealize Operations 8.10.1.
Rightsizing VMs is critical to get the best performance of your vSphere infrastructure and your VMs. Rampant oversizing of VMs can cause contention at the host or cluster level, which manifests as CPU ready, CPU co-stop, VM swap, etc. Undersized VMs can cause contention inside the guest OS, which manifests as CPU queuing, memory paging, etc. Rightsizing VMs helps you achieve the best performance of the infrastructure and VMs. In this blog post, I’d like to show you why I feel that VMware Aria Operations is the best tool available for VM rightsizing.
The first thing I’d like to cover is the difference between rightsizing and reclamation. Rightsizing is when you change the resources allocated to a VM to match the utilization requirements of the VM. Rightsizing VMs covers both oversized and undersized VMs. For example, adding a vCPU if the VM is running high CPU utilization or removing memory if the server is not using all its allocated memory. Reclamation, on the other hand, covers changes that free up physical capacity such as deleting powered off or idle VMs, old snapshots, or orphaned disks. The main difference is that rightsizing is done primarily for performance reasons and reclamation is done primarily for capacity reasons. This blog post is dedicated to rightsizing oversized and undersized VMs using VMware Aria Operations.
To start on your rightsizing adventure, go to the Rightsizing page once you log in to VMware Aria Operations.
Once you’re on the Rightsizing page, you will be presented with a summary of Oversized and Undersized VMs. By expanding the Oversized VMs section at the bottom, you can see all the VMs that have been identified as oversized. You can select one or more VMs and resize them without leaving VMware Aria Operations by clicking on the Resize VM(s) button or the schedule button. By initiating the resize action from here, VMware Aria Operations automatically uses its connection to vCenter to make the changes to the VM. It’s even aware of hot-add and will skip a reboot of the VM if hot-add is enabled.
If you see a VM that you want to leave oversized and you don’t want to be notified about it anymore, just select the VM and click Exclude VM(s). If you have a lot of VMs that you wish to exclude, you can use the Filter box to reduce the list of VMs shown (e.g., VMs containing “xyz” in their name), then use the select all button to exclude VMs in bulk. You can always include the VM again by expanding Show Excluded VMs at the bottom of the page.
By clicking on the Undersized VMs section at the bottom, you can see all the VMs that have been identified as undersized. This page works the same as the Oversized VM section. Click the Resize VM(s) button or schedule button to resize the VMs in vCenter and Exclude VM(s) to remove them from the list.
Capacity Projections for VMs
The capacity engine built into VMware Aria Operations leverages AI/ML technologies to create forward-looking projections of the future utilization of VMs. It’s those projections that are used to determine the Recommended Size of VMs. The recommendations are not simply based on the historical utilization of the VM. Details on how recommended size is projected and the configuration settings are covered later in this blog. When reviewing capacity projections for VMs, I recommend looking at 6 months or more of historical data as there could be some historical demand or usable capacity changes that impact the projections.
For VMs, there are three different projections generated: Time Remaining, Capacity Remaining, and Recommended Size. You can view these projections on the capacity tab of any virtual machine object. The capacity projections for VMs are based on demand, which is what usage would have been if there was no contention. Specifically, the metrics that represent demand are:
CPU|Demand (MHz): This represents what CPU usage would have been if there was no contention like ready, co-stop, or hyper-threading.
Memory|Utilization (KB): Memory usage as seen from inside the guest OS, which also accounts for guest memory swapping/paging. I have a separate blog post on Enhancements to Virtual Machine Memory Metrics in vRealize Operations which goes into more depth on the various memory metrics.
Time Remaining is the projected number of days until the VM will run out of capacity. Time Remaining is an important concept for rightsizing because you don’t want to get to a situation where you have 0 days of capacity remaining.
Capacity Remaining is the minimum capacity left projected three days in the future. It’s important to understand that Capacity Remaining isn’t just free resources, it’s based on a projection 3 days in the future.
Recommended Size is the size recommended by the capacity projections. This is the value that is converted to the vCPUs and GB of memory to change on the VM, as shown on the Rightsizing page that I mentioned in the previous section. What sets the rightsizing recommendations from Aria Operations apart from other solutions is that they’re based on forward-looking projections. This means they account for projected growth, unlike other solutions that only look at historical usage. The methods used to project the recommended size for VMs will be covered later in this blog post.
What’s important to understand with rightsizing recommendations is how it relates to the capacity projections and that you can view those capacity projections on the capacity tab of the VM. If you are looking for evidence to justify rightsizing recommendations, the capacity tab of the VM is a great place to start.
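To make Time Remaining and Capacity Remaining a little more concrete, here is a minimal Python sketch. It assumes the projection is available as a plain list of daily projected demand values (index 0 is today) and that the three-day window includes today; the function names and sample numbers are made up for illustration and are not how the capacity engine is implemented.

```python
from typing import Optional, Sequence


def time_remaining_days(projected_demand: Sequence[float],
                        usable_capacity: float) -> Optional[int]:
    """Days until projected demand first exceeds usable capacity.

    Returns None if demand never exceeds capacity within the projection.
    """
    for day, demand in enumerate(projected_demand):
        if demand > usable_capacity:
            return day
    return None


def capacity_remaining(projected_demand: Sequence[float],
                       usable_capacity: float,
                       horizon_days: int = 3) -> float:
    """Minimum headroom between today and `horizon_days` out, floored at zero."""
    window = projected_demand[: horizon_days + 1]
    return max(0.0, float(usable_capacity) - max(window))


# Hypothetical daily CPU demand projection (MHz) for a VM with 4000 MHz usable capacity.
projection = [2500, 2600, 2800, 3100, 3500, 3900, 4200, 4600]
print(time_remaining_days(projection, 4000))  # 6
print(capacity_remaining(projection, 4000))   # 900.0
```

Note how Capacity Remaining (900 MHz) is lower than today's free capacity (1500 MHz) because it looks at the projected peak over the next three days rather than at the current value.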
VM Rightsizing Details Dashboard
While the Rightsizing page is a great place to view oversized and undersized VMs and the capacity tab is a great place to see capacity projections, there are additional questions related to rightsizing that need to be answered. If you read my previous blog post, this dashboard will look familiar. I’ve worked with numerous customers and developed a dashboard that answers most rightsizing questions. One of my main goals with the dashboard is to provide enough evidence that you can present to a VM’s owner to get approval to rightsize the VM. The dashboard has been available out of the box for several releases now, so it no longer requires any setup to use it.
I made the VM Rightsizing Details dashboard a one-stop shop for rightsizing. There is a lot of information on the VM Rightsizing Details dashboard, but the most notable are the reclaimable capacity and potential cost impact due to rightsizing. There are nuances to reclaimable capacity due to rightsizing that can be counterintuitive because it’s the amount of physical resources freed due to rightsizing, not the change in resources allocated to the VM. Reclaimable capacity is what drives the potential cost impact of rightsizing. Check out the Reclaimable Capacity and Potential Cost Impact sections below to learn more about how they’re determined.
If you prefer reports, there are two out of the box reports called Optimization Report – Oversized Virtual Machines and Optimization Report – Undersized Virtual Machines related to rightsizing. These reports can generate CSV or PDF and reports can be run on demand or scheduled for delivery as an email. There is also the option to create custom reports if you’re looking for other information not included in the out of the box reports.
Now that we know which VMs need to be rightsized, it’s time to take some action on the recommendations. There are several options that range from manual to self-service to fully automated, and if none of those suit your needs, there is a fully customizable option too. I recommend reviewing each of these options to determine which option fits within your business requirements and policies.
The Rightsizing page is where you can manually initiate rightsizing. All you need to do is select one or more VMs and click Resize VM(s). You have a chance to accept the recommendations or modify them as you see fit. The Resize VM(s) action performed from the Rightsizing page is performed directly in vCenter.
The other option on the Rightsizing page is to select Schedule Action, which as the name suggests, allows you to schedule rightsizing to occur at a future time. The Schedule Action performed from the Rightsizing page is performed directly in vCenter.
Automation Central is the option I prefer because it allows full automation of rightsizing. With Automation Central, you can schedule many different actions, which include rightsizing VMs. When you create a new job, you are presented with a wizard that has 3 steps.
Step 1 defines the action for the job like downsize oversized VMs or scale-up undersized VMs. You can select whether you want to rightsize CPU, memory, or both as well as specify whether you want to use the default recommendation or be more conservative by using 50% of the default recommendation.
Step 2 is where you define the scope. You can select which datacenters or clusters to target and define criteria that the VMs must meet. There are many options for criteria covering metrics, relationships, properties, object names, and tags.
Step 3 is where you define the schedule for the job, which includes options like start date and time, recurrence (one-time, daily, weekly, or monthly) and whether you would like to receive email notifications.
If you have VMware Aria Automation (vRealize Automation), you can empower VM owners to rightsize their own VMs via the Service Catalog. To rightsize VMs within VMware Aria Automation, open the deployment in the Service Catalog, and look for the Optimization tab. The Optimization tab shows you the reclamation and rightsizing recommendations from VMware Aria Operations. The user can initiate rightsizing right from the Service Catalog without needing to interact with another team.
Bring Your Own
Lastly, if you have a more advanced situation that’s not covered by the Rightsizing page, Automation Central, or vRealize Automation, you can write your own rightsizing actions. The easiest option is to create an alert based on Summary|Is Oversized == 1 or Summary|Is Undersized == 1. Now that you have an alert, you have to decide what you want to call when a VM is identified as oversized or undersized.
If you use VMware Aria Automation Orchestrator (vRealize Orchestrator), you can leverage the Management Pack for vRealize Orchestrator to call a workflow when the alert fires. VMware Aria Automation Orchestrator is a very powerful orchestration engine that allows you to write pretty much anything you can imagine.
The other option is to create an alert notification that calls a webhook. Webhooks are a generic way to call many different first and third-party systems. Once you have a system that can receive the webhook, you can run any action that the receiving system is capable of.
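As a rough idea of what the receiving side of a webhook could look like, here is a minimal sketch using only the Python standard library. The payload field names (`status`, `alertName`, `resourceName`) and the action names are hypothetical; the actual fields depend on the payload template you configure for the webhook, so treat this as a starting point rather than a working integration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def choose_action(payload: dict) -> str:
    """Map an alert payload to an action name (all field names are hypothetical)."""
    if payload.get("status") != "ACTIVE":
        return "ignore"
    alert_name = payload.get("alertName", "")
    if "Oversized" in alert_name:
        return "open_downsize_ticket"
    if "Undersized" in alert_name:
        return "open_upsize_ticket"
    return "ignore"


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            payload = None
        if not isinstance(payload, dict):
            self.send_response(400)  # reject anything that is not a JSON object
            self.end_headers()
            return
        action = choose_action(payload)
        # Replace this print with a call into your ticketing or automation system.
        print(f"{payload.get('resourceName', 'unknown VM')}: {action}")
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the receiver; this call blocks until the process is stopped."""
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the listener. Keeping the decision logic in `choose_action` separate from the HTTP handler makes it easy to test without sending real requests.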
Configuration Settings for Rightsizing
Now that I’ve shown how rightsizing works in the UI and how you can act on the rightsizing recommendations, I’d like to explain how VMware Aria Operations projects the recommended size for VMs and what settings you can configure. The capacity engine built into VMware Aria Operations leverages AI/ML technologies to create forward-looking projections of the future utilization of VMs. It’s those projections that are used to determine the Recommended Size of VMs. The recommendations are not simply based on the historical utilization of the VM.
The settings that control rightsizing are available in policies. Some of the policy settings affect all object types, so if you want different settings for VMs than other objects, like clusters, I suggest making a separate policy for your VMs. More details on configuring policies are available here.
Time Remaining Criticality Threshold
Time Remaining is the number of days until the VM’s demand is projected to exceed the total capacity of the VM. Time Remaining Criticality Threshold controls when the system will flag Time Remaining as being in warning, immediate, or critical to give you a proactive warning before the VM runs out of capacity. The default is 30 days for yellow (warning) and 10 days for red (critical).
Time Remaining Criticality Threshold is also used to control how much of the projected demand to consider when projecting the Recommended Size. Recommended Size is determined by considering the peak projected demand between now and 30 days beyond the Time Remaining Criticality Threshold. Since the default warning threshold is 30 days, Recommended Size by default looks 60 days into the future (30 days from Time Remaining Criticality Threshold + 30 days).
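The arithmetic behind that window can be sketched in a few lines of Python. This assumes the projection is a list of daily demand values with index 0 as today; the names and the sample projection are illustrative only.

```python
from typing import Sequence

BUFFER_DAYS = 30  # the fixed "+ 30 days" described above


def recommended_size_horizon(warning_threshold_days: int = 30) -> int:
    """Number of days of projection considered for Recommended Size."""
    return warning_threshold_days + BUFFER_DAYS


def recommended_size(projected_demand: Sequence[float],
                     warning_threshold_days: int = 30) -> float:
    """Peak projected demand between today (index 0) and the end of the horizon."""
    horizon = recommended_size_horizon(warning_threshold_days)
    return max(projected_demand[: horizon + 1])


# Hypothetical projection: demand grows 10 MHz per day starting from 2000 MHz.
projection = [2000 + 10 * day for day in range(120)]
print(recommended_size_horizon())    # 60
print(recommended_size(projection))  # 2600
```

Raising the warning threshold widens the window, so a growing workload gets a larger Recommended Size.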
To determine the appropriate values for Time Remaining Criticality Threshold, consider how often you plan to rightsize your VMs and set it to a value that gives you enough lead time to rightsize the VMs.
Time Remaining Risk Level
The projections from the AI/ML capacity engine include a projection range which has an upper bound and lower bound. Within the projection range, you’ll see a solid line that represents the projection. By selecting Conservative, Time Remaining will be determined by looking at the intersection of the upper bound of the projection range and the Usable Capacity of the VM.
To project the Recommended Size with the Conservative setting, the upper bound of the projection range is used in conjunction with the Time Remaining Criticality Threshold + 30 days (as mentioned in the previous section). The peak of the projection within that time range becomes the Recommended Size for the VM.
Conservative is the default and is recommended for critical environments such as production or business-critical applications.
In addition to setting the time remaining risk level, there is a setting to control the conservativeness strength level when the time remaining risk level is set to conservative. The conservativeness strength level can be configured from levels 1 to 5, where level 1 is the least conservative and level 5 is the most conservative. Conservativeness strength level 3 is the default, which behaves the same as the conservative risk level in previous releases. What you will notice is that changes to conservativeness strength level change the size of the projection range.
If you feel that the capacity projections don’t account for enough of the peaks in the historical demand, you can increase the conservativeness level to 4 or 5. For less critical environments, such as development or test, you can lower the conservativeness level to 1 or 2 where more risk can be tolerated.
By selecting Aggressive, Time Remaining will be determined by looking at the intersection of the mean of the upper and lower bounds of the projection range and the Usable Capacity of the VM.
To project the Recommended Size with the Aggressive setting, the mean of the upper and lower bounds of the projection range is used in conjunction with the Time Remaining Criticality Threshold + 30 days (as mentioned in the previous section). The peak of the projection within that time range becomes the Recommended Size for the VM.
Aggressive is not the default, but it can be enabled for less critical environments such as development, UAT, or test.
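The difference between the two risk levels can be sketched as follows. The upper and lower bounds are assumed to be lists of daily values from the projection range, and the numbers are invented. The sketch does not model the conservativeness strength level, which changes the width of the range itself.

```python
from typing import List, Sequence


def projection_line(upper: Sequence[float], lower: Sequence[float],
                    risk_level: str) -> List[float]:
    """Pick the line used for Time Remaining and Recommended Size."""
    if len(upper) != len(lower):
        raise ValueError("upper and lower bounds must have the same length")
    if risk_level == "conservative":
        return [float(u) for u in upper]  # upper bound of the range
    if risk_level == "aggressive":
        return [(u + l) / 2 for u, l in zip(upper, lower)]  # mean of the bounds
    raise ValueError(f"unknown risk level: {risk_level}")


def recommended_size(upper: Sequence[float], lower: Sequence[float],
                     risk_level: str, horizon_days: int = 60) -> float:
    """Peak of the selected line between today and the end of the horizon."""
    line = projection_line(upper, lower, risk_level)
    return max(line[: horizon_days + 1])


# Hypothetical projection range in MHz for the next three days.
upper = [3000.0, 3200.0, 3400.0]
lower = [2000.0, 2200.0, 2400.0]
print(recommended_size(upper, lower, "conservative"))  # 3400.0
print(recommended_size(upper, lower, "aggressive"))    # 2900.0
```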
Enabling peak focused mode causes the capacity engine to create projections using the peaks that have been identified in the historical demand. Peak focused mode is supported with both Conservative and Aggressive risk levels. Peak focused mode works best for workloads that have sporadic peaks.
Some customers prefer to do capacity management for some workloads based on a portion of the day and to effectively ignore the rest of the day. An example of this would be to concentrate on the hours of the day the business is open and ignore after-hours when backups, patching or AV scans typically run. In VMware Aria Operations, this is called business hours. You can configure it to be 8 AM – 5 PM, Monday through Friday, which will only use demand during those business hours to create the projections.
If you need different business hours for different VMs, you can create multiple policies with different business hours per policy. You have the option to use policies to apply different business hours to VMs for rightsizing versus clusters for capacity management. Changes to business hours will cause the projection, including rightsizing recommendations, to be updated to reflect the newly configured business hours.
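Conceptually, business hours act as a filter on which demand samples feed the projection. The sketch below shows that idea with timestamped samples. It assumes the 5 PM boundary is exclusive and that timestamps are already in the desired time zone; both are simplifications for illustration.

```python
from datetime import datetime, time
from typing import Iterable, List, Tuple

Sample = Tuple[datetime, float]  # (timestamp, demand)


def in_business_hours(ts: datetime,
                      start: time = time(8, 0),
                      end: time = time(17, 0),
                      weekdays: Iterable[int] = range(0, 5)) -> bool:
    """True for Monday-Friday (weekday 0-4) between start and end."""
    return ts.weekday() in weekdays and start <= ts.time() < end


def business_hours_samples(samples: Iterable[Sample]) -> List[Sample]:
    """Keep only the samples that fall inside business hours."""
    return [s for s in samples if in_business_hours(s[0])]


samples = [
    (datetime(2023, 1, 9, 2, 0), 3900.0),    # Monday 2 AM, backup window
    (datetime(2023, 1, 9, 10, 0), 1800.0),   # Monday 10 AM
    (datetime(2023, 1, 14, 10, 0), 3500.0),  # Saturday 10 AM, patching
]
print(business_hours_samples(samples))  # only the Monday 10 AM sample remains
```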
Recommended Size Calculations
Recommended Size Limits
Customers often aren’t keen on making substantial changes to VMs and are looking for a more conservative approach. Recommended Size has been designed to be conservative in its recommendations as well.
The recommended reduction for an oversized VM is capped at 50% of the current configuration, while the recommended increase for an undersized VM is capped at 100% of the current configuration. This helps to gradually guide VMs to the Recommended Size without recommending substantial changes like 32 vCPUs down to 1 vCPU.
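Reading the limits as "never recommend removing more than half, never recommend more than doubling", the cap can be sketched as a simple clamp. This is an interpretation of the limits described above, not the product's implementation.

```python
def limit_recommendation(current: float, projected: float) -> float:
    """Clamp a projected size to the Recommended Size limits."""
    floor = current * 0.5    # at most a 50% reduction for oversized VMs
    ceiling = current * 2.0  # at most a 100% increase for undersized VMs
    return min(max(projected, floor), ceiling)


print(limit_recommendation(32, 1))  # 16.0 (not 1)
print(limit_recommendation(4, 20))  # 8.0 (not 20)
print(limit_recommendation(8, 6))   # 6 (within limits, unchanged)
```

A VM that is heavily oversized converges on its ideal size over several rightsizing cycles, for example 32 to 16 to 8 vCPUs, instead of in one large step.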
Projection Calculation Start Point shows how much data is used to create the projection. It’s possible to reset the projection calculation start point. Please see the Reset Capacity Projections section below for more details.
Exponential Decay gives higher weight to recent data points to allow the projection to react more quickly to recent changes in utilization.
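To show the general idea of exponential decay (and only the general idea, since the capacity engine's actual weighting is not published here), the sketch below weights each data point by its age using a hypothetical half-life parameter.

```python
from typing import List, Sequence


def decay_weights(n_points: int, half_life: float) -> List[float]:
    """Weights for points ordered oldest to newest; the newest point gets 1.0.

    A point that is `half_life` steps old gets a weight of 0.5.
    """
    return [0.5 ** ((n_points - 1 - i) / half_life) for i in range(n_points)]


def weighted_mean(values: Sequence[float], half_life: float) -> float:
    """Mean of `values` where recent points count more than old ones."""
    weights = decay_weights(len(values), half_life)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# 30 days at 1000 MHz followed by 30 days at 2000 MHz.
demand = [1000.0] * 30 + [2000.0] * 30
print(sum(demand) / len(demand))             # 1500.0, the plain average
print(weighted_mean(demand, half_life=7.0))  # much closer to 2000
```

The plain average still sits halfway between the old and new levels, while the decayed average has mostly "forgotten" the old level. That is why projections react quickly to recent changes in utilization.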
As you know, most workloads don’t follow a straight line for utilization. There can be various peaks in utilization over time that need to be accounted for in the projections. The impact of a peak on the projection is relative to the peak’s duration, height, and frequency. Remember this is a projection created by the AI/ML powered capacity engine, so there isn’t a specific formula that I can give you to double-check the math. The way I like to explain it is, as a human looking at the historical utilization, ask yourself whether the peak looks significant enough to affect capacity planning and whether there are enough peaks to suggest one or more periodic patterns. If so, you should see the impact of those peaks in the projections. In general, the more important the peaks look, the more impact they have on the projection.
- Momentary peaks that are short-lived and might be one-offs. These are the peaks that you would dismiss for capacity planning purposes because they don’t appear important. In general, small and short-lived peaks should have minimal impact on capacity planning and therefore have minimal impact on the projection.
- Sustained peaks last for a longer time and do impact projections. If the peak is not periodic, the impact on the projection lessens over time due to exponential decay.
- Periodic peaks exhibit cyclical patterns or waves. For example, hourly, daily, weekly, monthly, last day of the month, etc. There can be multiple overlapping cyclical patterns, which will also be detected.
Reset Capacity Projections
Some significant changes in demand or usable capacity for an object may temporarily throw off the projections. If you don’t want to wait for exponential decay to learn the new normal, you can reset capacity projections from any day you choose. Once you reset capacity projections, you will notice that the Projection Calculation Start Point on the capacity charts will reflect the date that you selected. Capacity projections will start from that day forward.
The screenshot below shows the impact of restarting capacity calculations from October 24th, 2022 because there was a significant change in demand on October 23rd, 2022.
Recommended Size Metrics
There are 8 key metrics related to rightsizing VMs to be aware of. You can use these metrics to create custom dashboards, views, and reports.
Capacity Analytics Generated|CPU|Recommended Size (MHz): The recommended amount of CPU Usable Capacity in MHz needed to maintain a green state for the entire time between now and the Green Time Remaining Score Threshold set in the policy + 30 Days.
Capacity Analytics Generated|Memory|Recommended Size (KB): The recommended amount of Memory Usable Capacity in KB needed to maintain a green state for the entire time between now and the Green Time Remaining Score Threshold set in the policy + 30 Days.
Summary|Is Oversized: If a VM is detected to be oversized for at least one resource type (CPU or Memory), the value will be set to 1. Otherwise, the value will be 0.
Summary|Is Undersized: If a VM is detected to be undersized for at least one resource type (CPU or Memory), the value will be set to 1. Otherwise, the value will be 0.
Summary|Oversized|Virtual CPUs: The recommended number of vCPUs to remove from an oversized VM.
Summary|Oversized|Memory (KB): The recommended amount of memory in KB to remove from an oversized VM.
Summary|Undersized|Virtual CPUs: The recommended number of vCPUs to add to an undersized VM.
Summary|Undersized|Memory (KB): The recommended amount of memory in KB to add to an undersized VM.
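If you pull these metrics into a script, turning them into a target configuration is simple arithmetic. The sketch below assumes you already have the latest metric values in a dictionary keyed by the metric names above; how you retrieve them (API, report export, and so on) is outside the scope of this example.

```python
from typing import Dict, Tuple

KB_PER_GB = 1024 * 1024


def target_configuration(current_vcpus: int, current_memory_kb: float,
                         metrics: Dict[str, float]) -> Tuple[int, float]:
    """Return (target vCPUs, target memory in GB) from the rightsizing metrics."""
    vcpus = (current_vcpus
             - int(metrics.get("Summary|Oversized|Virtual CPUs", 0))
             + int(metrics.get("Summary|Undersized|Virtual CPUs", 0)))
    memory_kb = (current_memory_kb
                 - metrics.get("Summary|Oversized|Memory (KB)", 0)
                 + metrics.get("Summary|Undersized|Memory (KB)", 0))
    return vcpus, memory_kb / KB_PER_GB


# Hypothetical VM: 8 vCPUs and 32 GB, oversized by 4 vCPUs and 8 GB.
metrics = {
    "Summary|Is Oversized": 1,
    "Summary|Oversized|Virtual CPUs": 4,
    "Summary|Oversized|Memory (KB)": 8 * KB_PER_GB,
}
print(target_configuration(8, 32 * KB_PER_GB, metrics))  # (4, 24.0)
```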
Capacity Models for Clusters
So far, everything that I’ve talked about applies to the VM level. The remaining topics bring that viewpoint up to the cluster level and higher. When managing capacity at a cluster level, there are two different capacity models that can be used, called demand model and allocation model. The reason why the capacity model for a cluster is important for rightsizing is that quantification of reclaimable capacity and potential cost impact are both dependent on the capacity model used for the parent cluster, not the VM.
Demand Model for Clusters
Demand model looks at what usage would have been if there was no contention. For CPU demand, examples of contention are ready, co-stop, hyperthreading, frequency scaling, etc. For memory demand, an example of contention is paging or swapping, more specifically swap/page in. Demand model for a cluster is always active and looks at the aggregate demand of all VMs in the cluster to determine the cluster’s capacity.
Allocation Model for Clusters
Unlike demand model, allocation model is optional. Enabling allocation model involves defining overcommit ratios such as 8:1 vCPU:Core. With allocation model, the utilization or demand of the workloads is not considered. It looks at the aggregate of resources provisioned or configured to VMs compared to the overcommit ratios defined. If you want to learn more about allocation model, check out this blog.
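As a quick worked example of the allocation model for CPU, the sketch below turns an overcommit ratio into a vCPU budget for a cluster. It deliberately ignores things like HA reservations and capacity buffers, and the cluster numbers are made up.

```python
def allocation_capacity_vcpus(physical_cores: int,
                              vcpus_per_core: float = 8.0) -> float:
    """Total vCPUs that can be allocated at the given vCPU:Core ratio."""
    return physical_cores * vcpus_per_core


def allocation_remaining_vcpus(physical_cores: int, allocated_vcpus: int,
                               vcpus_per_core: float = 8.0) -> float:
    """vCPUs still available before the overcommit ratio is exceeded."""
    return allocation_capacity_vcpus(physical_cores, vcpus_per_core) - allocated_vcpus


# Hypothetical cluster: 64 physical cores at 8:1 with 400 vCPUs already allocated.
print(allocation_capacity_vcpus(64))        # 512.0
print(allocation_remaining_vcpus(64, 400))  # 112.0
```

Because only configured resources count here, removing 4 vCPUs from an oversized VM frees exactly 4 vCPUs of allocation capacity, regardless of how busy the VM is.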
Reclaimable Capacity for Demand Model
I am often asked to help quantify the overall impact on capacity if all VMs are rightsized when using the demand model. Answering that question is covered by the VM Rightsizing Details dashboard mentioned above. The key part to understanding reclaimable capacity, when the parent cluster is using demand model, is that the change to cluster capacity does not always correlate with the change in configured resources for the VM.
These reclaimable capacity metrics apply to VMs running in clusters running with demand model only. You can read more about demand and allocation models in this blog.
Summary|Oversized|Potential Memory (GB): An oversized VM can have reclaimable memory only if consumed memory is greater than the new recommended size of the VM. The reclaimable memory capacity is the difference between consumed memory and recommended size.
Summary|Undersized|Potential CPU Usage (GHz): CPU Usage of an undersized VM is expected to rise to the current CPU Demand after rightsizing. The difference between CPU Demand and CPU Usage is the expected increase in capacity utilized after rightsizing.
Summary|Undersized|Potential Memory (GB): It can be expected for consumed memory to increase by the same amount of memory recommended to add to an undersized VM.
Some of you may have noticed that I didn’t mention CPU for oversized VMs. If an oversized VM’s CPU usage is 100MHz before rightsizing, removing vCPUs won’t change its CPU usage so the VM should have usage of 100MHz after rightsizing. This means there is no reclaimable capacity associated with the overallocation of vCPUs. Reclaimable CPU Usage for oversized VMs will always be 0 MHz regardless of how many vCPUs are recommended to remove.
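The rules above can be written down as a few small functions. This is a sketch of the logic as described in this section, with invented sample values; it is not the code behind the metrics.

```python
def oversized_reclaimable_memory_gb(consumed_gb: float, recommended_gb: float) -> float:
    """Memory is reclaimable only when consumed memory exceeds the recommended size."""
    return max(0.0, consumed_gb - recommended_gb)


def oversized_reclaimable_cpu_ghz() -> float:
    """Removing vCPUs does not change CPU usage, so nothing is reclaimed."""
    return 0.0


def undersized_potential_cpu_ghz(demand_ghz: float, usage_ghz: float) -> float:
    """Usage is expected to rise to demand once the VM is rightsized."""
    return max(0.0, demand_ghz - usage_ghz)


def undersized_potential_memory_gb(memory_to_add_gb: float) -> float:
    """Consumed memory is expected to grow by the amount of memory added."""
    return memory_to_add_gb


# Hypothetical VM with 32 GB configured and a recommended size of 16 GB.
print(oversized_reclaimable_memory_gb(consumed_gb=20.0, recommended_gb=16.0))  # 4.0
print(oversized_reclaimable_memory_gb(consumed_gb=12.0, recommended_gb=16.0))  # 0.0
print(undersized_potential_cpu_ghz(demand_ghz=3.5, usage_ghz=2.5))             # 1.0
```

In the first case, 16 GB is removed from the VM's configuration but only 4 GB of physical memory is freed, which is the counterintuitive part of reclaimable capacity under the demand model.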
Potential Cost Impact
The other question I get frequently is quantifying the potential cost impact of rightsizing. Answering that question is also covered by the VM Rightsizing Details dashboard mentioned above.
Calculating the potential cost can be utilization or allocation based, depending on whether allocation model is enabled for capacity. You can read more about demand and allocation models in this blog.
Demand model can cause some confusion when looking at cost impact because demand model is looking at the physical resources used by the VM, not the resources configured on the VM. What’s important to understand is that the cost impact is determined using the reclaimable capacity mentioned in the previous section and the CPU and memory base rates for the parent cluster or host.
Allocation model is what most customers tend to think of when looking at the cost impact of rightsizing. Since allocation model is based on the resources configured on a VM rather than the physical resources used by a VM, the cost impact is determined using the recommended CPU and memory changes and the allocation model CPU and memory base rates for the parent cluster or host.
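To contrast the two approaches with numbers, here is a small sketch for an oversized VM. The base rates and their units (cost per GB and cost per vCPU over some billing period) are placeholders I made up; use the rates from your own cost configuration.

```python
def demand_model_savings(reclaimable_memory_gb: float,
                         memory_rate_per_gb: float) -> float:
    """Demand model: savings come from reclaimable (physical) capacity.

    CPU contributes nothing for an oversized VM because its reclaimable CPU is 0.
    """
    return reclaimable_memory_gb * memory_rate_per_gb


def allocation_model_savings(vcpus_removed: int, memory_gb_removed: float,
                             cpu_rate_per_vcpu: float,
                             memory_rate_per_gb: float) -> float:
    """Allocation model: savings come from the change in configured resources."""
    return vcpus_removed * cpu_rate_per_vcpu + memory_gb_removed * memory_rate_per_gb


# Hypothetical oversized VM: recommendation removes 4 vCPUs and 16 GB,
# of which only 4 GB is reclaimable physical memory.
print(demand_model_savings(4.0, memory_rate_per_gb=2.0))       # 8.0
print(allocation_model_savings(4, 16.0, cpu_rate_per_vcpu=10.0,
                               memory_rate_per_gb=2.0))        # 72.0
```

The same recommendation produces very different savings figures depending on the capacity model, which is why it helps to know which model the parent cluster uses before presenting numbers to a VM owner.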
Cost Impact Metrics
Cost|Potential Savings: This metric contains all of the potential savings opportunities for the VM. Savings can come from rightsizing an oversized VM or from reclamation when the VM is powered off, idle, or has old snapshots. If allocation model is enabled, potential savings will be based on allocation model. Otherwise, it will be based on demand model.
Cost|Potential Cost Increase: This metric shows the potential cost increase from rightsizing the VM if it’s undersized. If allocation model is enabled, potential cost increase will be based on allocation model. Otherwise, it will be based on demand model.
Calculating the Recommended Size for VMs should be less of a mystery now. I hope this explanation of how Recommended Size is calculated helps earn your trust in the recommendations offered by VMware Aria Operations and helps empower you to have the rightsizing conversations with your VM and application owners. All of this is to achieve the best performance of your vSphere infrastructure and your VMs, reclaim unused capacity, and quantify cost savings.
If you’re not a VMware Aria Operations customer, you can sign up for an evaluation copy or sign up for a trial of the SaaS edition at VMware Aria Operations.
You can learn more about other VMware Aria Operations features on TechZone.