Disclaimer:

The content on this blog is no longer accurate. It was written for vSphere releases that were supported at the time of its writing (2016). Please see the following blogs for up to date information:

https://blogs.vmware.com/vsphere/2020/03/vsphere-7-improved-drs.html

https://blogs.vmware.com/vsphere/2020/05/vsphere-7-a-closer-look-at-the-vm-drs-score.html

Recently, a customer reported that DRS was not working to load balance the cluster. Under normal circumstances, a minor imbalance is nothing to be concerned about, because the main objective of DRS is not to balance the load perfectly across every host. Rather, DRS monitors resource demand and works to ensure that every VM gets the resources to which it is entitled. When DRS determines that a better host exists for a VM, it makes a recommendation to move that VM.

However, some customers still prefer to have an even distribution of utilization across all hosts within a cluster. This article is intended to provide recommendations to accomplish this goal, bearing in mind that in most cases this will result in additional vMotion activity.

Migration Threshold

This threshold is a measure of how much cluster imbalance is acceptable based on CPU and memory loads. The slider is used to select one of five settings that range from the most conservative (1) to the most aggressive (5). The further the slider moves to the right, the more aggressively DRS will work to balance the cluster.

A priority level for each migration recommendation is computed using the load imbalance metric of the cluster. This metric is displayed as Current Host Load Standard Deviation in the cluster’s Summary tab in the vSphere Web Client.

A cluster with a higher cluster imbalance will lead to higher priority migration recommendations.

For more information about this metric and how a recommendation priority level is calculated, see the VMware Knowledge Base article “Calculating the priority level of a VMware DRS migration recommendation.”

When the migration threshold is set to a more aggressive setting, the Target Host Load Deviation value is lowered. The more aggressive the migration threshold, the smaller the target range. As long as the Current Host Load Standard Deviation is less than or equal to this target value, DRS will consider the cluster balanced.
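The comparison described above can be sketched in a few lines of Python. This is only an illustration, not the DRS implementation: it assumes each host's load is expressed as a fraction between 0.0 and 1.0, it assumes a population standard deviation, and the two target values are made-up numbers standing in for a conservative and an aggressive threshold.

```python
from statistics import pstdev


def host_load_std_dev(host_loads):
    """Population standard deviation of per-host loads (assumed 0.0-1.0)."""
    return pstdev(host_loads)


def is_balanced(host_loads, target_std_dev):
    """Balanced when the current deviation does not exceed the target."""
    return host_load_std_dev(host_loads) <= target_std_dev


# Hypothetical four-host cluster; loads are illustrative values only.
loads = [0.60, 0.40, 0.50, 0.50]
current = host_load_std_dev(loads)

print(round(current, 4))                        # about 0.0707
print(is_balanced(loads, target_std_dev=0.2))   # looser (hypothetical) target
print(is_balanced(loads, target_std_dev=0.05))  # tighter (hypothetical) target
```

With the looser target the cluster counts as balanced, while the tighter target leaves the same cluster out of balance, which is why a more aggressive threshold produces more migrations.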

If you are viewing the CPU and MEM utilization from the vSphere Web Client and conclude that your cluster is not as balanced as you’d like, check the migration threshold setting in your DRS-enabled cluster to ensure it’s not set to a value that is too conservative. Moving this slider to a more aggressive threshold will lower the target standard deviation value and cause DRS to execute more migrations to achieve a better load balance.

Setting this value to Priority 4 (the second setting from the right) offers a good compromise for those who want an evenly balanced cluster without executing too many migrations. Setting it to Priority 5 will offer the most even load balance, but will result in frequent vMotions that may not provide any performance benefit to the VMs.

Active vs. Consumed Memory

Have you ever looked at the host utilization for MEM in the vSphere Web Client and wondered why the hosts appear to be imbalanced while DRS says the cluster is balanced? Well, there is a reason for that. The vSphere Web Client shows consumed memory, while DRS uses active memory when calculating the current host load. This disparity is especially common in environments where VM memory is mostly idle, but the VM once consumed most of its allocation. The consumed memory metric is the maximum amount of memory used by the VM at any point in its lifetime, even if the VM is not actually using most of this memory. By contrast, active memory is an estimate of the amount of memory that the VM is actively using right now; it is estimated using a memory sampling algorithm which is run every 5 minutes.

The consumed memory metric is not the most accurate way to measure the effectiveness of DRS. To get a better indication of the current host load, look at the active memory counter for each host to decide whether the host loads are balanced with one another. Specifically, DRS uses (Active Memory) + (25% of Idle Memory) to determine the value for the current host load.

Starting with vSphere 5.5, an advanced setting called PercentIdleMBInMemDemand was introduced that can be used to change this behavior. Increasing this value will cause DRS to use more of the consumed memory value when making DRS calculations. This often results in higher priority recommendations being generated. The higher the priority, the more likely DRS will move the VM to another host in order to achieve a better cluster balance.

Valid values for PercentIdleMBInMemDemand range from 0 to 100. A value of 0 would cause DRS to use only active memory for all calculations, and 100 would cause DRS to use only consumed memory. In recent lab tests, setting PercentIdleMBInMemDemand=50 exhibited very positive results. Changing this value increased the number of higher priority recommendations, on which the default migration threshold (3) would take action to move the VMs. This also resulted in a more even balance of cluster load as represented in the vSphere Web Client, since DRS is using a value closer to the consumed memory metric.

Setting PercentIdleMBInMemDemand to 100 can also be done, which would cause DRS to use consumed memory. This is the same value as displayed by the vSphere Web Client. However, this should only be set to 100 in environments where memory is not over-committed, otherwise unexpected results may occur.
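To make the effect of PercentIdleMBInMemDemand concrete, here is a small Python sketch of the formula described above. It is a simplified illustration rather than the actual DRS code: the function name is hypothetical, the VM sizes are invented, and it assumes that idle memory is simply consumed memory minus active memory.

```python
def drs_memory_demand_mb(active_mb, consumed_mb, percent_idle=25):
    """Memory demand as active + percent_idle% of idle memory.

    Assumption: idle memory = consumed - active. percent_idle plays the
    role of PercentIdleMBInMemDemand (valid range 0 to 100, default 25).
    """
    if not 0 <= percent_idle <= 100:
        raise ValueError("percent_idle must be between 0 and 100")
    idle_mb = max(consumed_mb - active_mb, 0)
    return active_mb + idle_mb * percent_idle / 100


# Hypothetical VM: 2 GB active, 8 GB consumed.
for pct in (0, 25, 50, 100):
    print(pct, drs_memory_demand_mb(2048, 8192, pct))
```

For this example VM the demand grows from 2048 MB at 0 (active only) to 3584 MB at the default of 25, 5120 MB at 50, and 8192 MB at 100 (the full consumed value shown in the vSphere Web Client).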

Bursty CPU Workloads

AggressiveCPUActive is another setting that was first introduced in vSphere 5.5. This setting is intended to improve the CPU load balance in environments where the CPU is very active but spiky in nature. DRS will normally use the average over the past 5-minute period for CPU demand calculations. When the AggressiveCPUActive advanced setting is enabled, DRS will switch from using the average to the 80th percentile (i.e., the 2nd highest value in the interval). This helps in situations where DRS is not generating recommendations to move a VM based on CPU demand because the average demand over the past 5-minute period is much lower than needed to generate the recommendation.

Here is an example of the benefit of this advanced setting. Over a 5-minute period, a VM uses 50%, 70%, 10%, 5%, and 5%. This VM is very spiky in nature: it demands a lot of CPU for a very short period of time, and then moves to more of an idle state. The average 5-minute utilization is 28%, which may not be high enough to warrant a recommendation to move the VM to another host. However, when the AggressiveCPUActive setting is enabled, DRS will use 50%, which is the 80th percentile.
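The arithmetic in this example can be reproduced with a short Python sketch. It is only a model of the behavior described here, not DRS code: the function name is hypothetical, and it assumes five samples per 5-minute interval, so that the 80th percentile is the second-highest sample.

```python
from statistics import mean


def cpu_demand(samples, aggressive=False):
    """CPU demand for one interval.

    Default: the average of the samples.
    aggressive=True: the second-highest sample, standing in for the
    80th percentile of a five-sample interval (an assumption).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    if not aggressive:
        return mean(samples)
    ordered = sorted(samples, reverse=True)
    return ordered[1] if len(ordered) > 1 else ordered[0]


samples = [50, 70, 10, 5, 5]                # percent CPU, from the example
print(cpu_demand(samples))                  # average: 28
print(cpu_demand(samples, aggressive=True)) # second-highest value: 50
```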

Recommendations

To summarize, achieving a perfectly even load balance of hosts is not the primary intention of DRS. However, it can be configured to achieve this result at the expense of additional vMotion migrations.

Migration Threshold=4 | This is the first step to gain better load balance. This will cause DRS to act on priority 4 recommendations and migrate those VMs to other hosts in an effort to gain higher levels of load balance.
PercentIdleMBInMemDemand=50 | This will instruct DRS to factor in twice as much idle memory as the default when calculating the host load. This uses values more closely resembling those shown in the vSphere Web Client, and results in higher priority recommendations being generated. Combining this with the migration threshold setting above will cause even more recommendations to be generated, resulting in a very even balance across the hosts.
AggressiveCPUActive=1 | Use this in environments where CPU demand is spiky and the spikes could otherwise be missed. If you are seeing increased ready time in your VMs and DRS is not generating recommendations, this setting could improve the DRS effectiveness for these types of workloads.

If you choose NOT to change any default settings for DRS, you can rest assured that DRS is still working hard behind the scenes to ensure that all your VMs are getting the resources to which they are entitled.