Resource contention is one of the most critical issues in any virtualized environment. When contention occurs, applications slow down and your users are affected. Up until now, two different methodologies have been employed to mitigate the risk of contention, with varied results. But now I want to introduce you to the new “game changing” method available from VMware: Predictive DRS! First, though, a bit of a history lesson on the original two methods.
The first of these is the Reactive Method, which focuses on resolving unexpected resource demand. The most widely used example of a reactive solution is VMware’s Distributed Resource Scheduler, or DRS. As the day progresses, workloads may need more resources, which can lead to contention on the host. The reactive method moves VMs around to ensure all workloads get the resources they need and applications remain healthy. Note that this method needs only a minimal number of VMs to be moved in order to be effective, which means minimal overhead. Because the reactive method only moves VMs when contention approaches, it is possible (however remote the chance) for users to feel some effects of the contention before it is resolved.
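To make the reactive pattern concrete, here is a minimal Python sketch. It is only an illustration of the idea, not VMware’s actual DRS logic: each host is modeled as a plain dict with a capacity and a list of VM demands, and the 90% trigger threshold is an invented number.

```python
CONTENTION_THRESHOLD = 0.90  # invented trigger point, not DRS's real one

def utilization(host):
    """Fraction of a host's capacity consumed by its VMs."""
    return sum(vm["demand"] for vm in host["vms"]) / host["capacity"]

def reactive_moves(hosts):
    """Move the smallest VM off any host approaching contention.

    Hosts below the threshold are never touched, which is why this
    pattern generates so few migrations.
    """
    moves = []
    for host in hosts:
        while utilization(host) > CONTENTION_THRESHOLD and len(host["vms"]) > 1:
            vm = min(host["vms"], key=lambda v: v["demand"])  # cheapest move
            dest = min(hosts, key=utilization)
            if dest is host:
                break  # nowhere less busy to migrate to
            host["vms"].remove(vm)
            dest["vms"].append(vm)
            moves.append((vm["name"], host["name"], dest["name"]))
    return moves
```

Note that only the host nearing the threshold triggers any action; everything else is left alone until it actually becomes a problem.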
The next approach, the Balanced Method, has become more popular recently. This method focuses on balancing workloads across hosts and clusters to mitigate the risk of growing workloads. Spreading workloads out evenly can be an effective way to avoid hot spots. However, the constant daily movement of VMs means the potential for overhead is huge. There is also no guarantee that a balanced environment will avoid all contention, so the Reactive Method is still needed here. VMware DRS can be configured to use the Balanced Method and, together with vRealize Operations, can balance workloads both within and across clusters in your environment. DRS and vR Ops let you set the degree of balance you are trying to drive, thus controlling how much overhead you are willing to tolerate to obtain that balance. In vSphere 6.5, an additional policy is exposed in the DRS configuration screen that provides the option to balance based on the number of VMs rather than load (although in cases of extreme resource contention, the VM count may become uneven in order to keep workloads happy). It’s important to note that this method is very popular among niche vendor products. They may give it a fancy name to make it seem like it is doing something more, but in the end it falls into the balance bucket.
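For contrast, here is a similarly simplified sketch of the balanced pattern. Again, this is illustrative only and not the real DRS implementation; the 10% tolerance is a made-up stand-in for the “amount of balance” knob described above. Notice how it keeps migrating until the spread between hosts closes, which is exactly where the extra overhead comes from.

```python
BALANCE_TOLERANCE = 0.10  # invented stand-in for the "amount of balance" knob

def utilization(host):
    """Fraction of a host's capacity consumed by its VMs."""
    return sum(vm["demand"] for vm in host["vms"]) / host["capacity"]

def balance_moves(hosts):
    """Migrate VMs until host utilizations sit within the tolerance band.

    Unlike the reactive pattern, this acts even when no host is anywhere
    near contention, so it can generate many more migrations.
    """
    moves = []
    while True:
        busiest = max(hosts, key=utilization)
        idlest = min(hosts, key=utilization)
        spread = utilization(busiest) - utilization(idlest)
        if spread <= BALANCE_TOLERANCE:
            return moves
        vm = min(busiest["vms"], key=lambda v: v["demand"])
        if vm["demand"] / idlest["capacity"] >= spread:
            return moves  # no move would actually narrow the spread
        busiest["vms"].remove(vm)
        idlest["vms"].append(vm)
        moves.append((vm["name"], busiest["name"], idlest["name"]))
```

Tightening the tolerance drives a more even spread at the cost of more migrations, which mirrors the trade-off the balance setting controls.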
The newest approach is the Predictive Method, currently offered only through VMware’s new Predictive DRS option. Predictive DRS uses a combination of DRS and vRealize Operations Manager to predict future demand and determine when and where hot spots will occur. When future hot spots are found, Predictive DRS moves the workloads long before any contention can occur. Even better, this means that with the Predictive Method only the required workloads are moved, resulting in minimal overhead.
So how does Predictive DRS work? It starts by leveraging the Dynamic Thresholds of vRealize Operations (one of its core functions), which capture the behavior of every workload throughout the day. vRealize Operations collects hundreds of metrics across numerous types of objects (hosts, datastores, VMs, etc.) every day. Each night it runs its Dynamic Threshold calculations, which use sophisticated analytics to create a band of what is “normal” for each metric/object combination. The band has an upper and lower bound of normal for each metric associated with an object. For example, for a simple app-server VM, the band will show that the VM does not use much CPU early in the morning. But at 8am, when people start logging into the system, CPU load spikes very high. It then tapers off around noon as people go to lunch, and climbs back up for the rest of the afternoon until people go home. And don’t forget the nightly reports, which run at 2am and spike CPU again.
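As a rough illustration of the band idea, the sketch below derives an hour-by-hour “normal” range from historical samples using mean ± 2 standard deviations. To be clear, vRealize Operations’ actual Dynamic Threshold analytics are far more sophisticated than this; a simple mean-and-deviation band is just an easy-to-follow approximation of the concept.

```python
from statistics import mean, stdev

def dynamic_thresholds(history, k=2.0):
    """Map hour-of-day -> (lower, upper) band of "normal" for one metric.

    `history` maps each hour to a list of past samples for that hour;
    mean +/- k standard deviations is a crude stand-in for the real
    analytics engine.
    """
    bands = {}
    for hour, samples in history.items():
        mu, sigma = mean(samples), stdev(samples)
        bands[hour] = (max(0.0, mu - k * sigma), mu + k * sigma)
    return bands

def is_anomalous(bands, hour, value):
    """True when a reading falls outside the band of normal for its hour."""
    lower, upper = bands[hour]
    return not (lower <= value <= upper)
```

With the app-server pattern above, an 80% CPU reading at 6am would fall outside its band, while the same reading during the 8am login spike would be business as usual.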
The great thing about Dynamic Thresholds is that they are tailored to each individual VM and application. There is nothing you need to do; the analytical engine in vRealize Operations takes care of everything.
Once vRealize Operations has calculated its Dynamic Thresholds, we have three fundamental data points:
- How many resources will each VM need throughout the day?
- Which VMs are running on which hosts?
- How big is each host?
Once we have those, we can ask the most important contention mitigation question of all: “Will any of my hosts struggle to serve my workloads today?” If the answer is “Yes,” then let’s move a few VMs around to avoid that future contentious situation. That’s Predictive DRS in a nutshell!
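The nutshell above can be sketched in a few lines: given each VM’s predicted hourly demand, the current placement, and each host’s capacity, flag the hosts and hours where predicted demand would exceed what the host can serve. This is an illustration of the idea only, not the actual Predictive DRS algorithm, and the dict-based data model is an assumption made for the example.

```python
def future_hot_spots(hosts):
    """Return (host, hour, demand) wherever predicted demand exceeds capacity.

    Each VM carries a 24-entry "predicted" list of hourly demand (the
    output of something like the dynamic-threshold step); each host has
    a fixed capacity and its currently placed VMs.
    """
    hot_spots = []
    for host in hosts:
        for hour in range(24):
            demand = sum(vm["predicted"][hour] for vm in host["vms"])
            if demand > host["capacity"]:
                hot_spots.append((host["name"], hour, demand))
    return hot_spots
```

Any flagged (host, hour) pair is a migration candidate: move one of that host’s VMs somewhere quieter before the flagged hour arrives, and the contention never happens.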
Which Method Should You Use?
So Predictive DRS is a game changer, but in a complex datacenter like yours, all three risk mitigation methods are required to fully manage contention. Fortunately, VMware DRS and vRealize Operations offer all three, in one cohesive solution.
You do not have to settle for one or the other, and most importantly, you are in control with vRealize Operations and VMware DRS. For more information on the latest vRealize Operations release, please read the vRealize Operations 6.4 blog.
Comments (13)
Fantastic explanation! Very impressed. Let’s get the word out that this is obviously the way all workloads should be reviewed in order to justify the proper combination.
What if “normal behavior” is utilization at 100%?
Do you mean the entire environment is at 100% for a given resource, or for all of them? In the case that you have, say, 100% memory utilization across all the hosts, core hypervisor tools would kick in. The balloon driver would try to free memory pages so the host can reprioritize them to more important VMs (as defined by shares). Mechanisms like SIOC and NIOC would similarly come into play for storage and network contention. All of those are leveraged by DRS for placement decisions, but frankly, if you run out of a resource there isn’t much you can do short of stealing from less important VMs to give to higher-priority ones. If you could give a bit more detail on what you mean by 100%, I could try to be more specific.
Re the 100% utilization question: If the anticipated utilization of a cluster is 100%, you can still move workloads around to make sure the load is balanced across cluster members and no single node takes on too much work while others are “slacking off.” In fact, it’s when the anticipated workload is at or near 100% that predictive algorithms really shine, since they can proactively prepare for extreme utilization spikes.