In our previous post, we shared that capacity management in the software-defined data center needs to be split into two aspects:
- VM level
- Infrastructure level
In this post, I will cover capacity management at the VM level. I shared earlier that this aspect should be managed by the application team (which is the customer of the infrastructure team).
There are some tips you can give to your customers and policies you can set up to keep things simple. To begin with, keep the building blocks simple: one VM, one OS, one application, one instance, as shown in the following diagram. Avoid having one OS run the web, app, and DB servers, and avoid having one Microsoft SQL Server run five database instances. Such workloads become harder to predict because you cannot isolate them. For a production environment, size for the peak workload; a month-end VM needs to be sized based on its month-end workload. For a non-production environment, you may want to tell the application team to opt for a smaller VM, because the vSphere cluster where the VM runs is oversubscribed, and a large VM may not get all the vCPUs it asks for if it asks for too many.
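To make the sizing advice concrete, here is a rough Python sketch that picks a standard size from an observed peak plus some headroom. The sample metrics, core speed, and standard sizes are made-up assumptions, so treat it as an illustration rather than a formula.

```python
# Rough illustration; the sample metrics, core speed, and standard sizes below
# are assumptions. Replace them with data exported from vRealize Operations.

STANDARD_VCPU = [1, 2, 4, 8]          # assumed standard building blocks
STANDARD_VRAM_GB = [2, 4, 8, 16, 32]

def size_for_peak(cpu_mhz_samples, mem_gb_samples, core_speed_mhz=2600, headroom=0.2):
    """Pick the smallest standard size that covers the observed peak plus headroom."""
    need_mhz = max(cpu_mhz_samples) * (1 + headroom)
    need_gb = max(mem_gb_samples) * (1 + headroom)
    vcpu = next((s for s in STANDARD_VCPU if s * core_speed_mhz >= need_mhz),
                STANDARD_VCPU[-1])
    vram = next((s for s in STANDARD_VRAM_GB if s >= need_gb), STANDARD_VRAM_GB[-1])
    return vcpu, vram

# Production: size for the month-end peak with 20% headroom.
print(size_for_peak([1500, 3200, 4800], [6, 9, 13], headroom=0.2))   # -> (4, 16)
# Non-production: accept a tighter fit, since the cluster is oversubscribed anyway.
print(size_for_peak([1500, 3200, 4800], [6, 9, 13], headroom=0.0))   # -> (2, 16)
```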
Be careful with VMs that have two distinct peaks: one for CPU resources and another for memory resources. I have seen this with a telecommunications client running Oracle Hyperion. For example, the first peak needs 8 vCPUs and 12 GB vRAM, while the second needs 2 vCPUs and 48 GB vRAM. In this case, the application team tends to size for 8 vCPUs and 48 GB vRAM. The outcome is an unnecessarily large VM, which can perform poorly at both peaks. It is likely that two different workloads are running in the VM, and they should be split into two VMs.
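A quick way to spot this pattern is to check whether the CPU peak and the memory peak happen at the same time. The Python sketch below does this against made-up hourly samples; in practice you would feed it the VM's metrics from vRealize Operations.

```python
# Check whether a VM's CPU peak and memory peak coincide. If they do not,
# sizing one VM for both peaks wastes resources and hints that two workloads
# share the VM. The hourly samples below are illustrative assumptions.

cpu_samples = [10, 15, 85, 90, 20, 12, 18, 14]   # vCPU utilization, %
mem_samples = [30, 28, 35, 33, 31, 80, 88, 34]   # vRAM utilization, %

def peaks_coincide(cpu, mem, window=1):
    """True if the highest CPU and memory samples are within `window` samples."""
    cpu_peak_at = cpu.index(max(cpu))
    mem_peak_at = mem.index(max(mem))
    return abs(cpu_peak_at - mem_peak_at) <= window

if not peaks_coincide(cpu_samples, mem_samples):
    print("CPU and memory peak at different times -> consider splitting the VM")
```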
You also need to size correctly. Educate the application team that oversizing results in slower performance in the virtual world. Although I encourage standardizing VM sizes to keep life simple, you should be flexible for large or extra-large cases. For example, once you pass 8 vCPUs, consider every additional vCPU carefully: make sure the VM really needs it and that the application can actually take advantage of the extra threads. You also need to verify that the underlying ESXi host has sufficient physical cores, as this affects your consolidation ratio and hence your capacity management. Otherwise you may see an ESXi host that is largely idle while the VMs on it are not performing, which undermines your confidence in adding more VMs.
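If you want to automate that sanity check, the Python sketch below flags large VMs and compares their vCPU count with the physical cores of their host. The inventory data and the thresholds are assumptions; you would pull the real numbers from vCenter or vRealize Operations.

```python
# Flag VMs whose vCPU count is large in absolute terms or relative to the
# physical cores of the ESXi host they run on. Inventory data is assumed.

hosts = {"esxi-01": {"physical_cores": 16}}      # hypothetical host
vms = [
    {"name": "app-01", "host": "esxi-01", "vcpu": 12},
    {"name": "db-01",  "host": "esxi-01", "vcpu": 4},
]

for vm in vms:
    cores = hosts[vm["host"]]["physical_cores"]
    if vm["vcpu"] > 8:
        print(f'{vm["name"]}: {vm["vcpu"]} vCPUs -- confirm the application '
              f'really scales past 8 threads')
    if vm["vcpu"] > cores * 0.5:
        print(f'{vm["name"]}: uses more than half of the {cores} physical cores '
              f'on {vm["host"]}, which hurts the consolidation ratio')
```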
At the VM level, you need to monitor the following five components of the VM's infrastructure portion (see the sketch after this list):
- Virtual CPU
- Virtual RAM
- Virtual network
- Virtual disk IOPS
- Usable disk capacity left in the guest OS
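Here is a rough Python sketch of such a check across the five components. The metric names and thresholds are assumptions; map them to the corresponding vRealize Operations metrics and to the thresholds that fit your tiers.

```python
# Check the five VM-level components against thresholds. Metric names and
# threshold values are illustrative assumptions.

THRESHOLDS = {
    "vcpu_usage_pct":      80,    # virtual CPU
    "vram_usage_pct":      85,    # virtual RAM
    "vnic_usage_pct":      70,    # virtual network
    "vdisk_iops":          5000,  # virtual disk IOPS
    "guest_disk_free_pct": 15,    # usable capacity left in the guest OS (minimum)
}

def check_vm(name, metrics):
    for metric, threshold in THRESHOLDS.items():
        value = metrics[metric]
        # Free disk space breaches when it drops below the threshold;
        # everything else breaches when it rises above it.
        breached = (value < threshold if metric == "guest_disk_free_pct"
                    else value > threshold)
        if breached:
            print(f"{name}: {metric} = {value} breaches threshold {threshold}")

check_vm("app-01", {"vcpu_usage_pct": 92, "vram_usage_pct": 60,
                    "vnic_usage_pct": 20, "vdisk_iops": 1200,
                    "guest_disk_free_pct": 9})
```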
Getting vCPU and vRAM into a healthy range requires finding a balance: undersizing leads to poor performance, and oversizing leads to monetary waste as well as poor performance. The actual healthy range depends on your expected utilization, and it normally varies from tier to tier. It also depends on the nature of the workload (online versus batch). For example, in tier 1 (the highest tier), you will set a lower range for an OLTP type of workload, as you do not want to hit 100% at peak; the overall utilization will be low because you are catering for a spike. For a batch workload, you normally tolerate a higher range, as long-running batch jobs tend to consume all the resources given to them. In a non-production environment, you also tolerate a higher range, as the business expectation is lower (because they are paying a lower price).
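As a rough illustration, the Python sketch below encodes per-tier healthy ranges and classifies a VM's peak utilization against them. The percentages are assumptions; set them to whatever you promise for each tier and workload type.

```python
# Per-tier "healthy ranges" for peak utilization. The percentages below are
# illustrative assumptions, not recommendations.

HEALTHY_RANGE = {
    # tier: {workload type: (low %, high %)}
    "tier1": {"oltp": (20, 60), "batch": (40, 85)},
    "tier3": {"oltp": (40, 85), "batch": (50, 95)},   # lower cost, lower expectation
}

def classify(tier, workload, peak_util_pct):
    low, high = HEALTHY_RANGE[tier][workload]
    if peak_util_pct < low:
        return "oversized (monetary waste)"
    if peak_util_pct > high:
        return "undersized (performance risk)"
    return "healthy"

print(classify("tier1", "oltp", 92))   # -> undersized (performance risk)
print(classify("tier3", "batch", 70))  # -> healthy
```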
Generally speaking, the virtual network is not something you need to worry about from a capacity point of view. You can create a super metric in vRealize Operations that tracks the maximum vNIC utilization across all VMs. If the maximum is, say, 80%, then you know that the rest of the VMs are below that. You can then plot a chart that shows this peak utilization over the last three months. We will cover this in more detail in one of the use cases discussed in the final chapter.
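If you want to prototype the idea outside vRealize Operations, the Python sketch below mimics that super metric: it takes the maximum vNIC utilization across all VMs at each collection interval and then reports the peak over the period. The sample numbers are made up.

```python
# Mimic the "max vNIC utilization across all VMs" super metric on exported
# data. Per-VM samples (%) per collection interval are illustrative assumptions.

vnic_usage = {
    "web-01": [12, 18, 25, 80, 22],
    "app-01": [ 5,  9, 11, 14, 10],
    "db-01":  [30, 28, 35, 40, 33],
}

# The "super metric": the maximum across all VMs at each interval.
max_per_interval = [max(samples) for samples in zip(*vnic_usage.values())]
print(max_per_interval)        # -> [30, 28, 35, 80, 33]
print(max(max_per_interval))   # peak over the whole period -> 80
# If this peak stays at, say, 80%, every other VM is below that.
```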
You should also monitor the usable disk capacity left inside the guest OS. vCenter does not provide this information, but vRealize Operations does, provided the VM has VMware Tools installed (which it should as a best practice).
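As a rough sketch, the pyVmomi snippet below reads the guest disk information that VMware Tools reports and flags guests that are low on space. The vCenter host name, credentials, and the 15% threshold are placeholders.

```python
# Read usable disk capacity left inside the guest OS via pyVmomi. This data is
# only populated when VMware Tools is running in the VM. Host name, credentials,
# and the warning threshold are placeholder assumptions.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()          # lab only; verify certs in production
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)

for vm in view.view:
    for disk in vm.guest.disk:                  # empty if VMware Tools is absent
        free_pct = 100.0 * disk.freeSpace / disk.capacity
        if free_pct < 15:                       # assumed warning threshold
            print(f"{vm.name} {disk.diskPath}: only {free_pct:.1f}% free")

view.Destroy()
Disconnect(si)
```

In practice you would only schedule something like this if you do not already get the same information from vRealize Operations.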
You should use Reservation sparingly, as it impacts the HA slot size, increases management complexity, and prevents you from oversubscribing. In tier 1, where there is no oversubscription because you are guaranteeing resources to every VM, a reservation is unnecessary from a capacity management point of view. You may still use it if you want a faster boot, but that is not a capacity consideration. In tier 3, where cost is the number-one factor, using Reservation prevents you from oversubscribing, which negates the purpose of tier 3 in the first place.
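The arithmetic below is a small, illustrative Python sketch of why full reservations and oversubscription do not mix: with reservation-based admission control, the reserved memory caps the number of VMs per host regardless of what the VMs actually use. The numbers are assumptions.

```python
# Why full reservations defeat an oversubscribed (tier 3) cluster.
# All figures below are illustrative assumptions.

host_memory_gb    = 256
vm_active_gb      = 4      # what a typical VM actually uses
vm_reservation_gb = 16     # full reservation of the configured memory

# Without reservations, capacity planning can follow actual usage:
print(host_memory_gb // vm_active_gb, "VMs per host based on active memory")       # 64
# With full reservations, the reservation itself becomes the limit:
print(host_memory_gb // vm_reservation_gb, "VMs per host based on reservations")   # 16
```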
You should avoid using Limit as it leads to unpredictable performance. The guest OS does not know that it is artificially limited.
I hope you find this useful. We will cover capacity management at the infrastructure level in my next blog post.
This post is adapted from the vRealize Operations Performance and Capacity Management book by Iwan ‘e1’ Rahabok.