This is part 3 of the mini Tech Tip series for vSphere Capacity Planning. In part 1 of this series I covered the importance of using the right metrics and tuning the knobs to assess your capacity risk. In part 2 of this series I covered which visuals you can use in vCenter Operations to answer key questions around datacenter or cluster wide capacity usage and risk monitoring.
In this third Tech Tip, I will cover some insights into right-sizing your VMs. There are two areas that I’d like to share: 1) how vC Ops’ stress-based analytics work to right-size VMs and clusters, and 2) a methodology you can use to identify, report on and reclaim unused resources.
Pretty much every customer I talk to has the potential to reclaim unused resources from their virtual infrastructure. One large enterprise that I recently talked to had more than 600 oversized VMs out of the 1,000 VMs that vCenter Operations was managing. This maps not only to vCPUs that you can reclaim but also to unused storage, which at this scale may run into petabytes. Do the math and look like a hero…
Table for 2 vs Table for 8?
But let me start with an analogy. When you go to a restaurant and ask for a table for 2 vs a table for 8, what do you expect? Either a ‘just a minute, please’ or a ‘20 minutes, please’.
Let’s come back to the virtual world. When you ask for a 2 vCPU VM vs an 8 vCPU VM, what do you expect? Well, the answer is the same. VMs with fewer vCPUs will get scheduled faster than VMs with more vCPUs, because the scheduler takes longer to find an opening in which it can schedule the larger VM.
If you want to see this for yourself, go ahead: open the ‘All Metrics’ tab in vC Ops and trend the metric CPU Usage| Co-stop along with Demand etc….
Takeaway: giving more vCPUs to a VM does not necessarily increase its performance. Next time that app owner comes asking you for more vCPUs, tell him/her this analogy … but wait, even better, get them the real data below….
a) How do vC Ops’ stress based analytics work?
vC Ops uses pretty sophisticated analytics to figure out whether your VM is undersized for its expected peak usage period. It is not just threshold based or just peak based but does something smarter. It learns the behavior of your VM over time and creates a profile of, say, what a typical Monday 9-10 looks like. It then measures how much time the VM spends in the stress zone, which is often defined by the production or app teams, e.g. my VMs should not go above 70% CPU demand. If they do, the app may be incurring a performance degradation. See figure below. It does a similar, inverse analysis to determine whether the VM is oversized.
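To make the idea concrete, here is a minimal Python sketch of a profile-plus-stress-zone calculation. It is not vC Ops’ actual algorithm, which is not documented here; the function names, the hour-of-week bucketing, the sample data and the 20% idle floor are all assumptions made for illustration. Only the 70% stress threshold comes from the example above.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hour_of_week_profile(samples):
    """Average CPU demand (%) per (weekday, hour) slot.

    `samples` is an iterable of (datetime, demand_percent) pairs.
    This bucketing is an assumed stand-in for "what a typical
    Monday 9-10 looks like".
    """
    buckets = defaultdict(list)
    for ts, demand in samples:
        buckets[(ts.weekday(), ts.hour)].append(demand)
    return {slot: sum(vals) / len(vals) for slot, vals in buckets.items()}


def stress_fraction(profile, threshold=70.0):
    """Fraction of profiled slots whose average demand exceeds the threshold."""
    if not profile:
        return 0.0
    return sum(1 for d in profile.values() if d > threshold) / len(profile)


def idle_fraction(profile, floor=20.0):
    """Inverse check: fraction of slots below an (assumed) idle floor."""
    if not profile:
        return 0.0
    return sum(1 for d in profile.values() if d < floor) / len(profile)


# Hypothetical data: two weeks of hourly samples for a VM that is busy
# (85% demand) on weekdays from 9:00 to 17:00 and quiet (10%) otherwise.
start = datetime(2024, 1, 1)  # a Monday
samples = []
for h in range(14 * 24):
    ts = start + timedelta(hours=h)
    busy = ts.weekday() < 5 and 9 <= ts.hour < 17
    samples.append((ts, 85.0 if busy else 10.0))

profile = hour_of_week_profile(samples)
print(f"time in stress zone: {stress_fraction(profile):.1%}")
print(f"time near idle:      {idle_fraction(profile):.1%}")
```

A VM that spends a large share of its profiled slots above the threshold is a candidate for being undersized, while one that almost never leaves the idle band is a candidate for being oversized. Where exactly to draw those lines is a policy decision, not something this sketch decides.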
b) Simple step by step approach to right size
I recommend a simple three-step approach to right-size VMs:
1) Identify 2) Profile & Tune 3) Report with historical CPU and Memory demand trends
In vC Ops → Planning → Views, click on the Oversized Virtual Machines view. You can also review the Capacity Optimization report under Reports. This gives you a list of VMs that are oversized based on the stress analytics. Similarly, you can identify undersized VMs. You can tweak the settings that drive the sizing recommendations under ‘Configuration’.
Next, review the Avg CPU Demand profile of the workload over time, under Operations → All Metrics.
You can also view the Stress profile on the Dashboard. The first heatmap below is an example of VMs in a Citrix farm: they are busy from 9 to 5. It would look similar for Exchange, AD and other employee-driven server workloads. Web servers, on the other hand, would probably be yellowish green throughout, showing that they are constantly busy… Simply apply the production policy, which is provided and tuned out of the box, to production workloads. Apply the interactive policy to web-server-type workloads if you like. The second heatmap shows a batch workload that is busy for only a few hours in the week, e.g. a backup server. It also aligns nicely with the trend of its avg CPU Demand in the graph below, and you could choose to apply the batch workload policy.
Next, you could simply report VMs and their usage trends via a custom dashboard like this one, which you can share with your app owner/LOB team.
Don’t forget to check out part 4, the final part of this Tech Tip series, on how to optimize your infrastructure utilization and realize savings!