Capacity Management in SDDC – Part 6

In Part 5, I explained a new concept, where we use Contention as the basis of Capacity Management in SDDC. In this Part 6, I will provide the super metric formula for each chart. We will cover Tier 1, followed by Tier 2 and 3.

Tier 1 (Highest)

To recap, we do not have over-subscription in Tier 1. We only have it in Tier 2 and 3. As a result, Tier 1 is simpler, as we are essentially following the Allocation model.

You should be performing capacity planning at Cluster level, not Data Center or Host level.

Compute: CPU

Supply: Total physical cores of all ESXi Hosts – HA buffer

  • We can count physical cores or physical threads. Counting cores is conservative, while counting threads is aggressive. The ideal number is about 1.5 times the physical core count. My recommendation: take the cores, not the threads, because this is Tier 1, your highest and best tier.
  • Threshold: 10% of your capacity, as it takes time to buy a cluster (which also needs storage). You are also not aiming to run your ESXi hosts at 100% utilization.
  • You do not have to build your threshold (which is actually your buffer) into the super metric formula, because it is dynamic. Once it is hard-coded in the super metric, changing it does not change the history. It is dynamic because it depends on the business situation. If there is a large project going live in a few weeks, then your buffer needs to cater for it. This is why we need to stay close to the business. It is also something you should know, based on your actual experience in your company. You have that gut feel and estimate.

Demand: Total vCPU for all the VMs.

  • If you are using virtual threads in your VMs, count each thread as a full vCPU. For example, a VM with 2 vCPU and 2 threads per core should be counted as 4 vCPU.

Super Metric Formula: Supply – Demand
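The CPU arithmetic above can be sketched in plain Python. This is not vRealize Operations super metric syntax, only an illustration of the calculation. The function names, the sample cluster, and the choice to model the HA buffer as the largest hosts in the cluster are all assumptions made for the example.

```python
def counted_vcpus(vcpus: int, threads_per_core: int = 1) -> int:
    """Count virtual threads as full vCPUs (2 vCPU x 2 threads per core = 4)."""
    return vcpus * threads_per_core


def cpu_cores_remaining(host_cores, ha_hosts, vms):
    """Supply minus Demand for one Tier 1 cluster, in physical cores."""
    # Assumption: the HA buffer is modeled as the largest `ha_hosts` hosts.
    ha_buffer = sum(sorted(host_cores, reverse=True)[:ha_hosts])
    supply = sum(host_cores) - ha_buffer
    # Demand: total vCPU of all VMs, with virtual threads counted as vCPUs.
    demand = sum(counted_vcpus(v, t) for v, t in vms)
    return supply - demand


# Hypothetical cluster: 8 hosts x 24 cores, one host reserved for HA.
hosts = [24] * 8
# (vCPU, threads per core) per VM: 30 VMs of 4 vCPU, 10 VMs of 2 vCPU x 2 threads.
vms = [(4, 1)] * 30 + [(2, 2)] * 10
remaining = cpu_cores_remaining(hosts, ha_hosts=1, vms=vms)
print(remaining)
```

The threshold is deliberately left out of the function, in line with the advice above not to hard-code it into the formula.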

Compute: RAM

Supply: Total physical RAM of all ESXi Hosts – HA buffer

  • No need to include ESXi vmkernel RAM as it’s negligible. If you are using VSAN and NSX, you can add some buffer. You do not need to include virtual appliances, as they take the form of VMs and are therefore already included in the Demand.
  • Threshold: set the same number, which is 10% in this example.

Demand: Total vRAM for all the VMs

Super Metric Formula: Supply – Demand
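As a rough illustration of the RAM calculation, the sketch below computes Supply minus Demand and then compares the result against the 10% threshold outside the formula. The host sizes, VM sizes, and function name are hypothetical, and the HA buffer is again assumed to be whole hosts.

```python
def ram_gb_remaining(host_ram_gb, ha_hosts, vm_vram_gb):
    """Return (supply, remaining) in GB for one Tier 1 cluster."""
    # Assumption: the HA buffer is modeled as the largest `ha_hosts` hosts.
    ha_buffer = sum(sorted(host_ram_gb, reverse=True)[:ha_hosts])
    supply = sum(host_ram_gb) - ha_buffer
    demand = sum(vm_vram_gb)  # total configured vRAM of all VMs
    return supply, supply - demand


# Hypothetical cluster: 8 hosts x 512 GB, one host reserved for HA.
supply, remaining = ram_gb_remaining([512] * 8, 1, [16] * 150 + [64] * 14)

# The threshold is kept outside the formula so it can change with the business.
threshold_pct = 10
needs_attention = remaining * 100 < supply * threshold_pct
print(supply, remaining, needs_attention)
```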

Compute: VM

Super Metric Formula: Max no of allowed VM in 1 cluster – No of VM in the cluster

  • I apply my Availability policy at cluster level since I think it makes more sense. Applying it at ESXi Host level is less applicable due to HA. Yes, the chance of a host going down is higher than that of an entire cluster going down. However, HA will reboot the VMs, and VM owners may not notice. On the other hand, if a cluster goes down, it’s a major issue.
  • The limitation of this super metric is that it assumes your cluster size does not vary. This is a fair assumption, as you should keep things consistent. If for some reason you have, say, 3 cluster sizes (e.g. 8, 10, 12), then you need 3 super metrics.
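A small sketch of the VM count formula, with one limit per cluster size as described above. The limits in the table are invented for illustration; the real numbers come from your own Availability policy.

```python
# Hypothetical policy: maximum VMs allowed per cluster, keyed by cluster size (hosts).
MAX_VMS_BY_CLUSTER_SIZE = {8: 100, 10: 125, 12: 150}


def vm_slots_remaining(cluster_hosts: int, vm_count: int) -> int:
    """Max allowed VMs in the cluster minus the VMs already in it."""
    return MAX_VMS_BY_CLUSTER_SIZE[cluster_hosts] - vm_count


print(vm_slots_remaining(10, 110))
```

Keeping one entry per cluster size mirrors the need for one super metric per size.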

Compute: Summary

Look at the above 3 charts as 1 group. Take the one with the lowest number.
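The three numbers are in different units (cores, GB, VMs), so comparing them needs some normalization. One possible approach, which is an assumption of this sketch and not something the charts do for you, is to express each as a fraction of its own supply and take the smallest. All values below are hypothetical.

```python
def binding_constraint(remaining, supply):
    """Return the dimension with the smallest remaining fraction of its supply."""
    fractions = {name: remaining[name] / supply[name] for name in remaining}
    name = min(fractions, key=fractions.get)
    return name, fractions[name]


# Hypothetical remaining capacity and supply for one cluster.
remaining = {"cpu_cores": 8, "ram_gb": 288, "vm_slots": 15}
supply = {"cpu_cores": 168, "ram_gb": 3584, "vm_slots": 125}

name, fraction = binding_constraint(remaining, supply)
print(name, round(fraction, 3))
```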

In an emergency, as a temporary solution, you can still deploy VMs while waiting for your new cluster to arrive. This is because you have the HA buffer, and ESXi hosts are known for their high uptime.

Storage

We have to measure both IOPS and Space. Take the lower of these 2 dimensions, because adding capacity for one also gives you the other. This also keeps your storage in simple building blocks.

  • For IOPS, it is simpler. Just take the maximum and average latency. If the maximum is nearing your SLA, you need to buy more capacity. You can take the maximum at Cluster level or Datastore Cluster level.
  • For Space, it is more complex. Below is the formula for Space.

Supply: Total datastore space capacity in the cluster

You should be using Datastore Cluster. Other than the benefits that you get from using it, it also makes capacity management easier. If you are using it, you need not manually exclude local datastores. You also need not manually group the shared datastores, which can be complex if you have multiple clusters.

With VSAN, you only have 1 datastore per cluster and need not exclude local datastores manually. This means it’s much simpler in VSAN.

Include a buffer for snapshots. This can be 20%, depending on your environment. This is why I’m not a fan of many small datastores, as you end up with pockets of unusable capacity. The buffer does not have to be hardcoded in your super metric, but you have to be mentally aware of it. If you need a visual reminder, chapter 8 of my book has a heat map sample to track it.

Storage space should be tied with your actual, physical capacity. If you are doing thin provisioning at the storage layer, then you need to measure it at this level. I prefer to use thin on VMware, and thick on physical array.

Demand: Total Storage consumed by all VMs in the shared datastore

The total vDisk figure depends on whether you are doing thin provisioning.

If you are not, then it is simple. Just total all the storage consumed by all VMs.

If you are, you will have 2 numbers. One for Configured and one for Utilized. The number you need is somewhere in between. You need to make a business call where you want to take it, as it depends on your environment. If the disk growth is relatively modest, then you can take closer to Utilized. If not, take closer to Configured.

Super Metric Formula: Supply – Demand
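For thin provisioning, the "somewhere in between" Demand can be sketched as a weighted blend of Utilized and Configured. The `weight` parameter is a hypothetical way to encode the business call: 0 means take Utilized, 1 means take Configured. The capacities are made up for the example.

```python
def space_gb_remaining(datastore_capacity_gb, configured_gb, utilized_gb, weight):
    """Supply minus Demand for shared datastore space, in GB."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    supply = sum(datastore_capacity_gb)
    # Demand sits between Utilized (weight=0) and Configured (weight=1).
    demand = utilized_gb + weight * (configured_gb - utilized_gb)
    return supply - demand


# Hypothetical datastore cluster: 4 datastores of 25,000 GB each.
capacity = [25_000] * 4
modest_growth = space_gb_remaining(capacity, 120_000, 60_000, weight=0.25)
rapid_growth = space_gb_remaining(capacity, 120_000, 60_000, weight=1.0)
print(modest_growth, rapid_growth)
```

A negative result, as in the second case, means the configured space exceeds the physical supply. Without thin provisioning, Configured and Utilized are the same number and the weight has no effect.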

The above will give you the information you need for Tier 1. Tier 2 and 3 will be different, as there is over-subscription. This means we cannot ignore contention.

Tier 2 and 3 (Lowest)

Compute: Summary

Super Metric Formula: Maximum (VM CPU Contention) in the cluster

Super Metric Formula: Average (VM CPU Contention) in the cluster

Super Metric Formula: Maximum (VM RAM Contention) in the cluster

Super Metric Formula: Average (VM RAM Contention) in the cluster

For the total number of VM left in the cluster, see Tier 1. It’s the same formula, just a different policy.
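The four contention super metrics are simply the maximum and the average of a per-VM value across the cluster. A minimal Python sketch of that aggregation follows. The contention percentages are invented sample data, not output from vRealize Operations.

```python
from statistics import mean


def contention_summary(vm_contention_pct):
    """Maximum and average of a per-VM contention metric across one cluster."""
    return {"max": max(vm_contention_pct), "avg": mean(vm_contention_pct)}


# Hypothetical per-VM contention values (%) in one Tier 2 cluster.
cpu = contention_summary([1.0, 2.0, 8.0, 1.0])
ram = contention_summary([0.0, 0.0, 4.0, 0.0])
print(cpu, ram)
```

The maximum shows the worst-served VM, while the average shows whether the problem is widespread. In this sample one VM suffers while the cluster average stays low.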

Storage

See Tier 1. It’s the same set of super metrics, just a different policy.

Network (all tiers)

Super Metric Formula: Max (VM Network Drop Packets) at the physical DC level

You should expect 0 dropped packets in the entire data center.

Super Metric Formula: Max (ESXi Host vmnic utilization) in the physical DC.

This number has to be below your physical capacity. Ideally, it leaves a buffer so the network can handle spikes from network-intensive events.

Super Metric Formula: Average (ESXi Host vmnic utilization) in the physical DC.
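The three network super metrics can be illustrated the same way. The sketch below assumes vmnic utilization is expressed in Mbps and that all uplinks share one link speed. Both are simplifications, and the sample values are hypothetical.

```python
from statistics import mean


def network_summary(vm_dropped_packets, vmnic_mbps, link_capacity_mbps):
    """Data-center-wide maximum drops, plus max and average vmnic utilization."""
    peak = max(vmnic_mbps)
    return {
        "max_dropped": max(vm_dropped_packets),      # expected to be 0
        "max_vmnic_mbps": peak,
        "avg_vmnic_mbps": mean(vmnic_mbps),
        "headroom_mbps": link_capacity_mbps - peak,  # buffer left for spikes
    }


# Hypothetical data center: 4 VMs, 3 host uplinks on 10 GbE.
summary = network_summary([0, 0, 0, 0], [3000, 4000, 8000], 10_000)
print(summary)
```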

Conclusion

Indeed, a few line charts are all you need to do capacity management in SDDC. I am aware it is not a fully automated solution. However, my customers found it logical and easy to understand. It follows an 80/20 principle, where you are given the 20% room to make the judgement call as the expert.

We will cover the actual super metric examples in Part 7, scheduled for publication this month.