
Capacity Management in SDDC – Part 5


Capacity Management policy is interlinked with the Performance Management and Availability Management policies. As shown in the diagram below, both Performance and Availability management drive your Capacity Management. The Capacity Management threshold is the lower of the two. For both policies, you naturally have different service tiers. The availability of a mission-critical VM is certainly much higher than that of a development VM. The same goes for performance: you will not accept any form of resource contention for a mission-critical VM, but will accept contention in a development environment, as cost matters more there.

Capacity Management 01

It is common for an enterprise to have 3 service tiers. For simplicity, I will call them:

  1. Tier 1: This is the highest, most important tier. All your mission-critical VMs are placed here.
  2. Tier 2: This is the middle tier. The majority of production VMs are placed here.
  3. Tier 3: This is the lowest tier. The majority of test & development VMs are placed here.

Avoid having more than 3 tiers. Even in a large environment (>100,000 VMs), keep it at 3 tiers. The more tiers you have, the more confusing it is for your customers (the Application team). The positioning of each tier must be clear, and having too many tiers blurs that positioning.

Performance: Service Definition

Let’s look at an example of a Performance service tier. I say an example because your policy as an IaaS provider may vary. You need to describe (or define) the service for each of the 4 infrastructure components (CPU, RAM, Disk, and Network). For each, list all the properties that impact the quality of the service. The table below provides an example for a server VM. For VDI, you would need a different definition.

Let’s go through the above table.

  • CPU and RAM
    • Notice I do not have an oversubscription ratio. I do not define something like “1.5x CPU oversubscription” or “2x RAM oversubscription”. This is because oversubscription is an incomplete policy: it fails to take utilization into account. I have seen this at customers, where the higher tier performs worse than the lower tier. Once you oversubscribe, you can no longer guarantee consistent performance. Contention can happen even if your ESXi utilization is not high.
    • I use contention to quantify the SLA. The chance of contention goes up as the tier gets lower, so Tier 3 has a higher threshold.
    • Tier 1 has no oversubscription. There is enough CPU and RAM for every VM in the host, so no VM needs to wait or contend for resources. As a result, reservations are not applicable.
    • I specify that all the hosts in a Tier 1 cluster are identical: the CPU generation and speed are the same on every host. This makes performance predictable. I do not make such a guarantee in Tier 3. The cluster may start with 4 identical nodes, but over time it may grow to 16 nodes, and those 16 nodes will certainly not be identical in terms of performance, as the newer nodes will sport faster CPUs.
  • Storage
    • The performance SLA is set at 10 ms. I use a 5-minute average, as this is a good balance between catching real problems and smoothing out momentary spikes.
    • In Tier 1, the disk is thick provisioned, so there is no performance penalty on the first write. I do not provide the same service quality in the lower tiers.
  • Network
    • I do not distinguish between tiers here, to keep the service simple. Also, you should not expect dropped packets in any tier.
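To make the contention-based SLA concrete, here is a minimal sketch of how you could flag VMs that breach their tier's contention threshold. The threshold values, field names, and VM names are all illustrative assumptions, not figures from the table above:

```python
# Hypothetical per-tier contention SLAs (percent). Tier 1 tolerates none,
# because there is no oversubscription; the other values are illustrative.
CONTENTION_SLA_PERCENT = {
    "tier1": 0.0,
    "tier2": 1.0,
    "tier3": 4.0,
}

def sla_breaches(vms):
    """Return the names of VMs whose observed CPU contention exceeds
    their tier's SLA. vms: list of dicts with 'name', 'tier',
    'cpu_contention_pct'."""
    return [
        vm["name"]
        for vm in vms
        if vm["cpu_contention_pct"] > CONTENTION_SLA_PERCENT[vm["tier"]]
    ]

vms = [
    {"name": "erp-db", "tier": "tier1", "cpu_contention_pct": 0.2},
    {"name": "web-01", "tier": "tier2", "cpu_contention_pct": 0.5},
    {"name": "dev-77", "tier": "tier3", "cpu_contention_pct": 6.1},
]
# erp-db breaches Tier 1's zero tolerance; dev-77 breaches the Tier 3 limit
print(sla_breaches(vms))
```

Note how the same small contention number is a breach in Tier 1 but acceptable in Tier 2: the SLA, not an oversubscription ratio, defines the tier.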

With the above definitions, you have clear 3-tier services based on Performance. Let’s now cover the Availability service.

Availability: Service Definition

As you know well, a mission-critical VM has to be better protected than a development VM. You would go the extra mile to ensure redundancy. In the event of a failure, you also want to cap the degree of the damage. You want to minimize the chance of human error too. The table below provides such an example. I specify both the maximum number of VMs in a host and in the cluster. You can choose just one of them if that is good enough for you.

Capacity Management 03

Capacity Management

Based on the above 2 service definitions (Performance and Availability), you can already tell that capacity management becomes easier.

Tier 1

  • This becomes much simpler. Your ESXi hosts will have low utilization most of the time. Even if all VMs are running at 100%, the ESXi host will not hit contention, as there is enough physical resource for everyone. As a result, there is no need to track contention, because there won’t be any. This means you do not need to check capacity against the Performance SLA; you just need to check against the Availability SLA.
  • To help you monitor, you can create the following charts:
    • A line chart showing the total number of vCPUs left in the cluster.
    • A line chart showing the total amount of vRAM left in the cluster.
    • A line chart showing the total number of VMs left in the cluster.
    • A line chart showing the maximum storage latency experienced by any VM in the cluster.
    • A line chart showing the disk capacity left in the datastore cluster.
      • I’m assuming you use a datastore cluster here, which you should.
      • Ideally, map your datastore cluster to compute cluster 1:1.
  • These are line charts, so you can see the trend across time. The time range should be at least 1 month, preferably 3 months. This allows you to spot trends early, as it takes weeks to buy hardware.
  • Guidelines:
    • If the number left is getting low for CPU, RAM, or disk capacity, it’s time to add capacity. How low you can let it go depends on how fast you can procure hardware.
    • If the latency is approaching your threshold for disk, it’s time to add IOPS.
  • For each line chart, you need to create a super metric.
    • This allows you to set an alert on it.
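As a sketch of what those Tier 1 super metrics compute, the example below derives the vCPUs, vRAM, and VM slots left in a cluster. Because Tier 1 allows no oversubscription, headroom is simple subtraction of allocation from physical capacity. All field names and limits here are hypothetical:

```python
# Minimal sketch of the Tier 1 "capacity left" super metrics.
# Every field name and number is an illustrative assumption.

def tier1_headroom(cluster):
    """Compute the remaining capacity in a no-oversubscription cluster."""
    vcpu_used = sum(vm["vcpu"] for vm in cluster["vms"])
    vram_used = sum(vm["vram_gb"] for vm in cluster["vms"])
    return {
        # Performance side: physical capacity minus what is allocated
        "vcpu_left": cluster["physical_cores"] - vcpu_used,
        "vram_left_gb": cluster["physical_ram_gb"] - vram_used,
        # Availability side: the cap on total VMs per cluster
        "vm_slots_left": cluster["max_vms"] - len(cluster["vms"]),
    }

cluster = {
    "physical_cores": 128,
    "physical_ram_gb": 1024,
    "max_vms": 50,
    "vms": [{"vcpu": 8, "vram_gb": 64}] * 10,  # ten identical VMs
}
print(tier1_headroom(cluster))
# {'vcpu_left': 48, 'vram_left_gb': 384, 'vm_slots_left': 40}
```

Plotting each of these values over time gives exactly the line charts listed above; an alert fires when any of them approaches your procurement lead-time buffer.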

Tier 2 and 3

  • This is more complex, as contention is now possible. You now need to check against both the Performance SLA and the Availability SLA.
  • To help you monitor, you can create the following:
    • Line chart showing the maximum & average CPU contention experienced by any VM in the cluster.
    • Line chart showing the maximum & average RAM contention experienced by any VM in the cluster.
    • Line chart showing the total number of VMs left in the cluster.
    • Line chart showing the maximum & average storage latency experienced by any VM in the cluster.
    • Line chart showing the disk capacity left in the datastore cluster.
  • Guidelines
    • See Tier 1 and apply accordingly.
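The maximum & average contention numbers that these charts plot can be sketched as a simple aggregation over per-VM samples (the sample values below are made up):

```python
# Hypothetical sketch of the Tier 2/3 super metrics: for one point in time,
# aggregate per-VM contention into the two numbers each chart plots.

def contention_stats(contention_pct_by_vm):
    """contention_pct_by_vm: per-VM contention samples in percent.
    Returns the worst-hit VM's value and the cluster-wide average."""
    return {
        "max": max(contention_pct_by_vm),
        "avg": sum(contention_pct_by_vm) / len(contention_pct_by_vm),
    }

samples = [0.3, 1.2, 0.0, 4.8, 0.7]  # invented per-VM CPU contention (%)
print(contention_stats(samples))  # {'max': 4.8, 'avg': 1.4}
```

The maximum tells you whether any single VM is breaching its SLA; the average tells you whether the cluster as a whole is trending toward trouble. You need both, since a healthy average can hide one badly contended VM.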

Network (all tiers)

  • It’s 2015. You should be on 10 GE, as the chance of an ESXi host saturating a 1 GE link is not something you can ignore. The chance of an ESXi host saturating 2x 10 GE links is quite low, unless you run vSphere FT or VSAN (or another form of distributed storage).
  • To help you monitor, you can create the following:
    • A line chart showing the maximum network packet drops at the physical data center level. I use the physical data center level as the hosts eventually share the same core switches.
    • A line chart showing the maximum and average ESXi vmnic utilization at the same level as above.
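A hypothetical sketch of that data-center-level aggregation: take the worst packet-drop count across all ESXi hosts that share the same core switches. The data center names and values are invented:

```python
# Illustrative sketch: roll per-host packet-drop counters up to the
# physical data center, since hosts there share the same core switches.

def max_drops_by_datacenter(hosts):
    """hosts: list of dicts with 'datacenter' and 'dropped_packets'.
    Returns the worst drop count seen in each data center."""
    worst = {}
    for host in hosts:
        dc = host["datacenter"]
        worst[dc] = max(worst.get(dc, 0), host["dropped_packets"])
    return worst

hosts = [
    {"datacenter": "dc-east", "dropped_packets": 0},
    {"datacenter": "dc-east", "dropped_packets": 12},
    {"datacenter": "dc-west", "dropped_packets": 3},
]
print(max_drops_by_datacenter(hosts))  # {'dc-east': 12, 'dc-west': 3}
```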

What are your thoughts? Please share your comments and questions below.


2 comments have been added so far

  1. Nice post. However, some additional points to consider:

    1) Right-sizing the VMs. In benchmarks, 80% of VMs are oversized on vCPU. Analyze your VM usage patterns, preferably using an Nth-percentile analysis (N depends on the tier: the 99th percentile for the highest tier, the 90th for the lowest), and identify areas where you can reduce allocations. Employ a governance approach to ensure that resources are assigned appropriately. In fact, it is the poor sizing of VMs that most likely accounts for the (accurate) observation here that host utilization is often low.

    2) Capacity Management has to be directed towards parameters you can easily control. It also has to employ statistical analytics to ensure that peaks are accounted for. Temporal (time-based) variations in workload have a significant effect on effective capacity management. You should ensure that you are planning for those peaks, and balancing workloads effectively.

    3) Capacity management for heterogeneous environments is more complicated, since the MHz rating of the machines does not directly translate to the actual capacity available. This is also true when you’re planning to provision new hosts or migrate a cluster. Modelling based on industry-standard benchmarks can be up to 50% more accurate than using MHz. For more on this – visit this cloud capacity blog article.
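The percentile-based right-sizing described in the commenter's point 1 could be sketched as follows. The 99th/90th percentiles per tier come from the comment; the Tier 2 percentile, the nearest-rank method, and all names are assumptions added for illustration:

```python
import math

# Percentile of historical demand to size each tier against. The Tier 1
# and Tier 3 values follow the comment above; Tier 2 is an assumption.
PERCENTILE_BY_TIER = {"tier1": 99, "tier2": 95, "tier3": 90}

def percentile(samples, pct):
    """Nearest-rank percentile of a list of numeric samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def rightsized_vcpu(cpu_demand_cores, tier):
    """Suggest a vCPU count covering the tier's percentile of demand.
    cpu_demand_cores: historical per-sample CPU demand in cores."""
    return math.ceil(percentile(cpu_demand_cores, PERCENTILE_BY_TIER[tier]))

# Invented 5-minute demand samples for one VM, in cores
demand = [1.2, 1.5, 0.8, 2.1, 1.0, 1.1, 0.9, 1.3, 3.6, 1.4]
print(rightsized_vcpu(demand, "tier3"))  # 3 (vs a rare 3.6-core peak)
```

A lower tier tolerates clipping the occasional peak, so it sizes to a lower percentile, which is how right-sizing reclaims the capacity behind the low host utilization the commenter mentions.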
