Capacity Management in SDDC

The Capacity Management policy is interlinked with the Performance Management and Availability Management policies. As shown in the diagram below, both Performance and Availability management drive your Capacity Management. The Capacity Management threshold is the lower of the two. For both policies, you naturally have different service tiers. The availability of a mission-critical VM must certainly be much higher than that of a development VM. The same goes for performance. You will not accept any form of resource contention for a mission-critical VM, but you will accept contention in a development environment, where cost is more important.
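
A minimal sketch of the "lower of the two" rule, in Python. Expressing each policy's limit as a VM count, and the names `ClusterLimits`, `performance_max_vms` and `availability_max_vms`, are assumptions made for illustration; your own policy may express the limits differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLimits:
    """Hypothetical per-cluster limits, each derived from one policy."""
    performance_max_vms: int   # most VMs before the performance policy is breached
    availability_max_vms: int  # most VMs the availability policy allows

    def capacity_threshold(self) -> int:
        # Capacity is driven by whichever policy is stricter.
        return min(self.performance_max_vms, self.availability_max_vms)

    def remaining(self, current_vms: int) -> int:
        # Never report negative headroom; an overfull cluster has none left.
        return max(0, self.capacity_threshold() - current_vms)


limits = ClusterLimits(performance_max_vms=400, availability_max_vms=300)
print(limits.capacity_threshold())  # availability is the binding constraint here
print(limits.remaining(250))
```

In this made-up example the cluster could run 400 VMs on performance grounds, but the availability policy caps it at 300, so 300 is the number to plan against.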

It is common for an enterprise to have 3 service tiers. For simplicity, I will call them:

Tier 1: This is the highest, most important tier. All your mission critical VMs are placed here.
Tier 2: This is the middle tier. The majority of production VMs are placed here.
Tier 3: This is the lowest tier. The majority of test & development VMs are placed here.

Avoid having more than 3 tiers. Even in a large environment (>100,000 VMs), keep it at 3 tiers. The more tiers you have, the more confusing it is for your customers (the Application team). The positioning of each tier must be clear, and having too many tiers blurs this positioning.

Performance: Service Definition

Let’s look at an example of a Performance service tier. It is only an example, as your policy as an IaaS provider may vary. You need to describe (or define) the service for each of the 4 infrastructure components (CPU, RAM, Disk, and Network). For each, list all the properties that impact the quality of the service. The table below provides an example for server VMs. VDI needs a different definition.

Let’s go through the above table.

CPU and RAM
- Notice I do not have an over subscription ratio. I do not define something like “1.5x CPU Over Subscription” or “2x RAM Over Subscription”. This is because over subscription is an incomplete policy. It fails to take utilization into account. I’ve seen this at customers, where the higher tier performs worse than the lower tier. Once you oversubscribe, you are no longer able to guarantee consistent performance. Contention can happen even if your ESXi utilization is not high.
- I use contention to quantify the SLA. The chance of contention goes up as the tier gets lower. Tier 3 has a higher threshold.
- Tier 1 has no over subscription. There is enough CPU and RAM for every VM in the host. No VM needs to wait or contend for resources. As a result, reservation is not applicable.
- I specify that all the hosts in a Tier 1 cluster are also identical. That means the CPU generation and speed are identical. This makes performance predictable. I do not make such a guarantee in Tier 3. The cluster may start with 4 identical nodes, but over time it may grow to 16 nodes. The 16-node cluster is certainly not uniform in performance, as the newer nodes will sport faster CPUs.
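
To make the contention-based SLA concrete, here is a minimal sketch in Python. The threshold values and the names `CPU_CONTENTION_LIMIT_PCT` and `vms_breaching_sla` are placeholders invented for illustration, not the values from the table. Only the shape (a higher threshold for a lower tier) comes from the points above.

```python
# Hypothetical limits: percent of CPU contention tolerated per tier.
# Replace these with the figures from your own service definition.
CPU_CONTENTION_LIMIT_PCT = {1: 1.0, 2: 3.0, 3: 10.0}


def vms_breaching_sla(tier, contention_pct_by_vm):
    """Return the names of VMs whose contention exceeds the tier's limit."""
    limit = CPU_CONTENTION_LIMIT_PCT[tier]
    return sorted(
        name for name, pct in contention_pct_by_vm.items() if pct > limit
    )


observed = {"app01": 0.5, "db01": 2.0, "web01": 4.5}
print(vms_breaching_sla(1, observed))  # judged against the strictest limit
print(vms_breaching_sla(3, observed))  # same readings, most relaxed limit
```

The same readings are judged against the tier's own limit, so a VM that breaches the Tier 1 SLA can be perfectly acceptable in Tier 3. Note that the check looks at what each VM actually experienced, not at how many vCPUs were allocated per core.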
Storage
- The performance SLA is set at 10 ms of disk latency. I use a 5-minute average as this is a good balance.
- In Tier 1, the disk is thick provisioned, so there is no performance penalty on the first write. I do not provide the same service quality in the lower tiers.
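
The 5-minute average can be sketched as a rolling window over latency samples. This is an illustration under stated assumptions: the 20-second sample interval, the function name `latency_breaches`, and treating anything above 10 ms as a breach are assumptions, not part of the policy above.

```python
from collections import deque

SLA_MS = 10.0                  # latency SLA from the storage policy
WINDOW_SECONDS = 300           # 5-minute averaging window
SAMPLE_INTERVAL_SECONDS = 20   # assumption: one latency sample every 20 s
SAMPLES_PER_WINDOW = WINDOW_SECONDS // SAMPLE_INTERVAL_SECONDS


def latency_breaches(samples_ms):
    """Return (sample_index, average_ms) for each full window above the SLA."""
    window = deque(maxlen=SAMPLES_PER_WINDOW)
    breaches = []
    for index, sample in enumerate(samples_ms):
        window.append(sample)  # the oldest sample drops out automatically
        if len(window) == SAMPLES_PER_WINDOW:
            average = sum(window) / len(window)
            if average > SLA_MS:
                breaches.append((index, average))
    return breaches


short_spike = [5.0] * 14 + [50.0]    # one slow sample in an otherwise quiet window
heavy_spike = [5.0] * 14 + [100.0]   # a spike large enough to lift the average
print(latency_breaches(short_spike))
print(latency_breaches(heavy_spike))
```

Averaging over the window means a single 50 ms sample does not count as a breach on its own, while sustained or very large latency does.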
Network
- I do not differentiate between tiers here, to keep the service simple. Also, you should expect no dropped packets at any time, in any tier.

With the above definition, you have a clear 3-tier service based on Performance. Let’s now cover the Availability service.

Availability: Service Definition

As you know well, a mission-critical VM has to be better protected than a development VM. You would go the extra mile to ensure redundancy. In the event of failure, you also want to cap the extent of the damage. You want to minimize the chance of human error too. The table below provides such an example. I specify the maximum number of VMs both in a host and in the cluster. You can choose just one of the two if that is good enough for you.
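
A minimal sketch of how the two limits could be checked, in Python. The class `AvailabilityPolicy`, the function `violations`, and the numbers used are hypothetical and are not taken from the table.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilityPolicy:
    """Hypothetical caps that bound the damage of a host or cluster failure."""
    max_vms_per_host: int
    max_vms_per_cluster: int


def violations(policy: AvailabilityPolicy, vms_per_host: dict[str, int]) -> list[str]:
    """Describe every host, and the cluster as a whole, that exceeds its cap."""
    problems = []
    for host, count in sorted(vms_per_host.items()):
        if count > policy.max_vms_per_host:
            problems.append(
                f"{host}: {count} VMs exceeds host limit {policy.max_vms_per_host}"
            )
    total = sum(vms_per_host.values())
    if total > policy.max_vms_per_cluster:
        problems.append(
            f"cluster: {total} VMs exceeds cluster limit {policy.max_vms_per_cluster}"
        )
    return problems


tier1 = AvailabilityPolicy(max_vms_per_host=10, max_vms_per_cluster=25)  # made-up values
for line in violations(tier1, {"esx01": 9, "esx02": 12, "esx03": 8}):
    print(line)
```

If you enforce only one of the two limits, set the other high enough that it never triggers.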