
Understanding Reserved Capacity Concepts in vSAN

In 2020, the release of vSAN 7 U1 introduced new capabilities that helped reduce the amount of free capacity required for a vSAN cluster to perform transient activities and accommodate host failures. We paired these improvements with a new “Reserved Capacity” capability in the product, which replaced the free capacity recommendations (known as “slack space”) given for vSAN 7 and earlier.

While this enhancement increased the usable capacity for vSAN clusters running vSAN 7 U1 and newer, there has been some confusion about the Reserved Capacity feature as it relates to design and operations. This post attempts to provide clarity on the matter.

Reserved Capacity Overview

For vSAN 7 U1 and newer, “Reserved Capacity” is an umbrella term referring to the capacity needed for transient activities and host failures. Reserved Capacity consists of two parameters: Operations Reserve and Host Rebuild Reserve. Each can be enabled via a toggle in vCenter Server, and together they serve as a management tool to help enforce good capacity management practices.

Figure 1. Reserved Capacity toggles found in the “Reservations and Alerts” section of the vSAN Capacity Overview.

Why are they optional? The Reserved Capacity feature was designed for the most common environmental and topology conditions. Public cloud and VMware Cloud Provider (VCPP) partners may have different operational practices that do not require this type of capacity safeguard. Private cloud topologies such as stretched clusters, 2-node clusters, and clusters using fault domains are not capable of using the Reserved Capacity mechanism at this time. In those configurations, customers should size for the amount of free capacity based on guidance for previous versions of vSAN.

When the toggles are enabled, vSAN will change the amount of free capacity advertised on the cluster and will impose soft thresholds that prevent provisioning activities from consuming the Reserved Capacity. Hitting these thresholds does not in any way prevent I/O activities from continuing.

When the toggles are not enabled, vSAN will not reserve any of this capacity and will present it as free capacity for use by VMs or transient activities. While vSAN will continue to use built-in mechanisms that prevent undesirable behavior in near-full cluster conditions, it will not safeguard against insufficient capacity resulting from a host failure or larger operational efforts. For the best operational experience, it is highly recommended to enable the Reserved Capacity toggles when possible.

Operations Reserve

The “Operations Reserve” (OR) setting accounts for the capacity needed to perform transient storage activities like storage policy changes, rebalancing, and other activities. The amount reserved will make some general assumptions about object size, and will also consider the following:

  • Raw size of the capacity devices used in the hosts.
  • The number of capacity devices per host.
  • The number of disk groups per host.
  • The use of the cluster-based Deduplication & Compression service.

The vSAN Design Guide and Operations Guide provide several examples of how the total reserved capacity is decreased as the cluster size is increased – a distinct difference compared to the legacy “slack space” calculation that used a fixed percentage regardless of cluster size. The examples referenced above assumed a set of deployment variables, with only the cluster size adjusted – to demonstrate the new sizing behavior. As a result, many thought that the Operations Reserve remained the same across all clusters. This is not true. The Operations Reserve does indeed vary depending on the hardware configuration and software services configured for a vSAN cluster. Let’s look at some examples.

Cluster host configuration examples | Size of capacity devices | # of capacity devices per host | # of disk groups per host | DD&C enabled? | Operations Reserve
Host config example #1              | 4TB                      | 2                              | 1                         | No            | 17%
Host config example #2              | 4TB                      | 4                              | 2                         | No            | 12%
Host config example #3              | 4TB                      | 8                              | 2                         | No            | 10%
Host config example #4              | 4TB                      | 8                              | 2                         | Yes           | 8%
Host config example #5              | 4TB                      | 8                              | 2                         | No            | 7%
Host config example #6              | 4TB                      | 8                              | 2                         | Yes           | 6%

Figure 2. Examples of how Operations Reserve changes based on the hardware specifications of a host and cluster services enabled.

This calculation is fully independent of the Host Rebuild Reserve. The cluster size is also omitted from the above examples because it does not influence the calculation for Operations Reserve.
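To make the percentages in Figure 2 more tangible, here is a minimal sketch that converts a few of the examples into terabytes per host. It assumes the Operations Reserve percentage applies to the raw capacity of the host, which is a simplification for illustration; the function name and the data layout are hypothetical and not part of any vSAN tool.

```python
# Illustrative only: assumes the Operations Reserve percentage applies to
# the raw capacity of a host. Values are taken from Figure 2.
EXAMPLES = {
    # example number: (device size in TB, devices per host, Operations Reserve %)
    1: (4, 2, 17),
    2: (4, 4, 12),
    3: (4, 8, 10),
}


def operations_reserve_tb(device_tb: float, devices_per_host: int, or_pct: float) -> float:
    """Return the capacity (TB) set aside per host for Operations Reserve."""
    raw_tb = device_tb * devices_per_host
    return raw_tb * or_pct / 100


for number, (device_tb, devices, or_pct) in EXAMPLES.items():
    raw_tb = device_tb * devices
    reserved_tb = operations_reserve_tb(device_tb, devices, or_pct)
    print(f"Example #{number}: {raw_tb} TB raw per host, {reserved_tb:.2f} TB reserved")
```

Under this assumption, the percentage falls from example #1 to example #3 while the absolute amount reserved per host still grows, because the raw capacity per host grows faster than the percentage shrinks.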

We can see from the examples above that the percentage for Operations Reserve can vary quite a bit depending on the specifications of the host. These results assume that all hosts within a cluster have the same hardware specifications. For more information on the uniformity of hosts across a vSAN cluster, see the post: Asymmetrical vSAN Clusters – What is allowed, and What is Smart.

Recommendation: Use capacity devices of approximately 4TB or larger and multiple capacity devices per host. Larger capacity devices and more capacity devices can dramatically reduce the percentage required for Operations Reserve. Disk group quantity and DD&C status have a lesser impact.

Host Rebuild Reserve

The “Host Rebuild Reserve” (HRR) setting accounts for the capacity needed to absorb a sustained failure of a single host in a vSAN cluster, in support of an N+1 cluster design strategy. The percentage reserved reflects the capacity of one host relative to the total host count of the cluster. The Host Rebuild Reserve behaves like any N+1 design strategy, where the percentage of resources that a single host contributes to the cluster decreases as the host count increases. For example, the Host Rebuild Reserve for a 4-node cluster would be 25%, while it would be just 8% for a 12-node cluster. As shown in Figure 3, the rate of reduction diminishes significantly once the cluster exceeds about 12 hosts.

Figure 3. Percentage of capacity used for Host Rebuild Reserve based on total number of hosts in cluster.
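The N+1 relationship described above can be sketched in a few lines. This assumes all hosts contribute the same capacity, so one host represents 1/N of the cluster; the function name is hypothetical and is not a vSAN API.

```python
def host_rebuild_reserve_pct(host_count: int) -> float:
    """Percentage of cluster capacity contributed by one host (N+1 design).

    Assumes every host contributes the same amount of capacity.
    """
    if host_count < 2:
        raise ValueError("an N+1 design needs at least two hosts")
    return 100.0 / host_count


# Show how the savings flatten out as the cluster grows.
for hosts in (4, 8, 12, 16, 24, 32):
    print(f"{hosts:>2} hosts: {host_rebuild_reserve_pct(hosts):.1f}%")
```

Going from 4 to 12 hosts cuts the reserve from 25% to roughly 8%, while going from 12 to 32 hosts saves only about five more percentage points, which is the diminishing return visible in Figure 3.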

While the Host Rebuild Reserve calculation is fully independent of the Operations Reserve, the Host Rebuild Reserve toggle can only be enabled if the Operations Reserve toggle is also enabled. The Host Rebuild Reserve is calculated based on N+1, that is, tolerating the loss of the capacity of a single host. This shouldn’t prevent you from designing for N+2 or greater if there are requirements to do so.
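The dependency between the two toggles can be modeled as a small validation rule. The following is a simplified sketch for reasoning about the behavior, not how vCenter Server implements it; the class, field, and function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReservedCapacityToggles:
    """Hypothetical model of the two Reserved Capacity toggles."""
    operations_reserve: bool = False
    host_rebuild_reserve: bool = False

    def __post_init__(self) -> None:
        # HRR can only be enabled when OR is also enabled.
        if self.host_rebuild_reserve and not self.operations_reserve:
            raise ValueError(
                "Host Rebuild Reserve requires Operations Reserve to be enabled"
            )


def reserved_pct(toggles: ReservedCapacityToggles, or_pct: float, hrr_pct: float) -> float:
    """Sum only the reserves whose toggles are enabled."""
    total = 0.0
    if toggles.operations_reserve:
        total += or_pct
    if toggles.host_rebuild_reserve:
        total += hrr_pct
    return total


# Example: 10% Operations Reserve on an 8-host cluster (12.5% HRR).
print(reserved_pct(ReservedCapacityToggles(True, False), 10.0, 12.5))
print(reserved_pct(ReservedCapacityToggles(True, True), 10.0, 12.5))
```

The two percentages are computed independently and simply added; only the ability to enable the second toggle depends on the first.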

Recommendation: Aim for cluster host counts large enough to lower the percentage needed for the Host Rebuild Reserve, while recognizing there are diminishing returns on the HRR savings as the cluster host count increases. This will provide the most efficient, yet agile cluster design.

The result is that depending on the configuration and size of the cluster, it is possible to see a total Reserved Capacity (OR + HRR) of under 10%. This is a dramatic difference compared to the guidance offered in versions of vSAN older than vSAN 7 U1.

Determining the amount of Reserved Capacity for a Cluster

To determine the capacity required for Reserved Capacity (whether it be Operations Reserve alone, or Operations Reserve + Host Rebuild Reserve) in your new cluster, use the vSAN ReadyNode Sizer. For VxRail customers, a VxRail Sizer can be used that will produce the same result. Relying on these tools exclusively helps account for the following:

  • Ongoing optimizations to vSAN. As vSAN evolves, so do design and sizing recommendations. This ensures the calculations are made based on the version of vSAN used.
  • Exceptions and corner case scenarios. In some extreme circumstances (such as very small clusters and small host configurations), the Reserved Capacity (OR + HRR) may calculate out to greater than 30%. In these cases, vSAN will NOT need greater than 30%, and the sizing tools can accurately reflect this consideration.
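As a rough illustration of the corner case above, the total can be thought of as the sum of the two reserves with a ceiling. This is a sketch of that reading only: the 30% figure comes from the text, but treating it as a simple cap on the sum is an assumption, and the sizing tools remain the authoritative source. The names are hypothetical.

```python
MAX_RESERVED_PCT = 30.0  # ceiling taken from the text; applying it as a simple cap is an assumption


def total_reserved_pct(or_pct: float, hrr_pct: float) -> float:
    """Return OR + HRR, limited to the assumed 30% ceiling."""
    return min(or_pct + hrr_pct, MAX_RESERVED_PCT)


# Small cluster, small hosts: 17% OR with 4 hosts (25% HRR) sums to 42%.
print(total_reserved_pct(17.0, 100 / 4))

# Large cluster, dense hosts: 6% OR with 32 hosts (3.125% HRR) stays under 10%.
print(total_reserved_pct(6.0, 100 / 32))
```

The first case lands on the ceiling rather than on the raw sum, while the second shows how a total under 10% can come about.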

Once a cluster is in production, you can see how each toggle affects the reported “free capacity” by turning it off or on as desired. The capacity view of vSAN in vCenter Server does not change the raw capacity advertised, but rather the free capacity reported.

Note that if you are testing this functionality on a very small nested vSAN cluster, enabling Reserved Capacity may have no impact, as it was not designed for the extremely small capacity devices created in nested environments.

The recommendations provided by the sizer do not account for snapshotting activity or growth of thin-provisioned volumes. Just as with traditional storage, it is up to the customer to determine how much capacity is needed for growth of their thin-provisioned volumes, as well as for snapshotting activity. This may include short-term snapshot tasks such as VADP-based backup solutions, or ongoing snapshot tasks through CI/CD or VDI workflows.

Summary

The “Reserved Capacity” mechanism in vSAN 7 U1 and later is a powerful capacity management tool to help ensure a cluster has sufficient resources for transient activities and sustained host failure scenarios. The amount of Reserved Capacity required for a cluster can be easily determined through the vSAN ReadyNode Sizer, and when enabled in an existing cluster, can be found easily in vCenter Server.

@vmpete