vSAN “slack space” is simply free space that is set aside for operations such as host maintenance mode data evacuation, component rebuilds, rebalancing operations, and VM snapshots. Activities such as rebuilds and rebalancing can temporarily consume additional raw capacity. Host maintenance mode temporarily reduces the total amount of raw capacity a cluster has. This is because the local drives on a host that is in maintenance mode do not contribute to vSAN datastore capacity until the host exits maintenance mode. We will dig into this more in another vSAN Operations article.
A number of sources recommend 25-30% slack space when designing and running a vSAN cluster. For example, a vSAN datastore with 20TB of raw capacity should always have 5-6TB of free space available for use as slack space. This recommendation is not exclusive to vSAN. Most other HCI storage solutions follow similar recommendations to allow for fluctuations in capacity utilization. We will, of course, focus on vSAN in this article. More specifically, this article covers one more very good reason to maintain that slack space: storage policy changes.
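The 25-30% guideline is simple arithmetic, and a small helper makes it easy to apply to any cluster size. This is a minimal sketch: the function name and parameters are hypothetical and not part of any vSAN tooling.

```python
def slack_space_range_tb(raw_capacity_tb, low=0.25, high=0.30):
    """Return the (minimum, maximum) free space in TB to keep as slack.

    The 25-30% defaults come from the recommendation quoted above.
    """
    return raw_capacity_tb * low, raw_capacity_tb * high


# The 20TB example from the text
low_tb, high_tb = slack_space_range_tb(20)
print(f"Keep {low_tb:.0f}-{high_tb:.0f} TB free on a 20 TB datastore")
```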
There are a couple of cases where storage policy changes can temporarily consume more capacity. One scenario is when a new policy that requires a change in component number and/or layout is assigned to a VM. Another scenario is when an existing storage policy that is assigned to one or more VMs is modified. In both cases, vSAN will use additional capacity to make the necessary changes to components to comply with the assigned storage policy.
Consider the following example…
A 100GB virtual disk is assigned a storage policy that includes the rules Primary Level of Failures to Tolerate = 1 and Failure Tolerance Method = RAID-1 mirroring. vSAN creates two full mirrors, or “replicas”, of the virtual disk and places them on separate hosts. Each replica consists of one component. A witness component is also created, but we will not factor that in because witness components are very small, typically around 2MB. The two replicas for the 100GB virtual disk object consume up to 200GB of raw capacity ("up to" because objects on a vSAN datastore are “thin provisioned” by default). To keep this example simple, we will assume deduplication and compression are not enabled.
A new storage policy is created with the rules Primary Level of Failures to Tolerate = 1 and Failure Tolerance Method = RAID-5/6 erasure coding. The new policy is assigned to that same 100GB virtual disk. vSAN begins copying the existing mirrored components to a new set of components distributed in a RAID-5 erasure coding configuration. Data integrity and availability are maintained because the mirrored components continue to serve reads and writes while the new RAID-5 component set is built. This process naturally consumes additional raw capacity as the new components are built. Once the new components are completely built, IO is transferred to the new components and the old mirrored components are deleted. The new RAID-5 component set stores the data plus parity and consumes up to 133GB of raw capacity. This means all of the components for this object could consume as much as 333GB of raw capacity (200GB for the mirrors plus 133GB for the RAID-5 set) just before the resync is complete and the old RAID-1 mirrored components are deleted. After the RAID-1 components are deleted, the capacity they consumed is automatically freed up for use by other operations.
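The numbers in this example can be reproduced with a short calculation. The sketch below assumes a fully written disk, no deduplication or compression, and a RAID-5 layout of three data components plus one parity component (a 4/3 overhead, which matches the 133GB figure above). The function names are illustrative only.

```python
def raid1_raw_gb(size_gb, ftt=1):
    """RAID-1 mirroring keeps ftt + 1 full replicas of the object."""
    return size_gb * (ftt + 1)


def raid5_raw_gb(size_gb):
    """RAID-5 assumed here as 3 data + 1 parity, i.e. 4/3 of the object size."""
    return size_gb * 4 / 3


disk_gb = 100
old_layout = raid1_raw_gb(disk_gb)   # existing mirrored components
new_layout = raid5_raw_gb(disk_gb)   # components being built
peak = old_layout + new_layout       # both sets exist until the resync completes

print(f"RAID-1: {old_layout:.0f} GB, RAID-5: {new_layout:.0f} GB, peak: {peak:.0f} GB")
```

The peak is transient: once the mirrored components are deleted, consumption settles at the RAID-5 figure, which is lower than where it started.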
As you can imagine, performing this storage policy change on multiple VMs concurrently could cause a considerable amount of additional raw capacity to be consumed. Likewise, if a storage policy that is assigned to many VMs is modified, more capacity will likely be needed temporarily to make the necessary changes to components that make up these VMs. This is one more reason it is important to maintain sufficient slack space in a vSAN cluster. This is especially true if changes occur frequently and/or these changes impact multiple VMs at the same time.
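To get a feel for how this adds up across many VMs, the sketch below estimates the worst-case extra raw capacity needed when several virtual disks are converted from RAID-1 to RAID-5 at the same time. It assumes every disk is fully written, that all new component sets are built concurrently, and the same 4/3 RAID-5 overhead used in the example above. The helper names are hypothetical, and in practice the actual peak may be lower.

```python
RAID5_OVERHEAD = 4 / 3  # assumed 3 data + 1 parity layout


def transient_raw_gb(disk_sizes_gb, new_overhead=RAID5_OVERHEAD):
    """Extra raw capacity consumed while the new component sets are built."""
    return sum(size * new_overhead for size in disk_sizes_gb)


def fits_in_free_space(disk_sizes_gb, free_gb):
    """True if the worst-case transient consumption fits in current free space."""
    return transient_raw_gb(disk_sizes_gb) <= free_gb


disks = [100] * 10  # ten 100GB virtual disks changed at once
print(f"Extra raw capacity needed: {transient_raw_gb(disks):.0f} GB")
print("Fits in 1000 GB free:", fits_in_free_space(disks, 1000))
```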
The vSAN Health UI produces warnings when disk space utilization is higher than 80%. I also recommend a vCenter alarm that produces an alert if vSAN datastore utilization is higher than 70%.
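The two thresholds can be expressed as a simple check. This is only an illustration of the logic, not a vCenter or vSAN API: the function and the status labels are made up for this example, and the capacity figures would have to come from your own monitoring.

```python
HEALTH_WARNING_PCT = 80  # vSAN Health warning threshold mentioned above
ALARM_PCT = 70           # recommended vCenter alarm threshold


def classify_utilization(used_tb, raw_tb):
    """Map datastore utilization to a status based on the two thresholds."""
    pct = used_tb / raw_tb * 100
    if pct > HEALTH_WARNING_PCT:
        return "health-warning"
    if pct > ALARM_PCT:
        return "alarm"
    return "ok"


# 15TB used of 20TB raw is 75%: above the alarm, below the health warning
print(classify_utilization(15, 20))
```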
@jhuntervmware