Preparing a host for maintenance is quite simple in a vSphere cluster. With vSphere DRS enabled, you click a host in the vSphere Web Client, select Maintenance Mode, and click Enter Maintenance Mode. vSphere automatically migrates running virtual machines off the host and places the host in maintenance mode. This removes the host's computing resources from the cluster so that an administrator can replace hardware, update firmware, upgrade vSphere, and perform other maintenance activities without disruption.
vSAN brings additional considerations to placing a host into maintenance mode because the local storage devices in a host contribute to the vSAN datastore. Storage resources provided by a host in maintenance mode are inaccessible until the host exits maintenance mode, so vSAN datastore capacity is temporarily reduced.
Fortunately, vSAN includes “what-if” reporting to inform administrators of the impact to the vSAN cluster when placing a host in maintenance mode. Below is an example of this reporting.
The “evacuate all data to other hosts” option is commonly used for longer-term maintenance, such as when a host will be offline for more than an hour. We can see that even if we evacuate all data from this host to other hosts, there is still sufficient capacity for all of the objects on the vSAN datastore. However, this reporting does not take into account whether there are enough fault domains in the cluster. That is what we will focus on for the remainder of this article.
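To make the capacity side of this "what-if" question concrete, here is a minimal Python sketch. It is not how vSAN computes its report; it only illustrates the basic idea of comparing used capacity with the capacity that remains once a host is fully evacuated. The host names, capacity figures, and the function `capacity_after_evacuation` are hypothetical, and the sketch ignores details such as slack space and per-device limits.

```python
def capacity_after_evacuation(host_capacity_tb, host_used_tb, leaving):
    """Return (remaining_capacity, total_used, fits) if `leaving` is fully evacuated.

    Simplified illustration only: all used data must fit on the hosts that
    stay in the cluster. Real what-if reporting considers more than this.
    """
    remaining = sum(cap for host, cap in host_capacity_tb.items() if host != leaving)
    total_used = sum(host_used_tb.values())
    return remaining, total_used, total_used <= remaining


# Hypothetical four-host cluster, values in TB.
capacity = {"esx01": 10.0, "esx02": 10.0, "esx03": 10.0, "esx04": 10.0}
used = {"esx01": 4.0, "esx02": 5.0, "esx03": 3.0, "esx04": 4.0}

remaining, total_used, fits = capacity_after_evacuation(capacity, used, "esx01")
print(f"remaining={remaining} TB, used={total_used} TB, fits={fits}")
```

A check like this can pass while the evacuation still fails, because it says nothing about how many fault domains remain.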
Each host in a vSAN cluster is considered an implicit fault domain. vSAN components, the chunks of data that make up objects such as virtual disks, are spread across hosts to maintain availability if a drive or host fails. Fault domains that include multiple hosts can also be configured explicitly. An example of this is eight vSAN hosts in four server racks, two hosts in each, with a fault domain configured for each server rack. vSAN distributes components across server racks (fault domains) to provide resiliency against drive, host, and rack failures.
vSAN maintenance mode does not currently verify and report whether there are enough hosts or fault domains in a cluster to facilitate the evacuation of all data from a host. Consider a cluster with four hosts and storage policies that use RAID-1 mirroring and RAID-5 erasure coding. As shown in the next diagram, migrating a component in a RAID-1 configuration (blue) can be done, but vSAN cannot migrate the component in the RAID-5 configuration (green) as there is no other host to migrate it to without violating the storage policy.
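Until the reporting covers this case, the check can be approximated by hand. The Python sketch below is an illustration, not a vSAN API: the function names, host names, and policy labels are hypothetical. It assumes the commonly documented minimums of 2 × FTT + 1 fault domains for RAID-1 mirroring, four for RAID-5, and six for RAID-6, and it simply counts the fault domains that still contain a host after one host leaves.

```python
def min_fault_domains(raid, ftt=1):
    """Minimum fault domains a policy needs (assumed minimums, see lead-in)."""
    if raid == "RAID-1":
        return 2 * ftt + 1          # replicas plus witness components
    if raid == "RAID-5" and ftt == 1:
        return 4                    # 3 data + 1 parity
    if raid == "RAID-6" and ftt == 2:
        return 6                    # 4 data + 2 parity
    raise ValueError(f"unsupported policy: {raid}, FTT={ftt}")


def can_fully_evacuate(fault_domains, leaving_host, policies):
    """Map each policy name to True if enough fault domains remain.

    fault_domains: dict of fault domain name -> list of host names.
    policies: dict of policy name -> (raid, ftt).
    Capacity is deliberately not considered here.
    """
    remaining = sum(
        1 for hosts in fault_domains.values()
        if any(host != leaving_host for host in hosts)
    )
    return {
        name: remaining >= min_fault_domains(raid, ftt)
        for name, (raid, ftt) in policies.items()
    }


# Four hosts, each an implicit fault domain, as in the example above.
cluster = {h: [h] for h in ("esx01", "esx02", "esx03", "esx04")}
policies = {"mirror": ("RAID-1", 1), "erasure": ("RAID-5", 1)}

result = can_fully_evacuate(cluster, "esx01", policies)
print(result)
```

With four hosts, the RAID-1 policy still has the three fault domains it needs, while the RAID-5 policy is left one short, which matches the failure described above. Adding a fifth host to `cluster` makes both checks pass.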
As a result, the process of entering maintenance mode fails with this error:
The workaround is to use the “Ensure data accessibility from other hosts” option, but this means a temporary reduction in resiliency for any objects with components located on the host going into maintenance mode. These objects will not be compliant with their assigned storage policy after the host enters maintenance mode.
A better approach is to provision an additional host or fault domain to facilitate data migration in longer-term maintenance scenarios.
In summary, maintenance mode reporting in a vSAN cluster considers capacity but not fault domains when providing “what-if” information. Administrators should keep this in mind and include an additional host or fault domain in their vSAN cluster design for maintenance scenarios.
@jhuntervmware