
A Closer Look at vSAN Maintenance Mode

vSAN 6.7 U1 introduces several enhancements and improved safeguards for maintenance and decommissioning activities on vSAN hosts. vSAN Maintenance Mode now has a pre-check simulation, along with new warnings when another host is already in maintenance mode or a resync is in progress. In addition, the object repair timer, previously available only in the CLI, is now available in the UI. This post takes a closer look at these enhancements.

vSphere administrators have long enjoyed the simplicity of the Maintenance Mode feature, which is commonly used for planned downtime activities such as firmware updates, storage device replacement, and software patches. Assuming VMware vSphere Distributed Resource Scheduler (DRS) is enabled and fully automated, Maintenance Mode will migrate virtual machines from the host entering maintenance mode to other hosts in the cluster.

Maintenance Mode Considerations for vSAN

Since each vSAN host in a cluster contributes to the cluster storage capacity, entering a host into maintenance mode involves an additional set of tasks compared to a traditional architecture. For this reason, vSAN administrators are presented with three host maintenance mode options:

  • Full data migration – evacuate all of the components to other hosts in the cluster
  • Ensure accessibility – evacuate only enough components to ensure that virtual machines can continue to run, although some objects may be non-compliant with their respective storage policies
  • No data migration – do not evacuate any components from this host

Full data migration

This option moves all of the vSAN components from the host entering maintenance mode to other hosts in the vSAN cluster. This option is commonly used when a host will be offline for an extended period of time or permanently decommissioned.

Ensure accessibility 

vSAN will verify whether an object remains accessible even though one or more components will be absent due to the host entering maintenance mode. If the object will remain accessible, vSAN will not migrate the component(s). If the object would become inaccessible, vSAN will migrate the necessary number of components to other hosts ensuring that the object remains accessible. This option is the default and it is commonly used when the host will be offline for just a short amount of time, e.g., a host reboot. It minimizes the amount of data that is migrated while ensuring all objects remain accessible. However, the level of failure tolerance will likely be reduced for some objects until the host exits maintenance mode. vSAN’s object manager will wait 60 minutes (by default) before it tries to initiate any resynchronizations to regain the level of resilience originally assigned by the policy.
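The decision described above can be sketched in a few lines of Python. This is a simplified illustration, not vSAN's actual implementation: it assumes an object stays accessible as long as more than half of its component votes remain available, and it ignores other conditions such as the need for a complete copy of the data. The names `Component`, `is_accessible`, and `components_to_migrate`, and the host names, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    host: str          # host that currently stores this component
    votes: int = 1     # vote weight (assumed to be 1 per component here)
    size_mb: int = 0   # amount of data held by the component


def is_accessible(components, absent_host):
    """Simplified rule: accessible if a strict majority of votes remains."""
    total = sum(c.votes for c in components)
    available = sum(c.votes for c in components if c.host != absent_host)
    return available * 2 > total


def components_to_migrate(components, host):
    """Return the components on `host` that must move to keep the object accessible.

    Components are moved smallest first, and only until the vote majority is
    restored, mirroring the idea of migrating just the necessary number.
    """
    on_host = sorted((c for c in components if c.host == host),
                     key=lambda c: c.size_mb)
    total = sum(c.votes for c in components)
    available = total - sum(c.votes for c in on_host)
    moved = []
    for component in on_host:
        if available * 2 > total:
            break  # object is already accessible, stop migrating
        moved.append(component)
        available += component.votes
    return moved


# RAID-0: a single component, so it must be migrated off the host.
raid0 = [Component("esx01", size_mb=440)]
# RAID-1 with one failure tolerated: two replicas plus a witness on three hosts.
raid1 = [Component("esx01", size_mb=100),
         Component("esx02", size_mb=100),
         Component("esx03", size_mb=0)]

print(len(components_to_migrate(raid0, "esx01")))  # 1
print(len(components_to_migrate(raid1, "esx01")))  # 0
```

In this model the RAID-1 object needs no data movement, but it runs with reduced failure tolerance until the host returns or the repair timer expires.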

No data migration

Data is not migrated from the host as it enters maintenance mode, regardless of the level of failures to tolerate set in the storage policy. Technically, this option can also be used when the host will be offline for a short period of time, but since no data is migrated, virtual machines with no data redundancy (i.e., RAID-0) will be unavailable. This option is primarily used for a full cluster shutdown.

Pete Koehler recently wrote an in-depth post explaining the differences between the vSAN Maintenance Mode migration options and comparing their behavior with a failure tolerance method (FTM) of RAID-1 and RAID-5.

EMM Pre-check Simulation

vSAN 6.7 U1 Maintenance Mode now performs a full simulation of data movement to determine if the Enter Maintenance Mode (EMM) action will succeed or fail before it even starts.  This will prevent unnecessary data movement, and provide a result more quickly to the administrator.

In the example below vm01 is RAID-0 (no data redundancy) while vm02 and vm03 are RAID-1 with a failure tolerance level greater than 1. Choosing Ensure accessibility has the following result:

  1. 440 MB will be moved (the objects belonging to vm01 have no redundancy and need to be migrated to stay accessible)
  2. 30 objects will be non-compliant with storage policy (all other VMs are RAID-1 and will still be accessible without being migrated, but temporarily non-compliant with storage policy until the host returns)
Figure 1: EMM Ensure accessibility

Choosing the No data migration option has the following result:

  • 3 objects belonging to vm01 will be inaccessible because there is no redundant copy of these objects anywhere.
  • The remaining 30 objects (belonging to vm02, vm03, and other VMs) will remain accessible but non-compliant with their respective storage policies.
Figure 2: No data migration

Clicking See detailed report will show detailed results of the EMM action.  The results show vm01 will be inaccessible while the remaining VMs will be accessible but non-compliant until the host is out of maintenance mode.

Figure 3: Detailed report
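To make the numbers in the example above easier to follow, here is a minimal Python sketch of the kind of summary the pre-check produces for each option. It is an illustration only and does not reflect how vSAN computes the simulation. The `Mode`, `ObjectOnHost`, and `precheck` names are hypothetical, the split of the 440 MB across vm01's three objects is assumed, and the 100 MB size of each redundant object is invented purely so the full data migration case has something to add up.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FULL_DATA_MIGRATION = "full"
    ENSURE_ACCESSIBILITY = "ensure"
    NO_DATA_MIGRATION = "none"


@dataclass(frozen=True)
class ObjectOnHost:
    name: str
    size_mb: int          # data this object keeps on the host entering maintenance
    has_redundancy: bool  # True if the object survives without this host (e.g. RAID-1)


@dataclass(frozen=True)
class PrecheckResult:
    moved_mb: int
    non_compliant: int
    inaccessible: int


def precheck(objects, mode):
    """Summarize what would happen to each object for the chosen option."""
    moved_mb = non_compliant = inaccessible = 0
    for obj in objects:
        if mode is Mode.FULL_DATA_MIGRATION:
            moved_mb += obj.size_mb      # everything is evacuated
        elif obj.has_redundancy:
            non_compliant += 1           # stays accessible, but not policy compliant
        elif mode is Mode.ENSURE_ACCESSIBILITY:
            moved_mb += obj.size_mb      # must move to stay accessible
        else:
            inaccessible += 1            # no copy anywhere else, nothing moved
    return PrecheckResult(moved_mb, non_compliant, inaccessible)


# vm01: three RAID-0 objects totalling 440 MB (the per-object split is assumed).
objects = [ObjectOnHost("vm01-a", 400, False),
           ObjectOnHost("vm01-b", 30, False),
           ObjectOnHost("vm01-c", 10, False)]
# 30 redundant objects; the 100 MB size is a made-up value.
objects += [ObjectOnHost(f"obj-{i}", 100, True) for i in range(30)]

print(precheck(objects, Mode.ENSURE_ACCESSIBILITY))
# PrecheckResult(moved_mb=440, non_compliant=30, inaccessible=0)
print(precheck(objects, Mode.NO_DATA_MIGRATION))
# PrecheckResult(moved_mb=0, non_compliant=30, inaccessible=3)
```

The first two results match Figure 1 and Figure 2: 440 MB moved and 30 non-compliant objects for Ensure accessibility, and 3 inaccessible plus 30 non-compliant objects for No data migration.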

vSAN 6.7 U1 also introduces new warnings for EMM activities that alert the administrator when another host is already in maintenance mode or when resync activity is currently in progress. All of these improvements enhance the overall experience and predictability of host decommissioning activities such as entering a host into maintenance mode.

Object Repair Timer

When a host is in maintenance mode (unless the Full data migration option was selected), the objects on the host are considered "absent." vSAN waits 60 minutes before initiating a rebuild of the absent objects on the remaining hosts, because it cannot be certain whether the absence is transient or permanent. If the host will be in maintenance mode for only a few minutes, it does not make sense to completely rebuild all of the objects on a different host. If the host will be offline for longer than the default 60 minutes, the administrator can modify the vSAN Object repair timer. As of vSAN 6.7 U1, this setting is available in the UI. It is now a cluster-wide setting and applies to all hosts in the cluster.

Configure > vSAN > Services > Advanced Options

Figure 4: Object Repair Timer
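The timer behavior described in this section boils down to a simple comparison, sketched below in Python. This is only a model of the rule, not vSAN code. The function name `should_start_rebuild` and the timestamps are hypothetical, and the 60-minute default is the value stated above.

```python
from datetime import datetime, timedelta

# Default object repair timer described in this post.
DEFAULT_REPAIR_DELAY = timedelta(minutes=60)


def should_start_rebuild(absent_since, now, repair_delay=DEFAULT_REPAIR_DELAY):
    """Return True once components have been absent for at least the repair delay."""
    return now - absent_since >= repair_delay


entered_maintenance = datetime(2018, 11, 1, 9, 0)

# A 20-minute reboot finishes well inside the default window: no rebuild.
print(should_start_rebuild(entered_maintenance,
                           entered_maintenance + timedelta(minutes=20)))   # False

# A 90-minute firmware update exceeds the default, so a rebuild would begin...
print(should_start_rebuild(entered_maintenance,
                           entered_maintenance + timedelta(minutes=90)))   # True

# ...unless the timer was raised beforehand, here to 120 minutes.
print(should_start_rebuild(entered_maintenance,
                           entered_maintenance + timedelta(minutes=90),
                           repair_delay=timedelta(minutes=120)))           # False
```

The third call shows why raising the timer ahead of a longer maintenance window avoids a rebuild that would be unnecessary once the host returns.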

Canceling Maintenance Mode

vSAN 6.7 U1 improves the ability to cancel all operations related to a previous EMM event. In previous versions of vSAN, customers who started an EMM, canceled it, and then started again on another host could introduce unnecessary resynchronization traffic. Versions prior to 6.7 U1 would stop the management task, but not necessarily the queued resynchronization activities. This has been addressed in vSAN 6.7 U1. When the cancel operation is initiated, active resyncs will likely continue, but all resyncs related to that event that are still pending in the queue will be canceled.
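The cancel semantics can be pictured with a small Python model of a resync queue. This is a conceptual sketch under stated assumptions, not a description of vSAN internals: the `Resync` class, the `cancel_emm` function, the state names, and the event identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resync:
    event_id: str   # the EMM event that triggered this resync
    component: str
    state: str      # "active", "pending", or "canceled"


def cancel_emm(queue, event_id):
    """Cancel pending resyncs for one EMM event; leave active ones running.

    Returns the number of resyncs that were canceled.
    """
    canceled = 0
    for task in queue:
        if task.event_id == event_id and task.state == "pending":
            task.state = "canceled"
            canceled += 1
    return canceled


queue = [
    Resync("emm-esx01", "comp-1", "active"),    # already copying data
    Resync("emm-esx01", "comp-2", "pending"),
    Resync("emm-esx01", "comp-3", "pending"),
    Resync("repair-42", "comp-9", "pending"),   # unrelated to the EMM event
]

print(cancel_emm(queue, "emm-esx01"))   # 2
print([t.state for t in queue])
# ['active', 'canceled', 'canceled', 'pending']
```

Only the queued work tied to the canceled event is dropped, so starting a new EMM on another host does not have to compete with leftover resyncs from the first attempt.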

Summary

When entering a host into maintenance mode, there are several things to consider, such as how long the host will be in maintenance mode and which failure tolerance methods are assigned to the VMs residing on the host. The “Ensure accessibility” option should be viewed as a flexible way to accommodate host updates and restarts. Planned events such as maintenance mode activities and unplanned events such as host outages may make the effective storage policy condition different from the assigned policy. vSAN constantly monitors this, and when resources become available to fulfill the rules of the policy, it will adjust the data accordingly.

@vPedroArrow