With the recent launch announcement for Virtual SAN 6.2, we have been revisiting some of our previous guidance and best practices around Virtual SAN operations. Possibly the most important of these is disk device handling and replacement procedures. This has historically been tied to host maintenance mode for reasons that I shall explain shortly. Those of you familiar with Virtual SAN will know there are three host maintenance mode options:

  • Full Data Migration – evacuate all of the components to other hosts in the cluster
  • Ensure Accessibility – evacuate enough components to ensure that virtual machines can continue to run, albeit at risk
  • No Data Migration – do not evacuate any components from this host

Back in the initial release of Virtual SAN, we did not have the ability to evacuate data from individual disks or disk groups; the only way to evacuate the data from a disk or disk group was to use host maintenance mode. The one drawback with this approach is that a host could have multiple disks or multiple disk groups, and one could understand the reluctance to evacuate all of the data on a host just to replace one disk. As a result, some customers did not evacuate the data when replacing disks. They simply removed the disk, which caused its components to go degraded, replaced the disk, and let the components rebuild. This approach led to a number of complications. In some cases customers did not realize that there was an underlying issue in the cluster, that a policy which did not protect the VM was in use (Number of failures to tolerate = 0), or that other hosts in the cluster were in maintenance mode. In other cases, the cluster was hit with another issue during the disk replacement/rebuild time frame. All of these impacted the availability of the virtual machines, because there were more failures than the VM policy was defined to tolerate. Where customers had wiped the disk(s) holding the one remaining healthy component, data was permanently lost as a result.
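To make the failure scenario above concrete, here is a minimal sketch in Python of the question an administrator would want answered before pulling a disk without evacuating it: does any object have its last healthy copy on that disk? This is only an illustration, not a Virtual SAN tool. The function name, the object and disk names, and the "healthy"/"absent" state labels are all hypothetical, and the model ignores witness components.

```python
def objects_at_risk(objects, disk_to_remove):
    """Return the names of objects whose only healthy replica is on disk_to_remove.

    `objects` maps an object name to a dict of {disk name: replica state}.
    This data model is hypothetical and ignores witness components.
    """
    at_risk = []
    for name, replicas in objects.items():
        # Healthy replicas that would survive the removal of the disk.
        healthy_elsewhere = [
            disk for disk, state in replicas.items()
            if state == "healthy" and disk != disk_to_remove
        ]
        if disk_to_remove in replicas and not healthy_elsewhere:
            at_risk.append(name)
    return at_risk


# Hypothetical cluster state.
objects = {
    # Number of failures to tolerate = 1, both copies healthy.
    "vm-a.vmdk": {"disk-1": "healthy", "disk-7": "healthy"},
    # Number of failures to tolerate = 0, a single copy.
    "vm-b.vmdk": {"disk-1": "healthy"},
    # Second copy sits on a host that is in maintenance mode.
    "vm-c.vmdk": {"disk-1": "healthy", "disk-9": "absent"},
}

print(objects_at_risk(objects, "disk-1"))  # ['vm-b.vmdk', 'vm-c.vmdk']
```

In this example only vm-a.vmdk would survive the removal of disk-1 without evacuation; the other two objects match the cases described above.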

Because of such incidents, VMware is making the following recommendations on disk replacement procedures on Virtual SAN. If customers wish to replace a disk or a disk group on Virtual SAN and also ensure that all of their VMs remain available after the replacement, VMware now strongly recommends completely evacuating all of the data in the disk/disk group first, even though this data may already be replicated and protected elsewhere in the cluster.

In the first release of Virtual SAN (5.5), in order to adhere to this guidance, administrators should:

  1. Place the host with the disk into maintenance mode, and select Full data migration.
  2. Decommission/Remove the disk groups and replace the appropriate drives.

One consideration is that in order to do a ‘Full data migration’ there must be enough spare capacity and enough fault domains in the cluster during the disk replacement process. If that is not an option, the safest thing to do is to reconfigure the VMs in the cluster with a policy setting of Number of failures to tolerate = 0, and repeat the steps above. This will ensure that the data component associated with the VM is active and healthy, and will be moved between disk groups when ‘Full data migration’ is selected. Once the disk has been replaced and all hosts are back in the cluster, set the policy back to the correct Number of failures to tolerate and give the cluster time to reconfigure/resync the objects.
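The capacity and fault domain consideration above can be sketched as a rough pre-check. This is a minimal illustration in Python, not a VMware tool: the `Host` record, its fields, and `can_fully_evacuate` are hypothetical names. It assumes mirrored objects that need 2n + 1 fault domains to tolerate n failures, treats each host as its own fault domain, and pools free space across hosts. Real component placement is more involved, so a passing result should be read as necessary rather than sufficient.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical record; the numbers would come from your own inventory.
    name: str
    capacity_gb: float
    used_gb: float
    in_maintenance: bool = False


def can_fully_evacuate(hosts, target_name, failures_to_tolerate):
    """Rough check: can the host `target_name` be fully evacuated to the others?"""
    target = next(h for h in hosts if h.name == target_name)
    # Only hosts that remain in the cluster and are not in maintenance mode
    # can receive evacuated components.
    others = [h for h in hosts if h.name != target_name and not h.in_maintenance]

    # Assumption: mirroring needs 2n + 1 fault domains to tolerate n failures,
    # and each host counts as one fault domain.
    enough_fault_domains = len(others) >= 2 * failures_to_tolerate + 1

    # Assumption: free space is pooled; real placement is per component.
    free_gb = sum(h.capacity_gb - h.used_gb for h in others)
    enough_capacity = free_gb >= target.used_gb

    return enough_fault_domains and enough_capacity


# Hypothetical four-host cluster.
cluster = [
    Host("esx-01", 4000, 2500),
    Host("esx-02", 4000, 2600),
    Host("esx-03", 4000, 2400),
    Host("esx-04", 4000, 2700),
]

print(can_fully_evacuate(cluster, "esx-01", failures_to_tolerate=1))      # True
print(can_fully_evacuate(cluster[:3], "esx-01", failures_to_tolerate=1))  # False
print(can_fully_evacuate(cluster[:3], "esx-01", failures_to_tolerate=0))  # True
```

The last two calls mirror the situation described above: with only three hosts, evacuating one leaves too few fault domains for Number of failures to tolerate = 1, while a temporary policy of 0 lets the check pass.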

However, since 6.0, administrators have had the ability to evacuate individual disks or disk groups. This avoids the need to place a complete host in maintenance mode and evacuate what could be a considerable amount of data. Users should therefore ensure that the Evacuate Data option is selected when removing a disk or a disk group. This moves all data off the device or group before replacement. Here are some screenshots showing the evacuate data checkbox selected when removing a disk and a disk group:

[Screenshot: Remove disk group]

[Screenshot: Remove disk]

The same consideration might arise if there are not enough resources in the cluster to evacuate the data. From vSphere 6.0 onwards, there is an esxcli command to decommission disks or disk groups in reduced availability mode before replacing them. Here is the command to do the same:

So you might ask, when would an administrator use the maintenance mode option “Ensure accessibility”? This maintenance mode is strictly intended for software upgrades or node reboots. It is not intended for situations where data will be wiped from the disks on the node.

And finally, when would one use “No Data Migration”? This is typically only used when completely shutting down a Virtual SAN cluster (or maybe for a non-intrusive operation like a quick reboot).
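As a recap, the guidance in this post can be written down as a small lookup table. The sketch below is in Python purely for illustration; the operation labels and the `recommended_mode` helper are hypothetical names, not part of any VMware API. Note that from 6.0 onwards a disk or disk group can be evacuated on its own, so the host-level mode for disk replacement mainly applies to 5.5.

```python
from enum import Enum


class MaintenanceMode(Enum):
    FULL_DATA_MIGRATION = "Full Data Migration"
    ENSURE_ACCESSIBILITY = "Ensure Accessibility"
    NO_DATA_MIGRATION = "No Data Migration"


# Hypothetical operation labels mapped to the guidance in this post.
RECOMMENDED_MODE = {
    # Disk replacement on 5.5: evacuate everything first.
    "replace_disk": MaintenanceMode.FULL_DATA_MIGRATION,
    "replace_disk_group": MaintenanceMode.FULL_DATA_MIGRATION,
    # Data on the disks stays intact, so accessibility is enough.
    "software_upgrade": MaintenanceMode.ENSURE_ACCESSIBILITY,
    "host_reboot": MaintenanceMode.ENSURE_ACCESSIBILITY,
    # Whole cluster is going down; there is nowhere to migrate to.
    "cluster_shutdown": MaintenanceMode.NO_DATA_MIGRATION,
}


def recommended_mode(operation):
    """Return the maintenance mode this post suggests for `operation`."""
    try:
        return RECOMMENDED_MODE[operation]
    except KeyError:
        raise ValueError(f"no guidance recorded for operation: {operation!r}")


print(recommended_mode("replace_disk").value)      # Full Data Migration
print(recommended_mode("software_upgrade").value)  # Ensure Accessibility
print(recommended_mode("cluster_shutdown").value)  # No Data Migration
```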