Product Announcements

VMware Virtual SAN Operations: Replacing Disk Devices

VSAN-Ops-LogoIn my previous Virtual SAN operations article, “VMware Virtual SAN Operations: Disk Group Management” I covered the configuration and management of the Virtual SAN disk groups, and in particular I described the recommended operating procedures for managing Virtual SAN disk groups.

In this article, I will take a similar approach and cover the recommended operating procedures for replacing flash and magnetic disk devices. In Virtual SAN, drives can be replaced for two reasons; failures, and upgrades. Regardless of the reason whenever a disk device needs to be replaced, it is important to follow the correct decommissioning procedures.

Replacing a Failed Flash Device

The failure of flash device renders an entire disk group inaccessible (i.e. in the “Degraded” state) to the cluster along with its data and storage capacity.  One important observation to highlight here is that a single flash device failure doesn’t necessarily mean that the running virtual machines will incur outages. As long as the virtual machines are configured with a VM Storage Policy with “Number of Failures to Tolerate” greater than zero, the virtual machine objects and components will be accessible.  If there is available storage capacity within the cluster, then in a matter of seconds the data resynchronization operation is triggered. The time for this operation depends on the amount of data that needs to be resynchronized.

When a flash device failure occurs, before physically removing the device from a host, you must decommission the device from Virtual SAN. The decommission process performs a number of operations in order to discard disk group memberships, deletes partitions and remove stale data from all disks. Follow either of the disk device decommission procedure defined below.

Flash Device Decommission Procedure from the vSphere Web Client

  1. Log on to the vSphere Web Client
  2. Navigate to the Hosts and Clusters view and select the cluster object
  3. Go to the manage tab and select Disk management under the Virtual SAN section
  4. Select the disk group with the failed flash device
  5. Select the failed flash device and click the delete button

Note: In the event the disk claim rule settings in Virtual SAN is set to automatic the disk delete option won’t be available in the UI. Change the disk claim rule to “Manual” in order to have access to the disk delete option.

Flash Device Decommission Procedure from the CLI (ESXCLI) (Pass-through Mode)

  1. Log on to the host with the failed flash device via SSH
  2. Identify the device ID of failed flash device
    • esxcli vsan storage list

SSD-UUID

  1. delete the failed flash device from the disk group
    • esxcli vsan storage remove -s <device id>

SSD-UUID-CLI

Note: Deleting a failed flash device will result in the removal of the disk group and all of it’s members.

  1. Remove the failed flash device from the host
  2. Add a new flash device to host and wait for the vSphere hypervisor to detect it, or perform a device rescan.

Note: These step are applicable when the storage controllers are configured in pass-though mode and support hardware hot-plug feature.

Upgrading a Flash Device

Before upgrading the flash device, you should ensure there is enough storage capacity available within the cluster to accommodate all of the currently stored data in the disk group, because you will need to migrate data off that disk group.

To migrate the data before decommissioning the device, place the host in maintenance mode and choose the suitable data migration option for the environment. Once all the data is migrated from the disk group, follow the flash device decommission procedures before removing the drive from the host.

Replacing a Failed Magnetic Disk Devices

Each magnetic disk is accountable for the storage capacity it contributes to a disk group and the overall Virtual SAN datastore. Similar to flash, magnetic disk devices can be replaced for failures or upgrade reasons. The impact imposed by a failure of a magnetic disk is smaller when compared to the impact presented by the failure of a flash device. The virtual machines remain online and operational for the same reasons described above in the flash device failure section.  The resynchronization operation is significantly less intensive than a flash device failure. However, again the time depends on the amount of data to be resynchronized.

As with flash devices, before removing a failed magnetic device from a host, decommission the device from Virtual SAN first. The action allows Virtual SAN to perform the required disk group and devices maintenance operations as well as allow the subsystem components to update the cluster capacity and configuration settings.

vSphere Web Client Procedure (Pass-through Mode)

  1. Login to the vSphere Web Client
  2. Navigate to the Hosts and Clusters view and select the Virtual SAN enabled cluster
  3. Go to the manage tab and select Disk management under the Virtual SAN section
  4. Select the disk group with the failed magnetic device
  5. Select the failed magnetic device and click the delete button

Note: It is possible to perform decommissioning operations from ESXCLI in batch mode if required. The use of the ESXCLI does introduces a level of complexity that should be avoided unless thoroughly understood. It is recommended to perform these types of operations using the vSphere Web Client until enough familiarity is gained with them.

Magnetic Device Decommission Procedure from the CLI (ESXCLI) (Pass-through Mode)

  1. Login to the host with the failed flash device via SSH
  2. Identify the device ID of failed magnetic device
    • esxcli vsan storage listmag-change
  3. delete the magnetic device from the disk group
    • esxcli vsan storage remove -d <device id>HDD-UUID-CLI
  4.  Add a new magnetic device to the host and wait for the vSphere hypervisor to detect it, or perform a device rescan.

Upgrading a Magnetic Disk Device

Before upgrading any of the magnetic devices ensure there is enough usable storage capacity available within the cluster to accommodate the data from the device that is being upgraded. The data migration can can be initiated by placing the host in maintenance mode and choosing a suitable data migration option for the environment. Once all the data is offloaded from the disks, proceed with the magnetic disk device decommission procedures.

In this particular scenario, it is imperative to first decommission the magnetic disk device before physically removing from the host. If the disk is removed from the host without performing the decommissioning procedure, data that is cached from that disk will end up being permanently stored in the cache layer. This could reduce the available amount of cache and eventually impact the performance of the system.

Note: The disk device replacement procedures discussed in this article are entirely based on storage controllers configured in pass-through mode. In the event the storage controllers are configured in a RAID0 mode, follow the manufactures instructions for adding and removing disk devices.

– Enjoy

For future updates on Virtual SAN (VSAN), Virtual Volumes (VVols), and other Software-defined Storage technologies as well as vSphere + OpenStack be sure to follow me on Twitter: @PunchingClouds