Yes, at first glance, you may be forgiven for thinking that this subject hardly warrants a blog post. But those of you who have suffered the consequences of an All Paths Down (APD) condition will know why it is so important.
Let's recap on what APD actually is.
APD is when there are no longer any active paths to a storage device from the ESX host, yet the host continues to try to access that device. When hostd tries to open a disk device, it sends a number of commands, such as read capacity and read requests to validate the partition table. If the device is in APD, these commands are retried until they time out. The problem is that hostd is responsible for a number of other tasks as well, not just opening devices. One such task is ESX to vCenter communication, and if hostd is blocked waiting for a device to open, it may not respond to these other tasks in time. One consequence is that you might observe your ESX hosts disconnecting from vCenter.
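To make the blocking effect concrete, here is a toy model of it in Python. It is only a sketch and not how hostd is actually implemented: the single serial worker, the 5 second retry interval, the 40 second timeout and all of the function names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    duration_s: float  # how long the task occupies the single worker


def open_device_duration(paths_up, retry_interval_s=5.0, timeout_s=40.0, normal_s=0.1):
    """Time spent opening a device (all values are hypothetical).

    With a live path the open returns quickly. With no paths (APD),
    every attempt fails and is retried until the timeout is reached.
    """
    if paths_up:
        return normal_s
    elapsed = 0.0
    while elapsed < timeout_s:
        elapsed += retry_interval_s  # another failed attempt
    return elapsed


def completion_times(tasks):
    """Run the tasks one after another and record when each one finishes."""
    clock = 0.0
    finished = {}
    for task in tasks:
        clock += task.duration_s
        finished[task.name] = clock
    return finished


def heartbeat_delay(paths_up):
    """Seconds until a vCenter heartbeat queued behind a device open completes."""
    tasks = [
        Task("open_device", open_device_duration(paths_up)),
        Task("vcenter_heartbeat", 0.01),
    ]
    return completion_times(tasks)["vcenter_heartbeat"]


if __name__ == "__main__":
    print("heartbeat delay, paths up:", heartbeat_delay(True))
    print("heartbeat delay, APD:     ", heartbeat_delay(False))
```

In this model the heartbeat itself is cheap; it is late only because it sits behind a device open that keeps retrying, which is the pattern that can lead to a host showing as disconnected.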
We have made a number of improvements to how we handle APD conditions over the last few releases, but prevention is better than cure, so I wanted to use this post to highlight once again the best practices for removing a LUN from an ESX host and avoiding APD:
ESX/ESXi 4.1
Improvements in 4.1 mean that hostd now checks whether a VMFS datastore is accessible before issuing I/Os to it. This is an improvement, but it doesn't help with I/Os that are already in flight when an APD occurs. The best practices for removing a LUN from an ESX 4.1 host, as described in detail in KB 1029786, are as follows:
- Unregister all objects from the datastore including VMs and Templates
- Ensure that no 3rd party tools are accessing the datastore
- Ensure that no vSphere features, such as Storage I/O Control, are using the device
- Mask the LUN from the ESX host by creating new rules in the PSA (Pluggable Storage Architecture)
- Physically unpresent the LUN from the ESX host using the appropriate array tools
- Rescan the SAN
- Clean up the rules created earlier to mask the LUN
- Unclaim any paths left over after the LUN has been removed
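For a feel of what the masking and cleanup steps above involve on the command line, here is a small Python sketch that only builds and prints the commands; it does not run anything. The `esxcli corestorage` and `esxcfg-rescan` invocations are written from memory of the 4.1 tooling, so treat the exact options as assumptions and check them against KB 1029786. The rule number 192 and the path vmhba2:C0:T0:L2 are made-up examples, and the helper names are hypothetical.

```python
def lun_location(adapter, channel, target, lun):
    """Location arguments identifying one path (adapter, channel, target, LUN)."""
    return ["-t", "location", "-A", adapter,
            "-C", str(channel), "-T", str(target), "-L", str(lun)]


def mask_commands(rule_id, adapter, channel, target, lun):
    """Commands to run BEFORE the LUN is unpresented on the array (assumed syntax)."""
    loc = lun_location(adapter, channel, target, lun)
    return [
        # Add a claim rule handing this path to the MASK_PATH plugin
        ["esxcli", "corestorage", "claimrule", "add",
         "--rule", str(rule_id)] + loc + ["-P", "MASK_PATH"],
        # Load the new rule into the running rule set
        ["esxcli", "corestorage", "claimrule", "load"],
        # Release the path from its current plugin, then re-run the rules
        ["esxcli", "corestorage", "claiming", "unclaim"] + loc,
        ["esxcli", "corestorage", "claimrule", "run"],
    ]


def cleanup_commands(rule_id, adapter, channel, target, lun):
    """Commands to run AFTER the LUN has been unpresented (assumed syntax)."""
    loc = lun_location(adapter, channel, target, lun)
    return [
        # Rescan the adapter so the host notices the LUN is gone
        ["esxcfg-rescan", adapter],
        # Remove the masking rule created earlier and reload the rule set
        ["esxcli", "corestorage", "claimrule", "delete", "--rule", str(rule_id)],
        ["esxcli", "corestorage", "claimrule", "load"],
        # Unclaim any path left over for the removed LUN
        ["esxcli", "corestorage", "claiming", "unclaim"] + loc,
    ]


if __name__ == "__main__":
    example = (192, "vmhba2", 0, 0, 2)  # hypothetical rule number and path
    for cmd in mask_commands(*example) + cleanup_commands(*example):
        print(" ".join(cmd))
```

A LUN with several paths would need one rule per path, which is a large part of why the procedure is so easy to get wrong.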
Now this is a rather complex set of instructions to follow. Fortunately, we have made things a little easier with 5.0.
ESXi 5.0
The first thing to mention in 5.0 is that we have introduced a new Permanent Device Loss (PDL) condition, which can help alleviate some of the conditions that previously caused APD. But you could still run into APD if you don't correctly remove a LUN from the ESX host. There are details in the post about the enhancements made in the UI and the CLI to make the removal of a LUN easier, and there are KB articles that go into even greater detail.
To avoid the rather complex set of instructions that you needed to follow in 4.1, VMware introduced new detach and unmount operations to the vSphere UI and the CLI.
As per KB 2004605, to avoid an APD condition in 5.0, all you need to do now is to detach the device from the ESX host. This will automatically unmount the VMFS volume first. If there are objects still using the datastore, you will be informed. You no longer have to mess about creating and deleting rules in the PSA to do this safely. The steps now are:
- Unregister all objects from the datastore including VMs and Templates
- Ensure that no 3rd party tools are accessing the datastore
- Ensure that no vSphere features, such as Storage I/O Control or Storage DRS, are using the device
- Detach the device from the ESX host; this will also initiate an unmount operation
- Physically unpresent the LUN from the ESX host using the appropriate array tools
- Rescan the SAN
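The CLI side of the 5.0 steps above can be sketched in the same way. Again this Python snippet only builds and prints commands rather than running them. The `esxcli storage ...` commands are the 5.0 forms as I recall them from KB 2004605, so verify them there before use. The datastore label, the device identifier and the helper names are hypothetical, and issuing the unmount explicitly before the detach is an assumption made in this sketch, not something the steps above require.

```python
def detach_commands(datastore_label, device_id):
    """Commands to run BEFORE the LUN is unpresented on the array (assumed syntax)."""
    return [
        # Unmount the VMFS volume by label (explicit here; an assumption of this sketch)
        ["esxcli", "storage", "filesystem", "unmount", "-l", datastore_label],
        # Detach the device by switching its state to off
        ["esxcli", "storage", "core", "device", "set", "--state=off", "-d", device_id],
    ]


def rescan_commands():
    """Commands to run AFTER the LUN has been unpresented on the array."""
    return [
        # Rescan all adapters so the host drops the removed LUN
        ["esxcli", "storage", "core", "adapter", "rescan", "--all"],
    ]


if __name__ == "__main__":
    label = "Datastore01"      # hypothetical datastore label
    device = "naa.EXAMPLE_ID"  # hypothetical device identifier
    for cmd in detach_commands(label, device) + rescan_commands():
        print(" ".join(cmd))
```

Compared with the 4.1 procedure, there are no claim rules to create or clean up, only the device itself to name.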
KB 2004605 is very good since it also tells you which features (Storage DRS, Storage I/O Control, etc.) may prevent a successful unmount and detach.
Please pay particular attention to these KB articles if/when you need to unpresent a LUN from an ESX host.
Get notification of these blog postings and more VMware Storage information by following me on Twitter: @VMwareStorage