

vSphere 5.0 Storage Features Part 8 – Handling the All Paths Down (APD) condition

All Paths Down (APD) is an issue which has come up time and time again, and has impacted a number of our customers. Let's start with a brief description of what All Paths Down (APD) actually is, how it occurs, and what impact it has on the host. Then we'll get into how we have improved the behaviour in 5.0.

A brief overview of APD

APD is what occurs on an ESX host when a storage device is removed from the host in an uncontrolled manner (or the device fails), and the VMkernel core storage stack does not know how long the loss of device access will last. A typical way of getting into APD would be a Fibre Channel switch failure or (in the case of an iSCSI array) a network connectivity issue. Note that if careful consideration is given to the redundancy of your multipathing solution, and our best practices for setting up switch and HBA redundancy are followed, this goes a long way towards avoiding the situation in the first place.
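As an aside, a quick way to check that you really do have path redundancy to a device is to list its paths from the ESXi shell. Something along these lines (using the NAA id of the device I use in the example later in this post) should display each path to the device along with its current state, so a lost path can be spotted before it turns into a full-blown APD:

~ # esxcli storage core path list -d naa.60a98000572d54724a34642d71325763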

The APD condition could be transient, since the device or switch might come back, or it could be permanent insofar as the device might never come back. In the past, we kept the I/O queued indefinitely, and this resulted in I/Os to the device hanging (never acknowledged). This became particularly problematic when someone issued a rescan of the SAN from a host or cluster, which was typically the first thing customers tried when they found a device was missing. The rescan operation caused hostd to block, waiting for a response from the devices (which never comes). Because hostd is blocked waiting on these responses, it can't be used by other services, like the vpx agent (vpxa) which is responsible for communication between the host and vCenter. The end result is the host becoming disconnected from vCenter. And if the device is never coming back, well, we're in a bit of a pickle! :-(

It should also be noted that hostd could grind to a halt even without a rescan of the SAN being initiated. The problem is that hostd has a limited number of worker threads. If enough of these threads get stuck waiting for I/O to a device that is not responding, hostd will eventually be unable to communicate with anything else, including healthy devices, because it doesn't have any free worker threads left to do the work.

 

APD Handling before vSphere 5.0

To alleviate some of the issues arising from APD, a number of advanced settings were added which could be tweaked by the admin to mitigate this behaviour. Basically this involved not blocking hostd when it was scanning the SAN, even when it came across a non-responding device (i.e. a device in APD state). This setting was automatically added in ESX 4.1 Update 1 and ESX 4.0 Update 3. Previous versions required customers to manually add the setting. This is discussed in greater detail in KB 1016626. 

 

APD Handling Enhancements in vSphere 5.0 – Introducing Permanent Device Loss (PDL)

In vSphere 5.0, we have made improvements to how we handle APD. First, what we've tried to do is differentiate between a device to which connectivity is permanently lost, and a device to which connectivity is transiently or temporarily lost. We now refer to the condition where a device is never coming back as Permanent Device Loss (PDL):
  • APD is now considered a transient condition, i.e. the device may come back at some time in the future.
  • PDL is considered a permanent condition where the device is never coming back.

As mentioned earlier, I/O to devices in an APD state was queued indefinitely. With PDL devices (those devices which are never coming back), we will now fail the I/Os to those devices immediately. This means that we will not end up in a situation where processes such as hostd get blocked waiting on I/O to these devices, which in turn means that we don't end up in the situation where the host disconnects from vCenter.

This begs the question – how do we differentiate between devices which are in APD and devices which are in PDL?

The answer is via SCSI sense codes. SCSI devices can indicate a PDL state with a number of sense codes returned on a SCSI command. One such combination is sense key 5h / ASC=25h / ASCQ=0 (ILLEGAL REQUEST / LOGICAL UNIT NOT SUPPORTED). The sense code returned is a function of the device: the array is in the best position to determine whether the requests are for a device that no longer exists, or for a device that just has an error/problem. In fact, in the case of APD, we do not get any sense code back from the device at all.

When the last path to the device in PDL returns the appropriate sense code, internally the VMkernel changes the device state to one which indicates a PDL. It is important to note that ALL paths must be in PDL for the DEVICE to become PDL. Once a device is in this state, commands issued to that device will return with VMK_PERM_DEV_LOSS. In other words, we now fail the I/Os immediately rather than have I/Os hanging. This will mean that with PDL, hostd should never become blocked, and no hosts should disconnect from vCenter. Hurrah!

None of these distinctions between APD and PDL are directly visible to the Virtual Machines that are issuing I/Os. SCSI commands from a VM to an APD/PDL device are simply not responded to by the VMkernel, and when they time out, the I/O is simply retried. In other words, VMs retry their I/O indefinitely. That is why in most cases, if the device which was in APD does come back, the VMs continue to run from where they left off – this is one of the really great features of virtualization in my opinion (and it has saved many an admin who inadvertently disconnected the wrong cable or offlined the wrong device) :-)

 

Best Practice to correctly remove a device & avoid APD/PDL

There has never been an intuitive or well-defined way to remove a storage device from an ESX host in the past. In 5.0, we now have a controlled procedure for doing this.

5.0 introduces two new storage device management techniques: you now have the ability to mount/unmount a VMFS volume, and to attach/detach a device. Therefore, if you want to remove a device completely, the first step is to unmount the volume. In this example, I have a NetApp device on which a VMFS-5 filesystem has been created. First, get the NAA id of the underlying LUN as you will need it later. This can be found by clicking on the Properties of the datastore and looking at the extent details:

[Screenshot: datastore Properties dialog showing the NAA id in the extent details]
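If you prefer the ESXi shell, the same mapping from datastore to device can be retrieved with esxcli; the following should list each VMFS volume along with the NAA id of the device backing its extent(s):

~ # esxcli storage vmfs extent list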

Before proceeding, make sure that there are no running VMs on the datastore, that the datastore is not part of a datastore cluster (i.e. managed by Storage DRS), is not used by vSphere HA as a heartbeat datastore, and does not have Storage I/O Control enabled.

With the NAA id noted, right click on the volume in the Configuration > Storage > Datastores view, and select Unmount:

[Screenshot: Unmount option in the datastore right-click menu]

When the unmount operation is selected, a number of checks are done on the volume to ensure that it is in a state that allows it to be unmounted, i.e. no running VMs, Storage I/O Control not enabled, not used as a Heartbeat Datastore, not managed by Storage DRS.

[Screenshot: Confirm unmount dialog listing the prerequisite checks]
If the checks pass, click OK and the volume is unmounted.

[Screenshot: the datastore shown as unmounted in the Datastores view]
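For reference, here is roughly what the same unmount looks like from the ESXi shell. The volume label used below is just an example; list the filesystems first and substitute your own datastore name (or use the VMFS UUID with the -u option instead):

~ # esxcli storage filesystem list
~ # esxcli storage filesystem unmount -l NetApp-VMFS-5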

The CLI equivalent, if you prefer the ESXi shell, is esxcli storage filesystem unmount, as sketched above. Once the volume is safely unmounted, the next step is to detach the device from the host. This can also be done either via the CLI or via the UI in the Configuration > Storage window, but the view must be changed to Devices rather than Datastores. Click on the Devices button, select the correct device using the NAA id noted previously, right click and select Detach:

[Screenshot: Detach option in the device right-click menu of the Devices view]
The detach operation checks that the device is indeed in a state that allows it to be detached:

[Screenshot: Confirm detach dialog listing the prerequisite checks]
The same task can be done via the ESXi shell. The command to do a detach is esxcli storage core device set --state=off. Note how the Status changes from on to off in the following output:

~ # esxcli storage core device list -d naa.60a98000572d54724a34642d71325763
naa.60a98000572d54724a34642d71325763
   Display Name: NETAPP Fibre Channel Disk (naa.60a98000572d54724a34642d71325763)
   Has Settable Display Name: true
   Size: 3145923
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.60a98000572d54724a34642d71325763
   Vendor: NETAPP
   Model: LUN
   Revision: 7330
   SCSI Level: 4
   Is Pseudo: false
   Status: on
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Thin Provisioning Status: yes
   Attached Filters:
   VAAI Status: unknown
   Other UIDs: vml.020000000060a98000572d54724a34642d713257634c554e202020
 

~ # esxcli storage core device set --state=off -d naa.60a98000572d54724a34642d71325763
 

~ # esxcli storage core device list -d naa.60a98000572d54724a34642d71325763
naa.60a98000572d54724a34642d71325763
   Display Name: NETAPP Fibre Channel Disk (naa.60a98000572d54724a34642d71325763)
   Has Settable Display Name: true
   Size: 3145923
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path:
   Vendor: NETAPP
   Model: LUN
   Revision: 7330
   SCSI Level: 4
   Is Pseudo: false
   Status: off
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Thin Provisioning Status: yes
   Attached Filters:
   VAAI Status: unknown
   Other UIDs: vml.020000000060a98000572d54724a34642d713257634c554e202020
~ #

This device is now successfully detached from the host, though it remains visible in the UI at this point:

[Screenshot: the detached device greyed out in the Devices view]

You can now do a rescan of the SAN and this will safely remove the device from the host.
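The shell version of this last step would be something like the following, i.e. rescan, then check the list of devices that the host is keeping in the detached state:

~ # esxcli storage core adapter rescan --all
~ # esxcli storage core device detached list

If the LUN really is gone for good, its entry can later be removed from that detached list with esxcli storage core device detached remove.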

 

APD may still occur

The PDL state is definitely a major step forward in handling APD conditions. The unmount/detach mechanism should also alleviate certain APD conditions. However, there is still a chance that an APD condition can occur. For instance, if the LUN fails in a way which does not return the sense codes expected for PDL, then you could still experience APD in your environment. VMware is continuing to work on the APD behaviour to mitigate the impact it might have on your infrastructure.

 

Detecting PDL Caveat

There is one important caveat to our ability to detect PDL. Some iSCSI arrays map a LUN to a target as a one-to-one relationship, i.e. there is only ever a single LUN per target.

In this case, the iSCSI array does not return the appropriate SCSI sense code, so we cannot detect PDL on these array types.

However, most other storage arrays on our HCL should be able to provide the SCSI sense codes needed to enable the VMkernel to detect PDL.

Follow these blogs and more VMware Storage information on Twitter: VMwareStorage

This entry was posted in Storage, vSphere by Cormac Hogan.

About Cormac Hogan

Cormac Hogan is a senior technical marketing architect within the Cloud Infrastructure Product Marketing group at VMware. He is responsible for storage in general, with a focus on core VMware vSphere storage technologies and virtual storage, including the VMware vSphere® Storage Appliance. He has been in VMware since 2005 and in technical marketing since 2011.

21 thoughts on “vSphere 5.0 Storage Features Part 8 – Handling the All Paths Down (APD) condition”

  1. Jeff Couch

    Great article! Thanks for the in depth explanation of APD. This has been a huge pain point for us in the past. I hope this new procedure helps.

    Reply
  2. Loren

    Great post, thanks! One question…what if you’ve unmounted a LUN, rescan to remove the device from the host, and then decide you want it back?
    Cheers,
    -Loren

    Reply
  3. Andy Banta

    Great article, Cormac. The reason no sense information is returned by iSCSI arrays with one LU per target is that no session exists with the target once it goes away. With iSCSI, sessions are between the host and the target. If there's no target any more, there's no session, and therefore no way to receive SCSI sense information.
    We’re developing an alternative way to gather the same information on iSCSI storage with one LU per target, so PDL behavior is consistent.
    Thanks,
    Andy

    Reply
  4. Chogan

    Hi Loren, if the LUN is offlined only, then you can online it once again on the array, rescan the SAN, and the device should appear back on the ESX hosts, but detached if the procedure detailed above was followed. At this point, you would have to ‘attach’ the device to make it fully visible (via UI or CLI). If there was a VMFS filesystem on the LUN, you will have to do an additional ‘mount’ step to make the filesystem visible. GSS are putting together some KBs around this procedure which will be ready for GA.

    Reply
  5. Keith Farkas

    A very informative article, Cormac. One clarification regarding the interaction between vSphere HA and unmount.
    As you note, if a datastore is being used by vSphere HA for heartbeating, this activity must be stopped before the datastore can be unmounted. Note, however, that if the unmount is done using vCenter Server, the quiescing is done automatically. I.e., as part of the unmount workflow, vCenter Server will tell the vSphere HA agent on the host to stop using the datastore for heartbeating, and will, if there is a suitable alternative datastore, tell it to begin using one of these for heartbeating.
    But if the unmount is done without using vCenter Server, as you note, HA may need to be manually reconfigured first. If vCenter Server is not used, first check whether HA is using the datastore for heartbeating. If it is, then reconfigure HA to exclude this datastore from those from which vSphere HA selects its heartbeat datastores. Note that this exclusion will apply to all hosts in the cluster and not just the one on which the unmount is to be done. Once the unmount has been completed, if you plan to leave the datastore mounted on other hosts, you should reconfigure HA to allow the datastore to be used for heartbeating again.

    Reply
  6. mayonnaise

    Hi! Very useful article!
    Question – if a host is only in SAN zoning with the array, but no LUNs are presented, does this mean that the host receives SCSI sense codes?

    Reply
  7. Chogan

    Hi there,
    Yes, we could still receive SCSI sense codes, but they would come from the HBA (host) rather than from the target/device. For instance, in the case of Fibre Channel, an FC cable disconnect would send a SCSI sense code of NO CONNECT up to the virtual SCSI layer, even without the host seeing the target or any LUNs.

    Reply
  8. mayonnaise

    Thanks!
    Please clarify: if I had a LUN in the array and the SAN admin then unpresented this LUN from the ESXi hosts (with no unmount/detach operations done from the ESXi side), I will not get APD but rather an unplanned PDL, because the array sends SCSI sense codes to the hosts? Right?

    Reply
  9. mayonnaise

    And another question. Will APD occur sooner on hosts which have powered-on VMs than on hosts without VMs? Is there any dependency on VMs executing?

    Reply
  10. Chogan

    On your first point, yes, this is one of the scenarios where PDL should now happen. However, it will be dependent on the type of SCSI sense codes returned by the array. We have worked a significant amount with our storage array partners to identify the sense codes returned when such a condition occurs, but it may be that some arrays return something we're not expecting, in which case we may still enter APD.
    On your second point, generating I/Os to a device (either from the VMkernel, from userworlds like hostd, or via VMs) would certainly cause the APD or PDL state to be entered sooner rather than later. There is no dependency on VMs executing. A simple rescan of the SAN could also lead to it.

    Reply
  11. Wen Yang

    Thanks! This is very informative and helps me understand how APD and PDL are handled in vSphere 5.0.
    On the other side, while I am doing storage testing with ESXi 5, my storage configuration changes from time to time. Due to the large number of LUNs and NFS shares, there is a lot of manual work to unmount and detach them every time I change the storage configuration, or simply reboot my storage controller. Currently, when I remove the LUNs/shares on the storage without unmounting/detaching them from ESX, those datastores and storage devices become stale (greyed out in the vSphere client), and ESX sometimes hangs at the HBA rescan.
    I am wondering, for testing purposes only, if there are any settings I can tweak to have ESX let go of those PDL LUNs/shares and simply detect what's available at present. That would make my storage testing more efficient.
    Any advice on that would be highly appreciated! Thanks!

    Reply
  12. Cormac Hogan

    I am not aware of any setting to help you mitigate this in 5.0. However the next release of vSphere does have a number of new enhancements around APD & PDL which will hopefully help. I hope to be able to talk about these soon.

    Reply
  13. George

    Equallogic is one storage vendor that presents one LUN per target, so no sense codes are received by ESXi 5.
    Commvault backups rely on temporarily mounting volume snapshots, and when they are removed, the LUNs are unpresented and deleted by Equallogic before the ESXi host has had time to properly unmount these temporary datastores. This causes APD, and the host disconnects from vCenter because of rescanning issues.

    Reply
  14. Cormac Hogan

    We are looking to handle this condition on single-target, single-LUN arrays going forward. I can't say any more about this right now except to keep following the blog.

    Reply
  15. Phil D.

    We had this happen to us this past Friday (ESXi 4.1 U3). A bad HBA in one host kazooed a single LUN for all the hosts using that LUN. All of the hosts were disconnected from vCenter until the “bad” host was powered off, and the VMs on that LUN crashed.

    Luckily we knew which host was “bad” because it had other symptoms. If we hadn't, I'd probably be writing this from home (unemployed).

    This is crazy and needs to be reworked so that a bad LUN means a bad LUN rather than a bad datacentre!

    Reply
  16. Chris S.

    Great explanation, thanks! One question on NFS datastores: how will NFS datastores be handled in a PDL? With NFS we will have no SCSI sense code, right? Can you explain to me how this is handled?

    Thanks,
    Chris

    Reply
      1. Chris S.

        Hello Cormac

        Thanks for your answer! I have read the paper you mentioned. There is one question I did not find an answer to in there. If I implement a vMSC, for example with a NetApp array and NFS datastores, do I have to configure the PDL parameter disk.terminateVMOnPDLDefault=TRUE in the /etc/vmware/settings file and das.maskCleanShutdownEnabled=True at the HA cluster level? Does this make sense, given that both parameters are related to PDL? PDL is SCSI sense code related as you mentioned in this blog, so would it help with NFS datastores also, or can I ignore those two settings?

        Reply
