Today we have a guest post from Karthick Sivaramakrishnan, a three-year veteran at VMware. His primary field of expertise is vSphere Storage and Site Recovery Manager.
This blog post is centered around how ESXi handles unscheduled storage disconnects on vSphere 5.x and 6.x. An unscheduled storage disconnect means that some issue in the vSphere environment has led to an All-Paths-Down (APD) condition for a datastore. An APD situation occurs when an ESXi host has no path left to communicate with a LUN on the storage array.
An ESXi host can encounter an APD under several conditions. As a result, VMs running on the affected datastore may go down, the host could get disconnected from vCenter, and in the worst cases ESXi could become unresponsive.
From vSphere 5.x onwards, ESXi is able to discern whether a disconnect is permanent or transient. A transient disconnect leads to the All Paths Down state, in which ESXi expects the device to come back after a temporary disconnect. With Permanent Device Loss (PDL), the device is expected to have a non-recoverable issue, such as a hardware error or the LUN being unmapped.
In the example below, all iSCSI datastores are in an inactive state.
To determine what caused this issue, we look at the ESXi logs, particularly vmkernel and vobd. The issue is evident in the vmkernel log, and vobd records the corresponding connectivity events.
2017-01-10T13:04:26.803Z cpu1:32896)StorageApdHandlerEv: 110: Device or filesystem with identifier [naa.6000eb31dffdc33a0000000000000028] has entered the All Paths Down state.
2017-01-10T13:04:26.818Z cpu0:32896)StorageApdHandlerEv: 110: Device or filesystem with identifier [naa.6000eb31dffdc33a000000000000002a] has entered the All Paths Down state.
2017-01-10T13:04:26.905Z: [scsiCorrelator] 475204262us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.6000eb31dffdc33a0000000000000028. Path vmhba33:C0:T1:L0 is down. Affected datastores: "Green".
2017-01-10T13:04:26.905Z: [scsiCorrelator] 475204695us: [esx.problem.storage.connectivity.lost] Lost connectivity to storage device naa.6000eb31dffdc33a000000000000002a. Path vmhba33:C0:T0:L0 is down. Affected datastores: "Grey".
From these logs we understand that the ESXi host has lost connectivity to the datastores. Any virtual machines using the affected datastores may become unresponsive. In this example, while the datastores were mounted on ESXi, we lost the network uplink on the NIC that was used for the iSCSI connection. This was a transient issue, and the datastores came back up once the network uplink was restored.
In the example below, datastore Black is in an inactive state.
If we look into the logs to determine what's going on, we see these events.
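When many devices are involved, it can help to pull the APD entries out of a vmkernel log programmatically. The following is only a minimal sketch in Python: the function name find_apd_devices and the regular expression are illustrative assumptions, modeled on the two StorageApdHandlerEv lines quoted above, and may need adjusting for other ESXi builds.

```python
import re

# Assumption: the pattern mirrors the StorageApdHandlerEv lines quoted in this
# post; other ESXi versions may word the message differently.
APD_ENTER = re.compile(
    r"^(?P<ts>\S+)\s.*StorageApdHandlerEv.*identifier \[(?P<dev>[^\]]+)\] "
    r"has entered the All Paths Down state"
)


def find_apd_devices(lines):
    """Return (timestamp, device id) pairs for each APD-entry message."""
    hits = []
    for line in lines:
        match = APD_ENTER.search(line)
        if match:
            hits.append((match.group("ts"), match.group("dev")))
    return hits


# The two vmkernel lines from the example above.
sample = [
    "2017-01-10T13:04:26.803Z cpu1:32896)StorageApdHandlerEv: 110: Device or "
    "filesystem with identifier [naa.6000eb31dffdc33a0000000000000028] has "
    "entered the All Paths Down state.",
    "2017-01-10T13:04:26.818Z cpu0:32896)StorageApdHandlerEv: 110: Device or "
    "filesystem with identifier [naa.6000eb31dffdc33a000000000000002a] has "
    "entered the All Paths Down state.",
]

for timestamp, device in find_apd_devices(sample):
    print(timestamp, device)
```

In practice the lines would come from the vmkernel log file rather than an in-memory list.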
2017-01-09T12:42:09.365Z cpu0:32888)ScsiDevice: 6878: Device naa.6000eb31dffdc33a0000000000000063 APD Notify PERM LOSS; token num:1
2017-01-09T12:42:09.366Z cpu1:32916)StorageApdHandler: 1066: Freeing APD handle 0x430180b88880 [naa.6000eb31dffdc33a0000000000000063]
2017-01-09T12:49:01.260Z cpu1:32786)WARNING: NMP: nmp_PathDetermineFailure:2973: Cmd (0xc1) PDL error (0x5/0x25/0x0) - path vmhba33:C0:T3:L0 device naa.6000eb31dffdc33a0000000000000063 - triggering path evaluation
2017-01-09T12:49:01.260Z cpu1:32786)ScsiDeviceIO: 2651: Cmd(0x439d802ec580) 0xfe, CmdSN 0x4b7 from world 32776 to dev "naa.6000eb31dffdc33a0000000000000063" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.
2017-01-09T12:49:01.300Z cpu0:40210)WARNING: NMP: vmk_NmpSatpIssueTUR:1043: Device naa.6000eb31dffdc33a0000000000000063 path vmhba33:C0:T3:L0 has been unmapped from the array
After some time passes you will see this message:
2017-01-09T13:13:11.942Z cpu0:32872)ScsiDevice: 1718: Permanently inaccessible device :naa.6000eb31dffdc33a0000000000000063 has no more open connections. It is now safe to unmount datastores (if any) and delete the device.
In this case the LUN was unmapped from the array for this host, which is not a transient issue. Sense data 0x5 0x25 0x0 corresponds to “LOGICAL UNIT NOT SUPPORTED”, which indicates the device is in a Permanent Device Loss (PDL) state. Once ESXi knows the device is in a PDL state, it does not wait for the device to return.
ESXi only checks the ASC/ASCQ values, and if they happen to be 0x25/0x0 or 0x68/0x0, it marks the device as PDL.
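The ASC/ASCQ check described above can be illustrated with a short Python sketch. This is not how ESXi implements it internally. It is a hypothetical helper (is_pdl_sense is an assumed name) that applies the same rule to the "Valid sense data" portion of a ScsiDeviceIO log line, using only the two ASC/ASCQ pairs mentioned in this post.

```python
import re

# ASC/ASCQ pairs that this post says cause ESXi to mark a device as PDL.
PDL_ASC_ASCQ = {(0x25, 0x0), (0x68, 0x0)}

# Matches the "Valid sense data: <key> <asc> <ascq>" part of a ScsiDeviceIO line.
SENSE = re.compile(
    r"Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)"
)


def is_pdl_sense(line):
    """Return True if the line carries sense data whose ASC/ASCQ indicates PDL."""
    match = SENSE.search(line)
    if not match:
        return False
    # The sense key is parsed but ignored: only ASC/ASCQ decide the outcome.
    _sense_key, asc, ascq = (int(value, 16) for value in match.groups())
    return (asc, ascq) in PDL_ASC_ASCQ


# Tail of the ScsiDeviceIO line from the example above.
print(is_pdl_sense("failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0."))
```

Keeping the pairs in a set makes it easy to extend the check if the KB article lists further codes for a given release.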
VMware KB 2004684 has in-depth information on APD and PDL situations. It also covers planned and unplanned PDL. You can read it here: Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x (2004684)
Further on in the hostd logs you will see additional events that correlate with the storage connection. Look for the event IDs below.
Event ID : esx.problem.storage.connectivity.lost
The “esx.problem.storage.connectivity.lost” event indicates a loss of connectivity to the specified storage device. Any virtual machines using the affected datastore may become unresponsive.
Event ID : esx.problem.scsi.device.state.permanentloss
The “esx.problem.scsi.device.state.permanentloss” event indicates a permanent device loss.
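To get a quick tally of these two event IDs across a log, a plain substring search is usually enough. The sketch below is a hypothetical Python helper (count_storage_events is an assumed name). It does not assume any particular hostd log layout; the first sample line is trimmed from the vobd excerpt earlier in this post, and the second is a made-up placeholder, not real log output.

```python
from collections import Counter

# The two event IDs discussed above.
EVENT_IDS = (
    "esx.problem.storage.connectivity.lost",
    "esx.problem.scsi.device.state.permanentloss",
)


def count_storage_events(lines):
    """Count how many lines mention each storage event ID."""
    counts = Counter()
    for line in lines:
        for event_id in EVENT_IDS:
            if event_id in line:
                counts[event_id] += 1
    return counts


sample = [
    # Trimmed from the vobd excerpt earlier in this post.
    "2017-01-10T13:04:26.905Z: [scsiCorrelator] 475204262us: "
    "[esx.problem.storage.connectivity.lost] Lost connectivity to storage "
    "device naa.6000eb31dffdc33a0000000000000028.",
    # Hypothetical placeholder line, for illustration only.
    "<timestamp> [esx.problem.scsi.device.state.permanentloss] <message>",
]

for event_id, count in count_storage_events(sample).items():
    print(event_id, count)
```

A non-zero count for the permanentloss event is a hint to look for PDL messages in the vmkernel log for the same time window.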