Another post today in the Nathan Small (Twitter handle: vSphereStorage) series covering deep-dive log analysis in ESX/ESXi environments, where Nathan takes a look at a case where ESX 5 hosts randomly lose paths in an EMC VPLEX environment with HP EVA and EMC VMAX arrays behind a EMC VPLEX. The root cause for the path loss was the massive I/O load happening on the backend array, specifically the HP EVA. The resolution is to reduce the I/O load or spread it out over a longer period of time. Let's see how the story unfolded.
The ESX servers that are in front of the EMC VPLEX are experiencing path failures due to SCSI commands that timeout. These SCSI commands range from 10 bytes read and write operations to LUN heartbeat mechanisms (TEST_UNIT_READY). In this example, we can see that LUN 1 or “Vplex_HP1” disappeared for less than a second. Here is an example of the command timeout that leads to path failures. These messages can be pulled from /var/log/vobd.log
2012-05-22T07:02:04.680Z: [scsiCorrelator] 1680920122975us: [vob.scsi.scsipath.pathstate.dead] scsiPath vmhba2:C0:T1:L1 changed state from on
2012-05-22T07:02:04.681Z: [scsiCorrelator] 1680923124828us: [esx.problem.storage.redundancy.degraded] Path redundancy to storage device naa.6000144000000010a00f545a97c5704a degraded. Path vmhba2:C0:T1:L1 is down. Affected datastores: "Vplex_HP1".
2012-05-22T07:02:04.775Z: [scsiCorrelator] 1680920217911us: [vob.scsi.scsipath.pathstate.on] scsiPath vmhba2:C0:T1:L1 changed state from dead
2012-05-22T07:02:04.776Z: [scsiCorrelator] 1680923219503us: [esx.clear.storage.redundancy.restored] Path redundancy to storage device naa.6000144000000010a00f545a97c5704a (Datastores: "Vplex_HP1") restored. Path vmhba2:C0:T1:L1 is active again.
The above messages would be similar to what is observed in vCenter. Here are the entries from the vmkernel log. Note that extended Qlogic driver logging was enabled:
2012-05-22T07:02:04.680Z cpu12:4628)<6>scsi(8:0:1:1): TIMEOUT status detected 0x6-0x0
2012-05-22T07:02:04.680Z cpu7:2200)PowerPath: EmcpEsxLogEvent:1252: Info:emcp:MpxEsxTestAndUpdatePathState: Path "vmhba2:C0:T1:L1" is changing to dead from on.
Here we have a command timeout, and the command that timed out was a TEST_UNIT_READY or SCSI command 0x0. Even though we cannot confirm the exact command that failed, we can infer this to be the case since the path is marked as dead immediately. If any other command were to fail, we would see NMP errors and then send out a TEST_UNIT_READY to ensure the path is still good to use. Since this wasn’t the case, we have to assume that this command was sent out as path of the Disk.PathEval code that runs every 300 seconds (default).
There is another ESX environment connected to the same backend HP array, however these hosts do not connect to the EMC VPLEX to get to the HP EVA LUNs. During this same time period in that other environment we see many command failures due to a SCSI device status of QUEUE FULL, which means the array cannot keep up with the I/O being issued to it:
2012-05-22T07:02:02.087Z cpu10:2058)ScsiDeviceIO: 2309: Cmd(0x4124403d5ac0) 0x2a, CmdSN 0x800e0011 from world 1829086 to dev "naa.6001438005ded7fb0000500009510000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-05-22T07:02:02.087Z cpu10:2058)ScsiDeviceIO: 2309: Cmd(0x4124414d9d40) 0x2a, CmdSN 0x800e007a from world 1829086 to dev "naa.6001438005ded7fb0000500009510000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-05-22T07:02:04.755Z cpu12:2060)ScsiDeviceIO: 2309: Cmd(0x41244120a8c0) 0x2a, CmdSN 0x800e006e from world 1829086 to dev "naa.6001438005ded7fb00005000096c0000" failed H:0x0 D:0x28 P:0x0 Possible sense data: 0x0 0x0 0x0.
The Queue Full condition is represented by a device status of D:0x28, which translates to TASK_SET_FULL. See KB article Understanding SCSI device/target NMP errors/conditions in ESX/ESXi 4.x and ESXi 5.0 (1030381) for more details, as well as other SCSI Device Statuses.
Even though this HP array cannot keep up with the load, no path failures are observed on these ESX hosts, mostly because the ESX host is aware of the QFULL condition and knows it is a transient one so it knows the paths aren’t actually dead. The ESX hosts that are in front of the VPLEX never see this SCSI device status because the VPLEX does not return this to the hosts. Instead the VPLEX will retry the command for up to 16 seconds before giving up on the command. The ESX hosts have a SCSI command timeout of 5 seconds, so the hosts will give up on the command and send an abort for it well before the VPLEX has given up retrying the command. This would cause the VPLEX to see many command abort requests from the ESX hosts. The VPLEX has a counter to confirm that it is receiving the QFULL condition from the HP array. When reviewing this counter we learned that the QFULL condition was being returned to the VPLEX from the HP EVA array.
The ESX hosts in front of the VPLEX are marking paths as dead because many commands are timing out, particularly TEST_UNIT_READY commands. If the QFULL condition were to be returned to the ESX hosts by the VPLEX, no paths would be marked as dead because the commands failed due to a condition that is meant to be transient. In the end, the amount of load on the HP EVA array needs to be spread out over a longer period of time or staggered more efficiently so that a QFULL condition does not occur.
There has been code introduced into ESX that will automatically reduce the queue depth of LUNs returning a QFULL status. For more information, you can read about it in KB article: Controlling LUN queue depth throttling in VMware ESX/ESXi (1008113).