From the Trenches

Storage Performance and Timeout Issues

Here’s our second log deep-dive from Nathan Small (Twitter handle: vSphereStorage). If you missed the first one, here it is.

History of the issue

A general storage performance issue accompanied by timeouts and SCSI reservation issues (timeout/conflict) in ESX 4.1 (VMware View) and ESX 5.0 environments. We need to determine why these events are being seen and how we can improve performance.
 
The first step when troubleshooting this type of issue is to establish a frame of reference that is visible across all hosts. Here is the issue observed on one of the ESX 4.1 hosts:
 
ESXi 4.1:
 
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba1:C0:T2:L1, reservation state on device naa.60060480000190100554533032324644 is unknown.
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)ScsiDeviceIO: 1688: Command 0x16 to device “naa.60060480000190100554533032324644” failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device “naa.60060480000190100554533032324644” state in doubt; requested fast path state update…
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu14:4134)WARNING: FS3: 7120: Reservation error: Timeout
 
When reviewing the logs from the ESX 5.0 host, we see the issue observed at the same time:
 
ESXi 5.0:
 
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T3:L23, reservation state on device naa.60060480000190100554533033364445 is unknown.
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x16 (0x4124003d4740, 0) to dev “naa.60060480000190100554533033364445” on path “vmhba2:C0:T3:L23” Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2012-03-28T17:12:56.481Z cpu11:4107)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe:237:NMP device “naa.60060480000190100554533033364445” state in doubt; requested fast path state update…
2012-03-28T17:12:56.481Z cpu11:4107)ScsiDeviceIO: 2322: Cmd(0x4124003d4740) 0x16, CmdSN 0x23102 from world 0 to dev “naa.60060480000190100554533033364445” failed H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2012-03-28T17:12:56.481Z cpu1:4132)WARNING: FS3: 1835: Reservation error: Timeout
 
These logs show that the ESX storage stack does not know what the SCSI reservation state of the device is. This happens when a SCSI RESERVE command is sent but does not complete to the target due to a timeout, and the abort subsequently issued for that RESERVE also times out. This will usually happen for one of two reasons (a quick way to scan your hosts for these events is sketched after the list):
 
1. Fabric issues where frames are dropped
2. Array or array controller is overloaded
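
Before handing this off, it is worth checking how widespread these events are. The following is only a sketch: it greps the vmkernel log for the two signatures shown above, assuming the usual log locations (/var/log/vmkernel* on classic ESX 4.x, /var/log/vmkernel.log on ESXi 5.0); point it at whichever file carries the vmkernel messages on your build:

# grep -c "Reservation error: Timeout" /var/log/vmkernel*
# grep -c "state in doubt" /var/log/vmkernel*

Running the same two commands on every host in the cluster quickly shows whether the timeouts line up across hosts (pointing at the fabric or the array) or are isolated to a single host (pointing at that host's HBA, cabling, or zoning).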
 
There are other messages in the logs that also point to an array performance issue:
 
2012-03-26T23:00:30.950Z cpu8:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW02’ (held for 1457 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-27T23:01:15.491Z cpu4:4133)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW04’ (held for 2640 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-28T17:12:59.456Z cpu15:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW07’ (held for 2967 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
2012-03-28T20:05:19.372Z cpu0:4131)FS3Misc: 1440: Long VMFS3 rsv time on ‘VMW07’ (held for 19196 msecs). # R: 1, # W: 1 bytesXfer: 2 sectors
 
See KB 1025299 for more information on this message: http://kb.vmware.com/kb/1025299
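
If you want a rough feel for how long these reservations are actually being held, the "held for N msecs" value can be pulled straight out of the messages above. This is only a sketch against that message format, using the same vmkernel log locations as before:

# grep -c "Long VMFS3 rsv time" /var/log/vmkernel*
# grep "Long VMFS3 rsv time" /var/log/vmkernel* | sed 's/.*held for \([0-9]*\) msecs.*/\1/' | sort -n | tail -5

The first command shows how often the message is being logged; the second prints the five longest hold times in milliseconds, which makes occasional blips easy to tell apart from the multi-second (or, as above, 19-second) holds that point to a struggling array.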
 
Simply stating that this is a performance issue that should be investigated further by the storage team or the array vendor may be acceptable for some; however, let's dig into this further and provide more detail if that is the direction we decide to go. To start, let's take a closer look at the LUNs reporting the issue in both environments:
 
ESX 4.1:
 
Error message:
 
Mar 28 17:12:58 vmkernel: 34:11:37:08.013 cpu13:4262)NMP: nmp_PathDetermineFailure: SCSI cmd RESERVE failed on path vmhba1:C0:T2:L1, reservation state on device naa.60060480000190100554533032324644 is unknown.
 
LUN in question: naa.60060480000190100554533032324644
 
# esxcfg-scsidevs -m |grep naa.60060480000190100554533037394644
 
naa.60060480000190100554533037394644:1                           /vmfs/devices/disks/naa.60060480000190100554533037394644:1 4d433199-f78406e2-ccc3-1cc1de093452  0  Lun_56
 
# esxcfg-mpath -b|grep -A4 naa.60060480000190100554533032324644
 
naa.60060480000190100554533032324644 : EMC Fibre Channel Disk (naa.60060480000190100554533032324644)
   vmhba0:C0:T2:L1 LUN:1 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:24 WWPN: 20:00:00:25:b5:01:0a:24  Target: WWNN: 50:06:04:82:d5:2d:12:8d WWPN: 50:06:04:82:d5:2d:12:8d
   vmhba1:C0:T2:L1 LUN:1 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:24 WWPN: 20:00:00:25:b5:01:0b:24  Target: WWNN: 50:06:04:82:d5:2d:12:82 WWPN: 50:06:04:82:d5:2d:12:82
 
From this output we can see the WWPNs (World Wide Port Names) of the connected array ports, which are 50:06:04:82:d5:2d:12:8d and 50:06:04:82:d5:2d:12:82.
 
ESX 5.0:
 
Error message:
 
2012-03-28T17:12:56.481Z cpu11:4107)NMP: nmp_PathDetermineFailure:2084: SCSI cmd RESERVE failed on path vmhba2:C0:T3:L23, reservation state on device naa.60060480000190100554533033364445 is unknown.
 
LUN in question: naa.60060480000190100554533033364445
 
# esxcfg-scsidevs -m |grep naa.60060480000190100554533033364445
 
naa.60060480000190100554533033364445:1                           /vmfs/devices/disks/naa.60060480000190100554533033364445:1 4e70c147-a779fe32-8747-0025b5010a6a  0  VMW07
 
# esxcfg-mpath -b| grep -A4 naa.60060480000190100554533033364445
 
naa.60060480000190100554533033364445 : EMC Fibre Channel Disk (naa.60060480000190100554533033364445)
   vmhba2:C0:T3:L23 LUN:23 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:0d WWPN: 20:00:00:25:b5:01:0a:0d  Target: WWNN: 50:06:04:82:d5:2d:12:8d WWPN: 50:06:04:82:d5:2d:12:8d
   vmhba3:C0:T3:L23 LUN:23 state:active fc Adapter: WWNN: 20:00:00:25:b5:01:00:0d WWPN: 20:00:00:25:b5:01:0b:0d  Target: WWNN: 50:06:04:82:d5:2d:12:82 WWPN: 50:06:04:82:d5:2d:12:82
 
We can see from the esxcfg-mpath output that the LUN having the same issue in this cluster is connected to the same WWPNs: 50:06:04:82:d5:2d:12:8d and 50:06:04:82:d5:2d:12:82.
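
To see at a glance how a host's paths are spread across the array's ports, the same esxcfg-mpath -b output can be summarized. This is just a sketch that assumes the target WWPN is the last field on each path line, as in the output above:

# esxcfg-mpath -b | grep "Target:" | awk '{print $NF}' | sort | uniq -c

A heavy skew toward one or two target WWPNs here, on hosts from two separate environments, is what makes the shared FA ports the prime suspect.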
 
Given that two separate environments are connected to the same array controller ports (FA ports, in this case), it is highly likely that the array controller is overloaded. The next step would be to collect performance information from the array side from this point forward, so that when another event occurs the storage team or the storage array vendor can review the health of the array for that timeframe and confirm whether it can keep up with the I/O being generated.
 
In this particular case, the storage team and EMC became involved and confirmed that the FA port was overloaded. To distribute the load, and to ensure View best practices are followed, the VMware View environment was moved onto a different, dedicated FA on the array.

Comments


  1. I do not agree with your assessment unless ALL of our storage arrays are slow. We have the ‘Long VMFS3’ error on multiple hosts attached to multiple different arrays.

  2. Hi Anthony,
    You may see these messages in your environment from time to time; all they really mean is that it took longer than 200 ms (the default) to reserve the LUN, do an operation, and then release the LUN. This is merely a performance indicator, not an absolute, especially if these messages are seen very infrequently.
    If you are seeing these messages in your environment frequently then you may want to gather performance information from the arrays. Alternatively, if you mostly see these messages in the middle of the night or off hours, then it would be safe to ignore these messages since they are likely occurring due to backups, replication, or other load operations on the array.

  3. Experienced the SCSI “Reservation error: Timeout” issue yesterday on a cluster of ESX 4.1 hosts. They were only complaining about one of the LUNs. Very APD-like in the way that the hosts were affected. Having the LUN reset on the storage array side helped bring hosts back for the most part. Still investigating and working with vendors to determine cause.
