
Another deep-dive troubleshooting blog today from Nathan Small (twitter account: @vSphereStorage)
 
Description from customer:
 
Host is getting disconnected from vCenter and VMs are showing as inaccessible. Only one host is affected.
 
 
Analysis:
 
A quick review of the vmkernel log shows a log spew of H:0x7 errors to numerous LUNs. Here is a short snippet where you can see how frequently they are occurring (multiple times per second):
 
# cat /var/log/vmkernel.log
 
2016-01-13T18:54:42.994Z cpu68:8260)ScsiDeviceIO: 2326: Cmd(0x412540b96e80) 0x28, CmdSN 0x8000006b from world 11725 to dev "naa.600601601b703400a4f90c3d0668e311" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:43.027Z cpu68:8260)ScsiDeviceIO: 2326: Cmd(0x4125401b2580) 0x28, CmdSN 0x8000002e from world 11725 to dev "naa.600601601b70340064a24ada10fae211" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:43.030Z cpu68:8260)ScsiDeviceIO: 2326: Cmd(0x4125406d5380) 0x28, CmdSN 0x80000016 from world 11725 to dev "naa.600601601b7034000c70e4e610fae211" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:43.542Z cpu67:8259)ScsiDeviceIO: 2326: Cmd(0x412540748800) 0x28, CmdSN 0x80000045 from world 11725 to dev "naa.600601601b70340064a24ada10fae211" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:43.808Z cpu74:8266)ScsiDeviceIO: 2326: Cmd(0x412541229040) 0x28, CmdSN 0x8000003c from world 11725 to dev "naa.600601601b7034008e56670a11fae211" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:44.088Z cpu38:8230)ScsiDeviceIO: 2326: Cmd(0x4124c0ff4f80) 0x28, CmdSN 0x80000030 from world 11701 to dev "naa.600601601b703400220f77ab15fae211" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:44.180Z cpu74:8266)ScsiDeviceIO: 2326: Cmd(0x412540ccda80) 0x28, CmdSN 0x80000047 from world 11725 to dev "naa.600601601b70340042b582440668e311" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:44.741Z cpu61:8253)ScsiDeviceIO: 2326: Cmd(0x412540b94480) 0x28, CmdSN 0x80000051 from world 11725 to dev "naa.600601601b70340060918f5b0668e311" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:44.897Z cpu63:8255)ScsiDeviceIO: 2326: Cmd(0x412540ff3180) 0x28, CmdSN 0x8000007a from world 11725 to dev "naa.600601601b7034005c918f5b0668e311" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:45.355Z cpu78:8270)ScsiDeviceIO: 2326: Cmd(0x412540f3b2c0) 0x28, CmdSN 0x80000039 from world 11725 to dev "naa.600601601b70340060918f5b0668e311" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:45.522Z cpu70:8262)ScsiDeviceIO: 2326: Cmd(0x41254073d0c0) 0x28, CmdSN 0x8000002c from world 11725 to dev "naa.600601601b7034000e3e97350668e311" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:45.584Z cpu71:8263)ScsiDeviceIO: 2326: Cmd(0x412541021780) 0x28, CmdSN 0x80000067 from world 11725 to dev "naa.600601601b7034000e3e97350668e311" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:45.803Z cpu63:8255)ScsiDeviceIO: 2326: Cmd(0x412540d20480) 0x28, CmdSN 0x80000019 from world 11725 to dev "naa.600601601b703400d24fc7620668e311" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:46.253Z cpu74:8266)ScsiDeviceIO: 2326: Cmd(0x412540b96380) 0x28, CmdSN 0x8000006f from world 11725 to dev "naa.600601601b7034005e918f5b0668e311" failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
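
Before digging further, it helps to quantify the spew. Here is a quick sketch using the busybox shell tools available on the host (the device extraction assumes the log lines follow the exact format shown above):

# grep -c "H:0x7" /var/log/vmkernel.log
# grep "H:0x7" /var/log/vmkernel.log | grep -o 'naa\.[0-9a-f]*' | sort | uniq -c | sort -rn | head

The first command gives the total number of H:0x7 failures in the log; the second tallies them per device, worst first.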
 
The host-side error (H:0x7) literally translates to Storage Initiator Error, which makes it sound like something is physically wrong with the card. One needs to understand that this status is sent up the stack by the HBA driver, so it is really up to those who write the driver to decide which conditions receive this status. As there are no accompanying errors from the HBA driver, which in this case is a Brocade HBA, this is all we have to work with without enabling verbose logging in the driver. Verbose logging requires a reboot, so it is not always an option when investigating root cause. The exception is when the issue is ongoing, in which case rebooting a host to capture this data is a viable option.
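
If a reboot to enable verbose logging is on the table, the options the bfa module is currently loaded with can be reviewed first. A minimal sketch; the option that controls log verbosity is driver specific, so "log_level" below is only a placeholder for whatever the vendor documentation specifies:

# esxcfg-module -g bfa
# esxcfg-module -s "log_level=3" bfa

The first command shows the options the module is currently loaded with; the second sets an option, which takes effect only after a reboot.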
 
Taking one LUN from ‘esxcfg-mpath -b’ output as an example to get a view of the paths and targets:
 
# esxcfg-mpath -b
 
naa.600601601b703400b6aa124c0668e311 : DGC Fibre Channel Disk (naa.600601601b703400b6aa124c0668e311)
   vmhba0:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9a WWPN: 20:01:74:86:7a:ae:1c:9a  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:63:47:20:7a:a8
   vmhba1:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba0:C0:T1:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9a WWPN: 20:01:74:86:7a:ae:1c:9a  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:6b:47:20:7a:a8
   vmhba1:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
   vmhba2:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:32 WWPN: 20:01:74:86:7a:ae:1c:32  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:63:47:20:7a:a8
   vmhba3:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba2:C0:T1:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:32 WWPN: 20:01:74:86:7a:ae:1c:32  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:6b:47:20:7a:a8
   vmhba3:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
 
Let’s look at the adapter statistics for all HBAs. I would recommend always using localcli over esxcli when troubleshooting, as esxcli requires hostd to be functioning properly:
 
# localcli storage core adapter stats get
 
vmhba0:
   Successful Commands: 844542177
   Blocks Read: 243114868277
   Blocks Written: 25821448417
   Read Operations: 395494703
   Write Operations: 405753901
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 35403
   Failed Blocks Read: 57744
   Failed Blocks Written: 16843
   Failed Read Operations: 8224
   Failed Write Operations: 16450
   Failed Reserve Operations: 0
   Total Splits: 0
   PAE Commands: 0
 
vmhba1:
   Successful Commands: 502595840 <-- Far fewer successful commands than the other adapters
   Blocks Read: 116436597821
   Blocks Written: 16509939615
   Read Operations: 216572537
   Write Operations: 245276523
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 10942696
   Failed Blocks Read: 12055379188 <-- 12 billion failed blocks read! The other adapters are all under 60,000
   Failed Blocks Written: 933809
   Failed Read Operations: 10895926
   Failed Write Operations: 25645
   Failed Reserve Operations: 0
   Total Splits: 0
   PAE Commands: 0
 
vmhba2:
   Successful Commands: 845976973
   Blocks Read: 244034940187
   Blocks Written: 26063852941
   Read Operations: 397564994
   Write Operations: 407538414
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 40468
   Failed Blocks Read: 44157
   Failed Blocks Written: 18676
   Failed Read Operations: 5506
   Failed Write Operations: 12152
   Failed Reserve Operations: 0
   Total Splits: 0
   PAE Commands: 0
 
vmhba3:
   Successful Commands: 866718515
   Blocks Read: 249837164491
   Blocks Written: 26492209531
   Read Operations: 406367844
   Write Operations: 416901703
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 37723
   Failed Blocks Read: 23191
   Failed Blocks Written: 139380
   Failed Read Operations: 7372
   Failed Write Operations: 14878
   Failed Reserve Operations: 0
   Total Splits: 0
   PAE Commands: 0
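
When eyeballing four blocks of output gets tedious, the counter of interest can be pulled out for all adapters in one pass. A small sketch that keys on the vmhba header lines in the stats output above:

# localcli storage core adapter stats get | awk '/^vmhba/ {hba=$1} /Failed Commands:/ {print hba, "Failed Commands:", $3}'
vmhba0: Failed Commands: 35403
vmhba1: Failed Commands: 10942696
vmhba2: Failed Commands: 40468
vmhba3: Failed Commands: 37723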
 
 
Let’s see how often vmkernel.log reports messages for each HBA:
 
# cat vmkernel.log |grep vmhba0|wc -l
112
 
# cat vmkernel.log |grep vmhba1|wc -l
8474 <-- this HBA is mentioned over 8,000 times! This doesn’t mean they are all errors, of course, but given the log spew we already know is occurring, most of them likely are
 
# cat vmkernel.log |grep vmhba2|wc -l
222
 
# cat vmkernel.log |grep vmhba3|wc -l
335
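
The same comparison can be produced in one pass with a small shell loop:

# for hba in vmhba0 vmhba1 vmhba2 vmhba3; do echo "$hba: $(grep -c $hba vmkernel.log)"; done
vmhba0: 112
vmhba1: 8474
vmhba2: 222
vmhba3: 335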
 
Now let’s take a look at the zoning to see if multiple adapters are zoned to the exact same array targets (WWPNs), in an attempt to determine whether the issue is on the array side or the HBA side:
 
# esxcfg-mpath -b
 
naa.600601601b703400b6aa124c0668e311 : DGC Fibre Channel Disk (naa.600601601b703400b6aa124c0668e311)
   vmhba0:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9a WWPN: 20:01:74:86:7a:ae:1c:9a  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:63:47:20:7a:a8
   vmhba1:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba0:C0:T1:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9a WWPN: 20:01:74:86:7a:ae:1c:9a  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:6b:47:20:7a:a8
   vmhba1:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
   vmhba2:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:32 WWPN: 20:01:74:86:7a:ae:1c:32  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:63:47:20:7a:a8
   vmhba3:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba2:C0:T1:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:32 WWPN: 20:01:74:86:7a:ae:1c:32  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:6b:47:20:7a:a8
   vmhba3:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
 
Let’s isolate the HBAs so it is easier to visually compare the WWPNs of the array targets:
 
vmhba1:
 
   vmhba1:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba1:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
 
vmhba3:
 
   vmhba3:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba3:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
 
vmhba1 and vmhba3 are zoned to the exact same array ports yet only vmhba1 is experiencing communication issues/errors.
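
On hosts with too many paths to compare by eye, the adapter/target pairings can be generated instead. A rough sketch, assuming the path lines keep the exact layout shown above (the target WWPN is the last field on each line):

# esxcfg-mpath -b | grep "fc Adapter" | awk '{split($1, p, ":"); print p[1], $NF}' | sort -u

Each unique adapter-to-target-WWPN pair prints once, so two HBAs zoned identically will produce matching target lists.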
 
 
Let’s look at the driver information under /proc/scsi/bfa/ by viewing (cat) the node information:
 
Chip Revision: Rev-E
Manufacturer: Brocade
Model Description: Brocade-1741
Instance Num: 0
Serial Num: xxxxxxxxx32
Firmware Version: 3.2.3.2
Hardware Version: Rev-E
Bios Version: 3.2.3.2
Optrom Version: 3.2.3.2
Port Count: 2
WWNN: 20:00:74:86:7a:ae:1c:9a
WWPN: 20:01:74:86:7a:ae:1c:9a
Instance num: 0
Target ID: 0 WWPN: 50:06:01:6b:47:20:7b:04
Target ID: 1 WWPN: 50:06:01:6b:47:20:7a:a8
Target ID: 2 WWPN: 50:06:01:63:47:20:7b:04
Target ID: 3 WWPN: 50:06:01:63:47:20:7a:a8
 
Chip Revision: Rev-E
Manufacturer: Brocade
Model Description: Brocade-1741
Instance Num: 1
Serial Num: xxxxxxxxx32
Firmware Version: 3.2.3.2
Hardware Version: Rev-E
Bios Version: 3.2.3.2
Optrom Version: 3.2.3.2
Port Count: 2
WWNN: 20:00:74:86:7a:ae:1c:9c
WWPN: 20:01:74:86:7a:ae:1c:9c
Instance num: 1
Target ID: 0 WWPN: 50:06:01:60:47:24:7b:04
Target ID: 1 WWPN: 50:06:01:68:47:24:7b:04
Target ID: 3 WWPN: 50:06:01:60:47:24:7a:a8
Target ID: 2 WWPN: 50:06:01:68:47:24:7a:a8
 
Chip Revision: Rev-E
Manufacturer: Brocade
Model Description: Brocade-1741
Instance Num: 2
Serial Num: xxxxxxxxx2E
Firmware Version: 3.2.3.2
Hardware Version: Rev-E
Bios Version: 3.2.3.2
Optrom Version: 3.2.3.2
Port Count: 2
WWNN: 20:00:74:86:7a:ae:1c:32
WWPN: 20:01:74:86:7a:ae:1c:32
Instance num: 2
Target ID: 0 WWPN: 50:06:01:6b:47:20:7b:04
Target ID: 1 WWPN: 50:06:01:6b:47:20:7a:a8
Target ID: 2 WWPN: 50:06:01:63:47:20:7b:04
Target ID: 3 WWPN: 50:06:01:63:47:20:7a:a8
 
Chip Revision: Rev-E
Manufacturer: Brocade
Model Description: Brocade-1741
Instance Num: 3
Serial Num: xxxxxxxxx2E
Firmware Version: 3.2.3.2
Hardware Version: Rev-E
Bios Version: 3.2.3.2
Optrom Version: 3.2.3.2
Port Count: 2
WWNN: 20:00:74:86:7a:ae:1c:34
WWPN: 20:01:74:86:7a:ae:1c:34
Instance num: 3
Target ID: 0 WWPN: 50:06:01:60:47:24:7b:04
Target ID: 1 WWPN: 50:06:01:68:47:24:7b:04
Target ID: 2 WWPN: 50:06:01:68:47:24:7a:a8
Target ID: 3 WWPN: 50:06:01:60:47:24:7a:a8
 
So all HBAs are running the same firmware, which is important from a consistency perspective. Had the firmware versions been different, there might have been something to go on, or at least a reason to verify whether there are known issues at that firmware level. Obviously they are all using the same driver as well, since only one is loaded in the kernel.
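
Both points are easy to verify from the host itself; vmkload_mod lists the loaded modules and reports a given driver’s version string:

# vmkload_mod -l | grep bfa
# vmkload_mod -s bfa | grep -i version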
 
We can see, not only from the shared serial numbers above but also from the lspci output, that these are two-port physical cards:
 
# lspci
 
000:007:00.0 Serial bus controller: Brocade Communications Systems, Inc. Brocade-1010/1020/1007/1741 [vmhba0]
000:007:00.1 Serial bus controller: Brocade Communications Systems, Inc. Brocade-1010/1020/1007/1741 [vmhba1]
000:009:00.0 Serial bus controller: Brocade Communications Systems, Inc. Brocade-1010/1020/1007/1741 [vmhba2]
000:009:00.1 Serial bus controller: Brocade Communications Systems, Inc. Brocade-1010/1020/1007/1741 [vmhba3]
 
The first set of numbers is read as Domain:Bus:Slot.Function, so vmhba0 and vmhba1 are both on Domain 0, Bus 7, Slot 0, at functions 0 and 1 respectively, which means they are the two ports of a dual-port HBA.
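
On hosts with a larger number of adapters, that function counting can be automated. A small sketch against the lspci output above, where a count of 2 per Domain:Bus:Slot indicates a dual-port card:

# lspci | grep vmhba | awk '{split($1, a, "."); print a[1]}' | sort | uniq -c
      2 000:007:00     <-- vmhba0 and vmhba1 are one physical card
      2 000:009:00     <-- vmhba2 and vmhba3 are the other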
 
So vmhba0 and vmhba1 are the same physical card, yet only vmhba1 is showing errors. The HBA chips on a dual-port card are mostly independent of each other, so at the very least this tells us there isn’t a problem with the board or circuitry they both share. I say mostly because the physical ports and the HBA chips are independent of each other, yet they do share the same physical board and the same connection to the motherboard.
 
This host is running EMC PowerPath VE, so we know that, in general, the I/O load is distributed evenly across all HBAs and paths. I say in general because PowerPath VE is intelligent enough to use paths that exhibit more errors, or higher latency, less frequently than other paths.
 
I believe we may be looking at either a cable issue (loose, faulty, or a bad GBIC) between vmhba1 and the switch, or a problem with the switch port itself that vmhba1 is connected to. Here is why:
 
1. vmhba1 is seeing thousands upon thousands of errors while the other HBAs are very quiet
2. vmhba1 and vmhba3 are zoned to the exact same targets yet only vmhba1 is seeing errors
3. vmhba0 and vmhba1 are the same physical card yet only vmhba1 is seeing errors
 
My recommendation would be to check the physical switch port error counters and possibly replace the cable to see if the errors subside. It is standard practice to reset the switch counters and then monitor, so it may be necessary to do that to validate that CRC errors or other fabric errors are still occurring.
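
If the fabric happens to be Brocade FOS, for example, that check-and-reset cycle might look like the following (the port number is a placeholder for wherever vmhba1 attaches; other switch vendors have equivalent commands):

switch:admin> porterrshow          <-- summary error counters (crc err, enc out, etc.) for all ports
switch:admin> portstatsshow 1/5    <-- detailed counters for the suspect port
switch:admin> portstatsclear 1/5   <-- zero the counters, then re-check after I/O has flowed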
 
Cheers,
Nathan (twitter account: @vSphereStorage)