Home > Blogs > VMware Support Insider


Dropped Storage Paths Across all Hosts

Today, Nathan Small brings us another log deep-dive, this time for a problem with storage paths being dropped randomly across all hosts. Jumping right in…

 

First we will want to observe when the FC ports from the array timeout across all hosts, mostly to see if this events occurred simultaneously. The logs were collect via remote syslog (best practice when using ESXi 4.x hosts):

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu2:4449)<3> rport-3:0-0: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-2: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-3: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu2:4442)<3> rport-3:0-4: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:31:09-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:21:14.984 cpu18:4446)<3> rport-4:0-0: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:31:09-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:21:14.984 cpu18:4446)<3> rport-4:0-1: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:31:09-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:21:14.984 cpu18:4446)<3> rport-4:0-2: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:31:09-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:21:14.984 cpu18:4446)<3> rport-4:0-3: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:35:26-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:21:57.346 cpu7:4445)<3> rport-3:0-0: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:35:26-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:21:57.346 cpu5:4450)<3> rport-3:0-1: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:35:26-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:21:57.346 cpu5:4450)<3> rport-3:0-2: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:35:26-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:21:57.346 cpu3:4463)<3> rport-3:0-3: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:01:05-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:38:39.231 cpu8:4459)<3> rport-4:0-0: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:01:05-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:38:39.231 cpu8:4459)<3> rport-4:0-1: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:01:05-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:38:39.231 cpu8:4459)<3> rport-4:0-2: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:01:05-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:38:39.231 cpu15:4440)<3> rport-4:0-3: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:29.687 cpu14:4447)<3> rport-3:0-0: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-1: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-2: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-3: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-0: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-1: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-2: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-3: blocked FC remote port time out: saving binding

We can see above that all paths for a given HBA are dropping at the same time, however it affects hosts at different times. Since this issue had occurred multiple times in the past for the customer, we enabled verbose Qlogic driver logging, which would show when the HBA logs back into the targets. The default logging for the Qlogic driver will only show when a device has disappeared. FC ports restored:

<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:52.639 cpu5:4717)scsi(3:1) rport_add: 5006016946e06b58 3a0700

<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:52.764 cpu5:4717)scsi(3:3) rport_add: 5006016946e06b40 b10c00

<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:52.901 cpu6:4717)scsi(3:0) rport_add: 5006016046e06b58 3a0500

<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:52.957 cpu10:4717)scsi(3:2) rport_add: 5006016046e06b40 b10800

<181>1 2012-04-21T15:31:50-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:21:55.834 cpu4:4719)scsi(4:0) rport_add: 5006016046e06b58 3a0500

<181>1 2012-04-21T15:32:10-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:22:15.849 cpu7:4719)scsi(4:2) rport_add: 5006016046e06b40 b10800

<181>1 2012-04-21T15:32:10-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:22:15.925 cpu7:4719)scsi(4:3) rport_add: 5006016946e06b40 b10c00

<181>1 2012-04-21T15:32:11-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:22:16.813 cpu10:4719)scsi(4:1) rport_add: 5006016946e06b58 3a0700

<181>1 2012-04-21T15:36:26-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:58.151 cpu19:4717)scsi(3:0) rport_add: 5006016046e06b58 3a0500

<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:23:18.201 cpu19:4717)scsi(3:2) rport_add: 5006016046e06b40 b10800

<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:23:18.279 cpu19:4717)scsi(3:3) rport_add: 5006016946e06b40 b10c00

<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:23:19.057 cpu15:4717)scsi(3:1) rport_add: 5006016946e06b58 3a0700

<181>1 2012-04-21T16:02:05-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:39:40.071 cpu17:4719)scsi(4:1) rport_add: 5006016946e06b58 3a0700

<181>1 2012-04-21T16:02:05-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:39:40.117 cpu16:4719)scsi(4:2) rport_add: 5006016046e06b40 b10800

<181>1 2012-04-21T16:02:05-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:39:40.195 cpu16:4719)scsi(4:3) rport_add: 5006016946e06b40 b10c00

<181>1 2012-04-21T16:02:06-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:39:40.989 cpu13:4719)scsi(4:0) rport_add: 5006016046e06b58 3a0500

<181>1 2012-04-21T16:07:39-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:10.470 cpu20:4717)scsi(3:0) rport_add: 5006016046e06b58 3a0500

<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:12.630 cpu12:4717)scsi(3:2) rport_add: 5006016046e06b40 b10800

<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:12.707 cpu18:4717)scsi(3:3) rport_add: 5006016946e06b40 b10c00

<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:12.969 cpu18:4717)scsi(3:1) rport_add: 5006016946e06b58 3a0700

<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:04.781 cpu9:4718)scsi(4:0) rport_add: 5006016146e06b58 190600

<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:04.844 cpu6:4718)scsi(4:1) rport_add: 5006016846e06b58 190700

<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:04.929 cpu5:4718)scsi(4:2) rport_add: 5006016846e06b40 d30700

<181>1 2012-04-21T16:10:29-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:05.013 cpu4:4718)scsi(4:3) rport_add: 5006016146e06b40 d30900

Let's take a closer look at the ESX105 host outage:

<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:21:47.346 cpu22:4118)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:21:47.346 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:21:47.349 cpu22:4118)<6>qla2xxx 0000:04:00.6: scsi(3:2:66) Port down status: port-state=0×3

<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:21:47.349 cpu22:4118)<6>qla2xxx 0000:04:00.6: scsi(3:2:66): FCP command status: 0×28-0×0 (0×20000) portid=b10800 oxid=0×806 ser=0x1728f030 cdb=120000 len=0xff rsp_info=0×0 resid=0×0 fw_resid=0×0

<181>1 2012-04-21T15:35:46-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:17.577 cpu22:4118)qla24xx_report_id_acquisition:format 1: scsi(3): VP[0] enabled – status 0 – with port id 45000f

<181>1 2012-04-21T15:35:46-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:17.587 cpu22:4118)scsi(3): Asynchronous PORT UPDATE.

<181>1 2012-04-21T15:35:46-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:17.587 cpu22:4118)<6>scsi(3): Port database changed 0001 0007 1700.

<181>1 2012-04-21T15:35:46-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:17.976 cpu14:4717)scsi(3): qla2x00_reset_marker()

<181>1 2012-04-21T15:35:46-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:17.976 cpu14:4717)scsi(3): qla2x00_loop_resync()

<181>1 2012-04-21T15:35:47-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:18.574 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0002/0007/0b00.

<181>1 2012-04-21T15:35:47-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:18.876 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0002/0007/0b00.

<181>1 2012-04-21T15:35:48-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:19.188 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/0b00.

<181>1 2012-04-21T15:35:56-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:27.605 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/0b00.

<181>1 2012-04-21T15:35:58-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:29.602 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0002/0007/0b00.

Physical connection comes back:

<181>1 2012-04-21T15:36:26-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:57.558 cpu22:4118)scsi(3): Asynchronous LOOP UP (10 Gbps).

<181>1 2012-04-21T15:36:26-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:57.558 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).

<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:23:18.201 cpu19:4717)scsi(3:2) rport_add: 5006016046e06b40 b10800

<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:23:18.279 cpu19:4717)scsi(3): LOOP READY

<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:23:18.279 cpu19:4717)scsi(3): qla2x00_loop_resync – end

Second drop occurs:

<181>1 2012-04-21T16:06:48-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:19.685 cpu22:16844)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T16:06:48-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:19.685 cpu22:16844)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T16:06:52-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:23.897 cpu3:4193)scsi(3): fcport-0 – port retry count: 0 remaining

<181>1 2012-04-21T16:06:52-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:23.897 cpu3:4193)scsi(3): fcport-1 – port retry count: 0 remaining

<181>1 2012-04-21T16:06:52-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:23.897 cpu3:4193)scsi(3): fcport-2 – port retry count: 0 remaining

<181>1 2012-04-21T16:06:52-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:23.897 cpu3:4193)scsi(3): fcport-3 – port retry count: 0 remaining

<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:29.687 cpu14:4447)<3> rport-3:0-0: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-1: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-2: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:06:58-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:29.687 cpu21:4461)<3> rport-3:0-3: blocked FC remote port time out: saving binding

Device is visible 2 seconds later:

<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:49.759 cpu22:4118)scsi(3): DCBX Completed — 0000 0000 0000

<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:49.917 cpu22:4118)qla24xx_report_id_acquisition:format 1: scsi(3): VP[0] enabled – status 0 – with port id 45000f

<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:49.925 cpu3:4099)scsi(3): Loop Down – aborting the queues before time expire

<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:49.925 cpu17:4717)scsi(3): qla2x00_reset_marker()

<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:49.926 cpu22:4118)scsi(3): Asynchronous PORT UPDATE.

<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:49.926 cpu22:4118)<6>scsi(3): Port database changed 0002 0007 1700.

<181>1 2012-04-21T16:07:18-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:49.967 cpu22:16606)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/1700.

<181>1 2012-04-21T16:07:19-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:50.926 cpu21:4717)scsi(3): qla2x00_loop_resync()

<181>1 2012-04-21T16:07:21-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:52.205 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored 0002/0007/0b00.

Then connection is established again:

<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:09.892 cpu22:4118)scsi(3): Asynchronous LOOP UP (10 Gbps).

<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:09.892 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).

<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:09.893 cpu22:4118)scsi(3): Asynchronous PORT UPDATE ignored ffff/0006/0000.

<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:10.018 cpu14:4717)scsi(3): F/W Ready – OK

<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:12.707 cpu18:4717)scsi(3): LOOP READY

<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:12.707 cpu18:4717)scsi(3): qla2x00_loop_resync – end

<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:12.947 cpu18:4717)scsi(3): qla2x00_port_login()

Let's compare with another host:

<181>1 2012-04-21T15:16:41-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:17.726 cpu2:4098)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T15:16:41-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:17.726 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

Port timeout:

<181>1 2012-04-21T15:16:45-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:21.784 cpu2:14604)scsi(3): fcport-0 – port retry count: 0 remaining

<181>1 2012-04-21T15:16:45-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:21.784 cpu2:14604)scsi(3): fcport-2 – port retry count: 0 remaining

<181>1 2012-04-21T15:16:45-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:21.784 cpu2:14604)scsi(3): fcport-3 – port retry count: 0 remaining

<181>1 2012-04-21T15:16:45-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:21.784 cpu2:14604)scsi(3): fcport-4 – port retry count: 0 remaining

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu2:4449)<3> rport-3:0-0: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-2: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-3: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu2:4442)<3> rport-3:0-4: blocked FC remote port time out: saving binding

Loop up 10 seconds later:

<181>1 2012-04-21T15:17:11-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:47.988 cpu2:4098)scsi(3): Asynchronous PORT UPDATE.

<181>1 2012-04-21T15:17:11-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:47.988 cpu2:4098)<6>scsi(3): Port database changed 0000 0007 1700.

<181>1 2012-04-21T15:17:12-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:48.799 cpu11:4717)scsi(3): qla2x00_reset_marker()

<181>1 2012-04-21T15:17:12-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:48.799 cpu11:4717)scsi(3): qla2x00_loop_resync()

<181>1 2012-04-21T15:17:12-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:48.971 cpu2:3615303)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/0b00.

<181>1 2012-04-21T15:17:14-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:50.363 cpu2:14604)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/0b00.

<181>1 2012-04-21T15:17:22-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:58.986 cpu2:14606)scsi(3): Asynchronous PORT UPDATE ignored 0000/0007/0b00.

<181>1 2012-04-21T15:17:23-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:59.971 cpu2:4098)scsi(3): Asynchronous PORT UPDATE ignored 0001/0007/0b00.

<181>1 2012-04-21T15:17:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:27.962 cpu2:4098)scsi(3): Asynchronous LOOP UP (10 Gbps).

<181>1 2012-04-21T15:17:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:27.962 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).

If we look at this from just a LOOP perspective:

<181>1 2012-04-21T15:16:41-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:17.726 cpu2:4098)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T15:16:41-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:17.726 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T15:17:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:27.962 cpu2:4098)scsi(3): Asynchronous LOOP UP (10 Gbps).

<181>1 2012-04-21T15:17:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:27.962 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).

<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:52.764 cpu2:4717)scsi(3): LOOP READY

LOOP detected down at 15:16:41 and then detected as up again 70 seconds later, however the LIP (Loop Initialization) did not complete until 25 seconds after that. Now let us look at the next host:

<181>1 2012-04-21T15:30:59-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:21:04.984 cpu21:4117)scsi(4): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T15:30:59-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:21:04.984 cpu21:4117)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T15:31:49-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:21:55.278 cpu21:4117)scsi(4): Asynchronous LOOP UP (10 Gbps).

<181>1 2012-04-21T15:31:49-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:21:55.278 cpu21:4117)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).

<181>1 2012-04-21T15:32:10-04:00 ESX003.local vmkernel – - – vmkernel: 24:00:22:15.926 cpu7:4719)scsi(4): LOOP READY

LOOP detected down at 15:30:59, which isn't the same time as the first host. The loop is up again 50 seconds later with the LIP finishing 20 seconds later. Onto the next host:

<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:21:47.346 cpu22:4118)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T15:35:16-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:21:47.346 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T15:36:26-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:57.558 cpu22:4118)scsi(3): Asynchronous LOOP UP (10 Gbps).

<181>1 2012-04-21T15:36:26-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:22:57.558 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).

<181>1 2012-04-21T15:36:47-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:23:18.279 cpu19:4717)scsi(3): LOOP READY

LOOP detected down at 15:35:16, yet again not the same as either host. The loop is up again 70 seconds later with the LIP finishing 20 seconds later. Next host:

<181>1 2012-04-21T16:00:55-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:38:29.229 cpu6:4102)scsi(4): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T16:00:55-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:38:29.229 cpu6:4102)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T16:01:45-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:39:19.519 cpu6:14454)scsi(4): Asynchronous LOOP UP (10 Gbps).

<181>1 2012-04-21T16:01:45-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:39:19.519 cpu6:14454)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).

<181>1 2012-04-21T16:02:05-04:00 ESX101.local vmkernel – - – vmkernel: 23:20:39:40.195 cpu15:4719)scsi(4): LOOP READY

LOOP detected down at 16:00:55, still different from every other host. The loop is up again 50 seconds later with the LIP finishing 25 seconds later. Onto the last outage:

<181>1 2012-04-21T16:06:48-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:19.685 cpu22:16844)scsi(3): Asynchronous LOOP DOWN mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T16:06:48-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:53:19.685 cpu22:16844)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:09.892 cpu22:4118)scsi(3): Asynchronous LOOP UP (10 Gbps).

<181>1 2012-04-21T16:07:38-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:09.892 cpu22:4118)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).

<181>1 2012-04-21T16:07:41-04:00 ESX105.local vmkernel – - – vmkernel: 23:01:54:12.707 cpu18:4717)scsi(3): LOOP READY

LOOP detected down at 16:06:48, LOOP up 50 seconds later with the LIP finishing in only 3 seconds.

Let's take a closer look at ESX006, since it lost paths to both HBAs, though at different times:

All paths to the first HBA drop:

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu2:4449)<3> rport-3:0-0: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-2: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu10:4455)<3> rport-3:0-3: blocked FC remote port time out: saving binding

<181>1 2012-04-21T15:16:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:27.727 cpu2:4442)<3> rport-3:0-4: blocked FC remote port time out: saving binding

All paths to the second HBA drop:

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-0: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-1: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-2: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-3: blocked FC remote port time out: saving binding

Why did this host observe path failures to the second HBA? No other host has issues with both HBAs.

Let's breakdown when the paths came back:

<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:52.639 cpu5:4717)scsi(3:1) rport_add: 5006016946e06b58 3a0700

<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:52.764 cpu5:4717)scsi(3:3) rport_add: 5006016946e06b40 b10c00

<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:52.901 cpu6:4717)scsi(3:0) rport_add: 5006016046e06b58 3a0500

<181>1 2012-04-21T15:18:16-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:52.957 cpu10:4717)scsi(3:2) rport_add: 5006016046e06b40 b10800

<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:04.781 cpu9:4718)scsi(4:0) rport_add: 5006016146e06b58 190600

<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:04.844 cpu6:4718)scsi(4:1) rport_add: 5006016846e06b58 190700

<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:04.929 cpu5:4718)scsi(4:2) rport_add: 5006016846e06b40 d30700

<181>1 2012-04-21T16:10:29-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:05.013 cpu4:4718)scsi(4:3) rport_add: 5006016146e06b40 d30900

Interestingly enough, we see another type of device drop its connection around the same time, but the connection is restored at the EXACT same time:

<14>1 2012-04-21T16:10:06-04:00 ESX006.local vobd – - – vobd: Apr 21 16:10:06.896: 2065962831343us: [vob.net.vmnic.linkstate.down] vmnic vmnic3 linkstate down.

<14>1 2012-04-21T16:10:27-04:00 ESX006.local vobd – - – vobd: Apr 21 16:10:27.936: 2065983869689us: [vob.net.vmnic.linkstate.up] vmnic vmnic3 linkstate up.

Is it just a coincidence that the problem suddenly spread to a vmnic instead of staying isolated to vmhbas? No, it is not. The HBAs and NICs in these hosts are the same physical device:

# cat /proc/scsi/qla2xxx/3

QLogic PCI-Express Single Port 10Gb CNA for QLE8240:

        FC Firmware version 4.09.34 (804), Driver version 841.k1.34.1-1vmw

Host Device Name vmhba2

This is a CNA device which is capable of being not only an FCoE HBA but an H/W iSCSI HBA as well:

# esxcfg-nics -l

Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description

vmnic2  0000:04:00.00 qlcnic      Up   10000Mbps Full   00:0e:1e:05:0a:e8 1500   QLogic Corp QLogic QLE8240 10GbE Single Channel CNA

vmnic3  0000:06:00.00 qlcnic      Up   10000Mbps Full   00:0e:1e:05:0c:90 1500   QLogic Corp QLogic QLE8240 10GbE Single Channel CNA

Let's compare the PCI device outputs for the vmhba:

<181>1 2012-04-21T15:16:41-04:00 ESX006.local vmkernel – - – vmkernel: 23:20:59:17.726 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP DOWN detected mbx1=10h mbx2=0h mbx3=0h mbx7=0h.

<181>1 2012-04-21T15:17:51-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:00:27.962 cpu2:4098)<6>qla2xxx 0000:04:00.6: LOOP UP detected (10 Gbps).

<181>1 2012-04-21T16:10:28-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:03.966 cpu11:4718)<6>qla2xxx 0000:06:00.6: LOOP UP detected (10 Gbps).

This tells us that vmnic2 and vmhba2 are the same physical card (04:00.6 = bus 04, slot 00, function 6), while vmnic3 and vmhba3 are also the same card (06:00.6 = bus 06, slot 00, function 6).

Now, why was there no LOOP DOWN event for vmhba3 on this host? We clearly see a LOOP UP event, so what gives? Well, to answer that question we need to investigate the log messages surrounding the 'rport' messages, which indicate that the array controller port timed out:

Rport messages (again):

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-0: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-1: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-2: blocked FC remote port time out: saving binding

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.078 cpu5:4449)<3> rport-4:0-3: blocked FC remote port time out: saving binding

Therefore the logging frame of reference we need is at the 2012-04-21T16:10:17 time period:

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.026 cpu0:4454)<6>qlcnic 0000:06:00.0: Loading firmware from flash

This is the Qlogic CNA driver loading the firmware from flash for the same PCI device that we see a LOOP UP on. Let's grep for "qlcnic" for more information. The qlcnic is the Qlogic CNA driver:

<181>1 2012-04-21T16:10:06-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:42.441 cpu11:4450)<6>qlcnic 0000:06:00.0: Changed the adapter dev_state to NEED_RESET.

<181>1 2012-04-21T16:10:06-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:42.441 cpu11:4450)<6>qlcnic 0000:06:00.0: Adapter not in operational state, resetting adapter.

<181>1 2012-04-21T16:10:06-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:42.441 cpu17:4445)<6>qlcnic 0000:06:00.0: Disabled bus mastering.

<181>1 2012-04-21T16:10:06-04:00 ESX006.local vmkernel – - – vmkernel: <6>qlcnic 0000:06:00.0: PEG_HALT_STATUS1: 0×40006700, PEG_HALT_STATUS2: 0×662240,

<181>1 2012-04-21T16:10:08-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:44.445 cpu10:4454)<6>qlcnic 0000:06:00.0: Adapter reset acked by reset owner.

<181>1 2012-04-21T16:10:08-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:44.445 cpu10:4454)<6>qlcnic 0000:06:00.0: Changed adapter state to DEV_INITIALIZING.

<181>1 2012-04-21T16:10:11-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:47.520 cpu10:4454)<6>qlcnic 0000:06:00.0: Successfully collected firmware minidump.

<181>1 2012-04-21T16:10:17-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:52:53.026 cpu0:4454)<6>qlcnic 0000:06:00.0: Loading firmware from flash

<181>1 2012-04-21T16:10:25-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:01.806 cpu9:4454)<6>qlcnic 0000:06:00.0: Adapter state set to DEV_READY.

<181>1 2012-04-21T16:10:26-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:02.183 cpu8:4454)<6>qlcnic 0000:06:00.0: Dual XGb SFP+ LP, Board S/N: SFP_P3P, Chip rev: 0×54

<181>1 2012-04-21T16:10:26-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:02.183 cpu8:4454)<6>qlcnic 0000:06:00.0: Firmware v4.9.34 [legacy]

<181>1 2012-04-21T16:10:26-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:02.184 cpu8:4454)<6>qlcnic 0000:06:00.0: Re enabled bus mastering.

<181>1 2012-04-21T16:10:27-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:03.456 cpu16:4112)<6>qlcnic 0000:06:00.0: vmnic3 NIC Link is up

<181>1 2012-04-21T16:10:32-04:00 ESX006.local vmkernel – - – vmkernel: 23:21:53:08.871 cpu11:4145)<3>qlcnic_netq_ops: OP 16 not supported

Typically this behavior is observed with faulty hardware, but it isn’t impossible for this to be the result of a firmware bug or incompatibility with the switch they are connected with. When reviewing past incidences for this behavior in the logs, we can see that this same event occurred on other hosts:

<181>1 2012-04-07T06:07:32-04:00 ESX001.local vmkernel – - – vmkernel: 8:13:30:24.940 cpu16:4447)<6>qlcnic 0000:04:00.0: Changed the adapter dev_state to NEED_RESET.

<181>1 2012-04-07T06:07:32-04:00 ESX001.local vmkernel – - – vmkernel: 8:13:30:24.940 cpu16:4447)<6>qlcnic 0000:04:00.0: Adapter not in operational state, resetting adapter.

<181>1 2012-04-02T16:20:13-04:00 ESX106.local vmkernel – - – vmkernel: 40:22:32:23.462 cpu2:9060083)<6>qlcnic 0000:06:00.0: Changed the adapter dev_state to NEED_RESET.

<181>1 2012-04-02T16:20:13-04:00 ESX106.local vmkernel – - – vmkernel: 40:22:32:23.464 cpu2:9060083)<6>qlcnic 0000:06:00.0: Adapter not in operational state, resetting adapter.

Since we aren't positive why this is occurring, it is recommended to run the latest driver from QLogic to rule out old driver/firmware issues. It is important for us to confirm the issue still occurs on the latest firmware and driver, particularly if we need to involve the engineering team from the HBA vendor. This customer is currently running Driver version 841.k1.34.1-1vmw and 911.k1.1-19vmw is available on the QLogic site as of today, April 26th 2012.

http://driverdownloads.qlogic.com/QLogicDriverDownloads_UI/NewSearchResult.aspx?resourceid=MgA4ADAANwAwAA==&Search=&productid=MQAxADUANgA=&CategoryID=NAA5AA==&OperatingID=MQA2ADcA

Usually as a best practice we recommend running the latest firmware, as firmware updates, just like driver updates, contain bug fixes. We say usually since new code branches of firmware can sometimes come with a new set of bugs, so waiting until you a more mature release of that new code branch. A new code branch is define when you go from something like a v1.67 release to a v2.00 release. Obviously there are exceptions to the rule, but this generally applies.

Given the fact that multiple hosts are seeing this behavior at random times, it isn't impossible for there to be a bad batch of cards. When this is suspected to be the case, reviewing the serial numbers of the cards may reveal if they were from the same batch. In the case of this customer, the serial numbers of the cards were pretty close to being sequential. The HBA vendor confirmed that there may have been a very, very tiny batch of cards with bad firmware, so the affected HBAs were RMA'ed and replaced.