Multiple Storage Path Failures and Network Drops

Here's our fourth log deep-dive from Nathan Small (Twitter handle: vSphereStorage). If you missed the first one, here it is.

History of the issue

A customer experienced multiple storage path failures and network drops in their environment.

There are quite a few instances of the array controller ports timing out to the adapter. When this occurs, it is observed on multiple hosts at the exact same time. A quick way to check for this on the ESX hosts is to grep for ‘rport’ in the vmkernel log (e.g. cat /var/log/vmkernel |grep rport):

Apr  6 19:07:20 esx1 vmkernel: 11:09:46:03.193 cpu5:4298)<3> rport-3:0-10: blocked FC remote port time out: saving binding
Apr  6 19:07:20 esx1 vmkernel: 11:09:46:03.193 cpu5:4298)<3> rport-3:0-11: blocked FC remote port time out: saving binding
Apr  6 19:07:20 esx3 vmkernel: 10:09:10:17.967 cpu20:4290)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr  6 19:07:20 esx3 vmkernel: 10:09:10:17.967 cpu5:4292)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr  6 19:07:20 esx4 vmkernel: 10:07:51:54.795 cpu15:4300)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr  6 19:07:20 esx4 vmkernel: 10:07:51:54.795 cpu15:4300)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr  6 19:07:20 esx8 vmkernel: 9:09:41:37.103 cpu0:4300)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr  6 19:07:20 esx8 vmkernel: 9:09:41:37.103 cpu12:4294)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr  6 19:07:22 esx2 vmkernel: 10:10:35:21.466 cpu20:4301)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr  6 19:07:22 esx2 vmkernel: 10:10:35:21.466 cpu20:4301)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr  6 19:19:14 esx1 vmkernel: 11:09:57:57.228 cpu4:4300)<3> rport-2:0-13: blocked FC remote port time out: saving binding
Apr  6 19:19:14 esx1 vmkernel: 11:09:57:57.228 cpu6:4295)<3> rport-2:0-12: blocked FC remote port time out: saving binding
Apr  6 19:19:14 esx2 vmkernel: 10:10:47:15.993 cpu19:4292)<3> rport-2:0-2: blocked FC remote port time out: saving binding
Apr  6 19:19:14 esx2 vmkernel: 10:10:47:15.993 cpu19:4292)<3> rport-2:0-3: blocked FC remote port time out: saving binding
Apr  6 19:19:14 esx8 vmkernel: 9:09:53:31.612 cpu2:4296)<3> rport-2:0-2: blocked FC remote port time out: saving binding
Apr  6 19:19:14 esx8 vmkernel: 9:09:53:31.612 cpu2:4296)<3> rport-2:0-3: blocked FC remote port time out: saving binding
Apr  6 19:19:15 esx3 vmkernel: 10:09:22:12.469 cpu10:4291)<3> rport-2:0-2: blocked FC remote port time out: saving binding
Apr  6 19:19:15 esx3 vmkernel: 10:09:22:12.469 cpu10:4291)<3> rport-2:0-3: blocked FC remote port time out: saving binding
Apr  6 19:19:15 esx4 vmkernel: 10:08:03:49.296 cpu17:4280)<3> rport-2:0-2: blocked FC remote port time out: saving binding
Apr  6 19:19:15 esx4 vmkernel: 10:08:03:49.296 cpu17:4280)<3> rport-2:0-3: blocked FC remote port time out: saving binding
Apr  6 19:28:56 esx1 vmkernel: 11:10:07:39.282 cpu3:4292)<3> rport-3:0-10: blocked FC remote port time out: saving binding
Apr  6 19:28:56 esx1 vmkernel: 11:10:07:39.282 cpu5:4288)<3> rport-3:0-11: blocked FC remote port time out: saving binding
Apr  6 19:28:56 esx2 vmkernel: 10:10:56:57.549 cpu19:4281)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr  6 19:28:56 esx2 vmkernel: 10:10:56:57.549 cpu19:4281)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr  6 19:28:56 esx3 vmkernel: 10:09:31:53.981 cpu9:4291)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr  6 19:28:56 esx3 vmkernel: 10:09:31:53.981 cpu9:4291)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr  6 19:28:56 esx4 vmkernel: 10:08:13:30.816 cpu19:4295)<3> rport-3:0-3: blocked FC remote port time out: saving binding
Apr  6 19:28:56 esx4 vmkernel: 10:08:13:30.816 cpu2:4296)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr  6 19:28:56 esx8 vmkernel: 9:10:03:13.149 cpu21:4280)<3> rport-3:0-2: blocked FC remote port time out: saving binding
Apr  6 19:28:56 esx8 vmkernel: 9:10:03:13.149 cpu5:4289)<3> rport-3:0-3: blocked FC remote port time out: saving binding

Note: The ‘rport’ messages are only observed when an array controller port times out, not when it is visible again.

The “blocked FC remote port time out” message means that the connection to the array controller port timed out. The paths are marked as dead, but the “binding” information is saved in case the device ever comes back on the fabric, at which point the connection is reestablished and the paths change state from “dead” to “on”. These path transitions can also be observed under /var/log/messages.
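To follow those transitions across the hosts in a collected support bundle, a broad grep is usually enough. The exact wording of the path-state messages varies between ESX releases, so treat this as a rough first pass that simply keys off the word 'dead':

# grep -i dead vm-support-esx*/var/log/messages* |less
# grep -i dead vm-support-esx*/var/log/vmkernel* |less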

The above behavior is very concerning since it affects multiple hosts simultaneously. An inspection of the underlying server hardware reveals that these are HP blades, which may explain some of this behavior since they can share the same chassis infrastructure:

From dmidecode:

System Information
        Manufacturer: HP
        Product Name: ProLiant BL460c G7

BIOS Information
        Vendor: HP
        Version: I27
        Release Date: 05/05/2011
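For reference, this hardware information can be pulled from the service console with dmidecode. On builds where the -t option is supported the relevant sections can be requested directly; otherwise grepping the full dump works just as well:

# dmidecode -t system -t bios
# dmidecode | grep -A 3 -E "System Information|BIOS Information"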

Since this customer is complaining about network drops as well, we need to take a closer look at the messages that surround the ‘rport’ messages. Let's start with the April 6th @ 19:07 outage that affected all hosts:

# cat vm-support-esx*/var/log/vmkernel*|grep "Apr  6 19:0"|grep -i link
Apr  6 19:02:20 esx1 vmkernel: 11:09:41:02.701 cpu9:4317)<3>lpfc820 0000:06:00.3: 3:2718 Clear Virtual Link Received for VPI 0x100 tag 0xa
Apr  6 19:07:10 esx1 vmkernel: 11:09:45:53.191 cpu8:4104)vmnic7 : 06:00.1 Link Down
Apr  6 19:07:10 esx1 vmkernel: 11:09:45:53.191 cpu3:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event xd received Data: xd x20 x800100 x0 x0
Apr  6 19:07:21 esx2 vmkernel: 10:10:30:21.179 cpu20:4317)<3>lpfc820 0000:06:00.3: 3:2718 Clear Virtual Link Received for VPI 0x100 tag 0x4
Apr  6 19:07:22 esx2 vmkernel: 10:10:35:11.465 cpu20:4116)vmnic7 : 06:00.1 Link Down
Apr  6 19:07:22 esx2 vmkernel: 10:10:35:11.465 cpu12:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event x7 received Data: x7 x20 x800100 x0 x0
Apr  6 19:07:10 esx3 vmkernel: 10:09:10:07.966 cpu4:5117)vmnic7 : 06:00.1 Link Down
Apr  6 19:07:10 esx3 vmkernel: 10:09:10:07.966 cpu5:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event x3 received Data: x3 x20 x800110 x0 x0
Apr  6 19:07:10 esx4 vmkernel: 10:07:51:44.794 cpu18:4114)vmnic7 : 06:00.1 Link Down
Apr  6 19:07:10 esx4 vmkernel: 10:07:51:44.794 cpu19:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event x3 received Data: x3 x20 x800110 x0 x0
Apr  6 19:07:13 esx8 vmkernel: 9:09:36:36.881 cpu8:4317)<3>lpfc820 0000:06:00.3: 3:2718 Clear Virtual Link Received for VPI 0x100 tag 0x2
Apr  6 19:07:14 esx8 vmkernel: 9:09:41:27.103 cpu2:5449)vmnic7 : 06:00.1 Link Down
Apr  6 19:07:14 esx8 vmkernel: 9:09:41:27.103 cpu9:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event x5 received Data: x5 x20 x800100 x0 x0
 
We can see that not only are “lpfc” devices dropping, but vmnic7 is dropping too. These devices drop at the exact same time, and across multiple hosts at the same time as well. The prefix ‘lpfc’ stands for LightPulse Fibre Channel, which identifies this as an Emulex device. At this point we can assume that since both devices drop simultaneously, they are physically connected to the same device, perhaps a switch that supports both Ethernet and Fibre Channel/Fibre Channel over Ethernet (FCoE).
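Before digging further, it is worth confirming which drivers and PCI devices sit behind the vmhba and vmnic names. On an ESX 4.x service console the following commands list the storage adapters and the NICs along with their drivers and PCI addresses (the output layout varies slightly between builds, so treat this as a rough guide):

# esxcfg-scsidevs -a
# esxcfg-nics -l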

Let’s have a look at the next outage to see if the same behavior was observed:

# cat vm-support-esx*/var/log/vmkernel*|grep "Apr  6 19:28"|grep -i link
Apr  6 19:28:46 esx1 vmkernel: 11:10:07:29.281 cpu4:8170)vmnic7 : 06:00.1 Link Down
Apr  6 19:28:46 esx1 vmkernel: 11:10:07:29.281 cpu9:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event x14 received Data: x14 x20 x800110 x0 x0
Apr  6 19:28:46 esx2 vmkernel: 10:10:56:47.548 cpu20:4546)vmnic7 : 06:00.1 Link Down
Apr  6 19:28:46 esx2 vmkernel: 10:10:56:47.548 cpu14:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event xe received Data: xe x20 x800110 x0 x0
Apr  6 19:28:46 esx3 vmkernel: 10:09:31:43.981 cpu10:5119)vmnic7 : 06:00.1 Link Down
Apr  6 19:28:46 esx3 vmkernel: 10:09:31:43.981 cpu3:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event xa received Data: xa x20 x800110 x0 x0
Apr  6 19:28:46 esx4 vmkernel: 10:08:13:20.816 cpu2:4098)vmnic7 : 06:00.1 Link Down
Apr  6 19:28:46 esx4 vmkernel: 10:08:13:20.816 cpu13:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event xa received Data: xa x20 x800110 x0 x0
Apr  6 19:28:46 esx8 vmkernel: 9:10:03:03.148 cpu6:4102)vmnic7 : 06:00.1 Link Down
Apr  6 19:28:46 esx8 vmkernel: 9:10:03:03.148 cpu7:4317)<3>lpfc820 0000:06:00.3: 3:1305 Link Down Event xc received Data: xc x20 x800110 x0 x0
 
In every outage, the same devices were affected: "lpfc820 0000:06:00.3" and "vmnic7 : 06:00.1", across all hosts simultaneously. The numbering in these device names translates to bus:slot.function, so the FC HBA is bus 6, slot 0, function 3, while the vmnic is bus 6, slot 0, function 1. The only way an FC device and a NIC device can share the same bus and slot is if the PCI card hosting them is a CNA (Converged Network Adapter), which is capable of presenting multiple functions. A review of the driver proc node /proc/scsi/lpfc820 confirms that this HBA is indeed a CNA:

# cat /proc/scsi/lpfc820/3
Emulex OneConnect FCoE SCSI 8.2.1.90.28
HP NC553m 2P FlexFabric Adptr on PCI bus 0000:06 device 02 irq 49 port 1 Logical Link Speed: 4000 Mbps
BoardNum: 2
ESX Adapter: vmhba3

Functions 0 and 2 of this PCI device are also in use: function 0 is another NIC and function 2 is another HBA.
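If there were any doubt, listing everything at that PCI address shows all of the functions exposed by the one card. Assuming lspci is available in the service console, a quick check would look like this:

# lspci | grep "06:00"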

In the end, the root cause of this issue wasn’t a faulty switch but a faulty HP Virtual Connect Module (VCM). The other NIC and HBA weren’t affected by these outages because they were connected to a different VCM. Once the faulty VCM was replaced, the FC path failures and network drops stopped.