So I wanted to get this blog post out sooner rather than later, as the issue it describes may affect a significant number of customers. In a nutshell: if you perform array maintenance that requires rebooting a storage controller, the probability of a successful path failover is low. The root cause is stale entries in the Fibre Channel name server on Cisco MDS switches running NX-OS 7.3, which is a rather new code release. As the title suggests, this only affects FCoE HBAs, specifically those that rely on our libfc/libfcoe stack for FCoE connectivity. Such HBAs would be Cisco fnic HBAs as well as a handful of Emulex FCoE HBAs and a couple of others.
Here is an example of a successful path failover after receiving an RSCN (Registered State Change Notification) from the array controller after performing a reboot:
```
2016-07-07T17:36:34.230Z cpu17:33461)<6>host4: disc: Received an RSCN event
2016-07-07T17:36:34.230Z cpu17:33461)<6>host4: disc: Port address format for port (e50800)
2016-07-07T17:36:34.230Z cpu17:33461)<6>host4: disc: RSCN received: not rediscovering. redisc 0 state 9 in_prog 0
2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: disc: GPN_ID rejected reason 9 exp 1
2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: rport e50800: Remove port
2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: rport e50800: Port entered LOGO state from Ready state
2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: rport e50800: Delete port
2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: work event 3
2016-07-07T17:36:34.231Z cpu54:33448)<7>fnic : 4 :: fnic_rport_exch_reset called portid 0xe50800
2016-07-07T17:36:34.231Z cpu54:33448)<7>fnic : 4 :: fnic_rport_reset_exch: Issuing abts
2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: Received a LOGO response closed
2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: Received a LOGO response, but in state Delete
2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: work delete
```
Here is a breakdown of what you just read:
- RSCN is received from the array controller
- Discovery is now in state 9
- GPN_ID (Get Port Name by ID) is issued to the switch but is rejected with reason code 9 (see http://lists.open-fcoe.org/pipermail/fcoe-devel/2009-June/002828.html)
- LibFC begins to remove the port information on the host
- Port enters LOGO (Logout) state from previous state, which was Ready
- LibFC deletes the port information
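The decision libfc makes after the RSCN can be modeled as follows. This is an illustrative Python sketch of the behavior described above, not the actual libfc C code; the function and state names are my own simplifications.

```python
# Illustrative model of how libfc reacts to an RSCN for a single rport.
# Not the real C implementation; names and states are simplified.

def handle_rscn(name_server_has_port: bool) -> list:
    """Return the sequence of rport states after an RSCN arrives."""
    states = ["Ready"]
    # libfc validates the RSCN by querying the name server (GPN_ID)
    # instead of trusting the RSCN alone.
    if not name_server_has_port:
        # GPN_ID rejected: the port really left the fabric, so the
        # rport is logged out and deleted, allowing NMP to fail over.
        states += ["LOGO", "Delete"]
    else:
        # GPN_ID accepted: libfc assumes the port is still present
        # and probes it with ADISC instead of tearing it down.
        states += ["ADISC"]
    return states
```

In the healthy log above, the name server query is rejected, so the rport moves through `LOGO` to `Delete`.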
After this, the ESX host fails over to the other available ports, which are on the peer SP:
```
2016-07-07T17:36:44.233Z cpu33:33459)<3> rport-4:0-1: blocked FC remote port time out: saving binding
2016-07-07T17:36:44.233Z cpu55:33473)<7>fnic : 4 :: fnic_terminate_rport_io called wwpn 0x524a937aeb740513, wwnn0xffffffffffffffff, rport 0x0x4309b72f3c50, portid 0xffffffff
2016-07-07T17:36:44.257Z cpu52:33320)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x2a (0x43a659d15bc0, 36277) to dev "naa.624a93704d1296f5972642ea0001101c" on path "vmhba3:C0:T0:L1" Failed: H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:FAILOVER
```
A Host status of H:0x1 means NO_CONNECT, hence the failover.
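If you want to decode that triplet yourself, a minimal sketch follows. Only the host-status field is decoded, and only a few common code values are listed; the table beyond `0x0`/`0x1` is my assumption, not taken from this log.

```python
import re

# Minimal decoder for the "H:0x1 D:0x0 P:0x0" status triplet that NMP
# logs on a failed command. Table is illustrative, not exhaustive.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",  # no connection to the target -> path failover
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
}

def decode_host_status(line: str) -> str:
    """Pull the H: field out of an NMP failure line and name it."""
    m = re.search(r"H:0x([0-9a-fA-F]+)", line)
    code = int(m.group(1), 16)
    return HOST_STATUS.get(code, "UNKNOWN(0x%x)" % code)
```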
Now here is an example of the same operation on a Cisco MDS switch running NX-OS 7.3 when a storage controller on the array is rebooted:
```
2016-07-14T19:02:03.551Z cpu47:33448)<6>host2: disc: Received an RSCN event
2016-07-14T19:02:03.551Z cpu47:33448)<6>host2: disc: Port address format for port (e50900)
2016-07-14T19:02:03.551Z cpu47:33448)<6>host2: disc: RSCN received: not rediscovering. redisc 0 state 9 in_prog 0
2016-07-14T19:02:03.557Z cpu47:33444)<6>host2: rport e50900: ADISC port
2016-07-14T19:02:03.557Z cpu47:33444)<6>host2: rport e50900: sending ADISC from Ready state
2016-07-14T19:02:23.558Z cpu47:33448)<6>host2: rport e50900: Received a ADISC response
2016-07-14T19:02:23.558Z cpu47:33448)<6>host2: rport e50900: Error 1 in state ADISC, retries 0
2016-07-14T19:02:23.558Z cpu47:33448)<6>host2: rport e50900: Port entered LOGO state from ADISC state
2016-07-14T19:02:43.560Z cpu2:33442)<6>host2: rport e50900: Received a LOGO response timeout
2016-07-14T19:02:43.560Z cpu2:33442)<6>host2: rport e50900: Error -1 in state LOGO, retrying
2016-07-14T19:02:43.560Z cpu58:33446)<6>host2: rport e50900: Port entered LOGO state from LOGO state
2016-07-14T19:03:03.563Z cpu54:33449)<6>host2: rport e50900: Received a LOGO response timeout
2016-07-14T19:03:03.563Z cpu54:33449)<6>host2: rport e50900: Error -1 in state LOGO, retrying
2016-07-14T19:03:03.563Z cpu2:33442)<6>host2: rport e50900: Port entered LOGO state from LOGO state
2016-07-14T19:03:23.565Z cpu32:33447)<6>host2: rport e50900: Received a LOGO response timeout
2016-07-14T19:03:23.565Z cpu32:33447)<6>host2: rport e50900: Error -1 in state LOGO, retrying
2016-07-14T19:03:23.565Z cpu54:33449)<6>host2: rport e50900: Port entered LOGO state from LOGO state
2016-07-14T19:03:43.567Z cpu50:33445)<6>host2: rport e50900: Received a LOGO response timeout
2016-07-14T19:03:43.567Z cpu50:33445)<6>host2: rport e50900: Error -1 in state LOGO, retrying
2016-07-14T19:03:43.567Z cpu32:33447)<6>host2: rport e50900: Port entered LOGO state from LOGO state
2016-07-14T19:04:03.568Z cpu54:33443)<6>host2: rport e50900: Received a LOGO response timeout
2016-07-14T19:04:03.568Z cpu54:33443)<6>host2: rport e50900: Error -1 in state LOGO, retrying
2016-07-14T19:04:03.569Z cpu32:33472)<6>host2: rport e50900: Port entered LOGO state from LOGO state
2016-07-14T19:04:43.573Z cpu20:33473)<6>host2: rport e50900: Received a LOGO response timeout
2016-07-14T19:04:43.573Z cpu20:33473)<6>host2: rport e50900: Error -1 in state LOGO, retrying
2016-07-14T19:04:43.573Z cpu54:33443)<6>host2: rport e50900: Port entered LOGO state from LOGO state
```
Notice the difference? Here is a breakdown of what happened this time:
- RSCN is received from the array controller
- Discovery is now in state 9
- GPN_ID (Get Port Name by ID) is issued to the switch but this time it is NOT rejected
- Since the GPN_ID response is valid, LibFC issues an Address Discovery (ADISC) to the port
- 20 seconds later the ADISC times out, and the subsequent LOGO attempts continue to time out every 20 seconds
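The practical difference between the two outcomes can be sketched as a simple model of how long I/O stays disrupted. The 20-second retry cadence and the ~10-second blocked-rport timeout come from the logs; everything else here is an illustrative simplification, not VMware's actual failover logic.

```python
def time_until_io_resumes(gpn_id_rejected: bool,
                          blocked_rport_tmo_s: int,
                          controller_reboot_s: int) -> int:
    """Seconds of I/O disruption in each scenario (simplified model)."""
    if gpn_id_rejected:
        # Healthy case: the rport is deleted and NMP fails the path
        # over after the blocked-rport timeout (~10 s in the first log).
        return blocked_rport_tmo_s
    # Broken case: the ADISC/LOGO retry loop pins the path until the
    # controller finishes rebooting and answers a new PLOGI, so the
    # disruption lasts as long as the controller reboot itself.
    return controller_reboot_s
```

With a controller reboot of a few minutes, that is the difference between a ten-second blip and minutes of stalled I/O.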
The problem is that these retries continue until the array controller completes its reboot and is back online:
```
2016-07-14T19:04:47.276Z cpu56:33451)<6>host2: disc: Received an RSCN event
2016-07-14T19:04:47.276Z cpu56:33451)<6>host2: disc: Port address format for port (e50900)
2016-07-14T19:04:47.276Z cpu56:33451)<6>host2: disc: RSCN received: not rediscovering. redisc 0 state 9 in_prog 0
2016-07-14T19:04:47.277Z cpu20:33454)<6>host2: rport e50900: Login to port
2016-07-14T19:04:47.277Z cpu20:33454)<6>host2: rport e50900: Port entered PLOGI state from LOGO state
2016-07-14T19:04:47.278Z cpu57:33456)<6>host2: rport e50900: Received a PLOGI accept
2016-07-14T19:04:47.278Z cpu57:33456)<6>host2: rport e50900: Port entered PRLI state from PLOGI state
2016-07-14T19:04:47.278Z cpu52:33458)<6>host2: rport e50900: Received a PRLI accept
2016-07-14T19:04:47.278Z cpu52:33458)<6>host2: rport e50900: PRLI spp_flags = 0x21
2016-07-14T19:04:47.278Z cpu52:33458)<6>host2: rport e50900: Port entered RTV state from PRLI state
2016-07-14T19:04:47.278Z cpu57:33452)<6>host2: rport e50900: Received a RTV reject
2016-07-14T19:04:47.278Z cpu57:33452)<6>host2: rport e50900: Port is Ready
```
What is actually happening here is that the Cisco MDS switches are quick to receive the RSCN from the array controller and pass it along to the host HBAs. However, due to a timing issue, the entries for that array controller in the FCNS (Fibre Channel Name Server) database are still present when the host HBAs issue the GPN_ID, so the switches answer that request instead of rejecting it. If you review the thread at http://lists.open-fcoe.org/pipermail/fcoe-devel/2009-June/002828.html, you will see that code was added to validate that the target is actually off the fabric instead of assuming it is from the RSCN alone. There are various reasons to do this, but suffice it to say that it is better to be safe than sorry in this instance.
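The race on the switch boils down to a comparison of two timestamps: whether the host's GPN_ID query lands before or after the switch finishes purging the departed port's FCNS entry. A minimal sketch of that window, with hypothetical names (the real NX-OS internals are of course not this simple):

```python
# Sketch of the timing window on an NX-OS 7.3 MDS switch. If the
# host's GPN_ID query arrives before the switch has purged the
# rebooting controller's FCNS entry, the switch answers from the
# stale record instead of rejecting the query.

def gpn_id_response(query_time_s: float, fcns_purge_done_s: float) -> str:
    """What the name server returns for the departed port."""
    if query_time_s < fcns_purge_done_s:
        return "ACCEPT (stale entry)"  # buggy window: host keeps ADISC-ing
    return "REJECT"                    # expected answer once cleanup is done
```

The fix in 7.3(1) presumably closes this window so that by the time the RSCN-triggered GPN_ID arrives, the stale entry is gone.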
Unfortunately, there is no fix for this at the time of writing, which is why this is potentially so impactful to our customers: it means they effectively cannot perform array maintenance without the risk of VMs crashing or even data corruption. Cisco is fixing this in 7.3(1), which is due out in a few weeks.
Here are a couple of references regarding this issue:
- VMware KB: ESXi path failover not occurring with Cisco MDS switches running NX-OS 7.3 (2146334)
- Cisco Bug: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCva64432
Cheers,
Nathan Small
Technical Director
Global Support Services
VMware
Hello, this isn't exactly right. The issue can affect both FC and FCoE. (FCoE on MDS is rare, but it would apply.) It is a timing problem: the switch has not finished cleaning up the name server database before the host sends a GPN_ID, so the query is accepted and leads to the behavior described.
This is being tracked via https://bst.cloudapps.cisco.com/bugsearch/bug/CSCva64432
FYI, as has been mentioned, the fix for bug CSCva64432 is in NX-OS 7.3(1)D1(1). This release has been available on cisco.com since 9/11/2016.