posted

2 Comments

So I wanted to get this blog post out sooner rather than later as it might effect a significant number of customers. In a nutshell, if you perform array maintenance that requires you to reboot a storage controller, the probability of successful path failover is low. This is effectively due to stale entries in the fiber channel name server on Cisco MDS switches running NX-OS 7.3, which is a rather new code release. As the title suggests, this only affects FCoE HBAs, specifically ones that rely on our libfc/libfcoe stack for FCoE connectivity. Such HBAs would be Cisco fnic HBAs as well as a handful of Emulex FCoE HBAs and a couple others.

Here is an example of a successful path failover after receiving an RSCN (Register State Change Notification) from the array controller after performing a reboot:

Here is a breakdown of what you just read:

  1. RSCN is received from the array controller
  2. Operation is now is state = 9
  3. GPN_ID (Get Port Name ID) is issued to the switches but is rejected because the state is 9 (See http://lists.open-fcoe.org/pipermail/fcoe-devel/2009-June/002828.html)
  4. LibFC begins to remove the port information on the host
  5. Port enters LOGO (Logout) state from previous state, which was Ready
  6. LibFC Deletes the port information

After this the ESX host will failover to other available ports, which would be on the peer SP:

A Host status of H:0x1 means NO_CONNECT, hence the failover.

Now here is an example of the same operation on a Cisco MDS switch running NX-OS 7.3 when a storage controller on the array is rebooted:

Notice the difference? Here is a breakdown of what happened this time:

  1. RSCN is received from the array controller
  2. Operation is now is state = 9
  3. GPN_ID (Get Port Name ID) is issued to the switches but is NOT rejected
  4. Since GPN_ID is valid, LibFC issues an Address Discovery (ADISC)
  5. 20 seconds later the ADISC sent times out and this continues to occur every 20 seconds

The problem is that the ADISC will continue this behavior until the array controller completes the reboot and is back online:

What is actually happening here is that the Cisco MDS switches are quick to receive the RSCN from the array controller and pass it along to the host HBAs however due to a timing issue the entries for that array controller in the FCNS (Fiber Channel Name Server) database are still present when the host HBAs issue the GPN_ID so the switches respond to that request instead of rejecting it. If you review the entry in http://lists.open-fcoe.org/pipermail/fcoe-devel/2009-June/002828.html you see that code was added to validate that the target is actually off the fabric instead of assuming it would be by the RSCN alone. There are various reasons to do this but suffice it to say that it is better to be safe than sorry in this instance.

Unfortunately there is no fix for this at this time, which is why this is potentially so impactful to our customers since it means they effectively are unable to perform array maintenance without the risk of VMs crashing or even corruption. Cisco is fixing this in 7.3(1), which due out in a few weeks.

Here are a couple of references regarding this issue:

 

Cheers,
Nathan Small
Technical Director
Global Support Services
VMware