
Category Archives: Datacenter

Path failover may not be successful when using Cisco MDS switches on NX-OS 7.3 and FCoE-based HBAs

I wanted to get this blog post out sooner rather than later, as it might affect a significant number of customers. In a nutshell: if you perform array maintenance that requires you to reboot a storage controller, the probability of a successful path failover is low. This is due to stale entries in the Fibre Channel name server on Cisco MDS switches running NX-OS 7.3, which is a rather new code release. As the title suggests, this only affects FCoE HBAs, specifically ones that rely on our libfc/libfcoe stack for FCoE connectivity. Such HBAs include Cisco fnic HBAs as well as a handful of Emulex FCoE HBAs and a couple of others.

Here is an example of a successful path failover after an RSCN (Registered State Change Notification) is received from the array controller when it is rebooted:

2016-07-07T17:36:34.230Z cpu17:33461)<6>host4: disc: Received an RSCN event
 2016-07-07T17:36:34.230Z cpu17:33461)<6>host4: disc: Port address format for port (e50800)
 2016-07-07T17:36:34.230Z cpu17:33461)<6>host4: disc: RSCN received: not rediscovering. redisc 0 state 9 in_prog 0
 2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: disc: GPN_ID rejected reason 9 exp 1
 2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: rport e50800: Remove port
 2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: rport e50800: Port entered LOGO state from Ready state
 2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: rport e50800: Delete port
 2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: work event 3
 2016-07-07T17:36:34.231Z cpu54:33448)<7>fnic : 4 :: fnic_rport_exch_reset called portid 0xe50800
 2016-07-07T17:36:34.231Z cpu54:33448)<7>fnic : 4 :: fnic_rport_reset_exch: Issuing abts
 2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: Received a LOGO response closed
 2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: Received a LOGO response, but in state Delete
 2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: work delete

Here is a breakdown of what you just read:

  1. An RSCN is received from the array controller
  2. The discovery operation is now in state = 9
  3. GPN_ID (Get Port Name by ID) is issued to the switches but is rejected because the state is 9 (see http://lists.open-fcoe.org/pipermail/fcoe-devel/2009-June/002828.html)
  4. LibFC begins to remove the port information on the host
  5. The port enters the LOGO (Logout) state from its previous state, which was Ready
  6. LibFC deletes the port information

After this, the ESX host fails over to the other available ports, which are on the peer SP:

2016-07-07T17:36:44.233Z cpu33:33459)<3> rport-4:0-1: blocked FC remote port time out: saving binding
 2016-07-07T17:36:44.233Z cpu55:33473)<7>fnic : 4 :: fnic_terminate_rport_io called wwpn 0x524a937aeb740513, wwnn0xffffffffffffffff, rport 0x0x4309b72f3c50, portid 0xffffffff
 2016-07-07T17:36:44.257Z cpu52:33320)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x2a (0x43a659d15bc0, 36277) to dev "naa.624a93704d1296f5972642ea0001101c" on path "vmhba3:C0:T0:L1" Failed: H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:FAILOVER

A Host status of H:0x1 means NO_CONNECT, hence the failover.
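The status triplet in that NMP line can be decoded mechanically. Below is a minimal Python sketch that pulls the H/D/P fields out of a ThrottleLog line and names the host status; the code table is a partial, illustrative mapping based on the Linux SCSI midlayer host byte values that the libfc/fnic stack uses, not an exhaustive ESXi reference.

```python
import re

# Partial map of host-status codes. These follow the Linux SCSI midlayer
# host byte values (DID_*); the table is illustrative, not exhaustive.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIME_OUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

# Matches the status triplet in an NMP ThrottleLog line,
# e.g. "Failed: H:0x1 D:0x0 P:0x0"
STATUS_RE = re.compile(r"H:0x([0-9a-fA-F]+) D:0x([0-9a-fA-F]+) P:0x([0-9a-fA-F]+)")

def decode_nmp_status(log_line):
    """Return (host_status_name, device_status, plugin_status) or None."""
    m = STATUS_RE.search(log_line)
    if not m:
        return None
    host, device, plugin = (int(g, 16) for g in m.groups())
    return HOST_STATUS.get(host, "UNKNOWN(0x%x)" % host), device, plugin

line = ('NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x2a (0x43a659d15bc0, 36277) '
        'Failed: H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:FAILOVER')
print(decode_nmp_status(line))  # ('NO_CONNECT', 0, 0)
```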

Now here is an example of the same operation on a Cisco MDS switch running NX-OS 7.3 when a storage controller on the array is rebooted:

2016-07-14T19:02:03.551Z cpu47:33448)<6>host2: disc: Received an RSCN event
 2016-07-14T19:02:03.551Z cpu47:33448)<6>host2: disc: Port address format for port (e50900)
 2016-07-14T19:02:03.551Z cpu47:33448)<6>host2: disc: RSCN received: not rediscovering. redisc 0 state 9 in_prog 0
 2016-07-14T19:02:03.557Z cpu47:33444)<6>host2: rport e50900: ADISC port
 2016-07-14T19:02:03.557Z cpu47:33444)<6>host2: rport e50900: sending ADISC from Ready state
 2016-07-14T19:02:23.558Z cpu47:33448)<6>host2: rport e50900: Received a ADISC response
 2016-07-14T19:02:23.558Z cpu47:33448)<6>host2: rport e50900: Error 1 in state ADISC, retries 0
 2016-07-14T19:02:23.558Z cpu47:33448)<6>host2: rport e50900: Port entered LOGO state from ADISC state
 2016-07-14T19:02:43.560Z cpu2:33442)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:02:43.560Z cpu2:33442)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:02:43.560Z cpu58:33446)<6>host2: rport e50900: Port entered LOGO state from LOGO state
 2016-07-14T19:03:03.563Z cpu54:33449)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:03:03.563Z cpu54:33449)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:03:03.563Z cpu2:33442)<6>host2: rport e50900: Port entered LOGO state from LOGO state
 2016-07-14T19:03:23.565Z cpu32:33447)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:03:23.565Z cpu32:33447)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:03:23.565Z cpu54:33449)<6>host2: rport e50900: Port entered LOGO state from LOGO state
 2016-07-14T19:03:43.567Z cpu50:33445)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:03:43.567Z cpu50:33445)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:03:43.567Z cpu32:33447)<6>host2: rport e50900: Port entered LOGO state from LOGO state
 2016-07-14T19:04:03.568Z cpu54:33443)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:04:03.568Z cpu54:33443)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:04:03.569Z cpu32:33472)<6>host2: rport e50900: Port entered LOGO state from LOGO state
 2016-07-14T19:04:43.573Z cpu20:33473)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:04:43.573Z cpu20:33473)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:04:43.573Z cpu54:33443)<6>host2: rport e50900: Port entered LOGO state from LOGO state

Notice the difference? Here is a breakdown of what happened this time:

  1. An RSCN is received from the array controller
  2. The discovery operation is now in state = 9
  3. GPN_ID (Get Port Name by ID) is issued to the switches but is NOT rejected
  4. Since the GPN_ID response is valid, LibFC issues an Address Discovery (ADISC)
  5. 20 seconds later the ADISC times out, and this continues to occur every 20 seconds
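Because the failure signature is just the same LOGO timeout repeating every 20 seconds for the same rport, it is easy to scan for. Here is a rough Python sketch that counts those events per remote port in a vmkernel log; the message text is taken verbatim from the excerpts in this post, so treat it as an assumption if your build formats the messages differently.

```python
import re

# Signature of the stuck failover, copied from the excerpt below: the
# same rport logging "Received a LOGO response timeout" over and over.
TIMEOUT_RE = re.compile(r"rport (\w+): Received a LOGO response timeout")

def stuck_rports(log_lines, threshold=3):
    """Return rports that logged the LOGO timeout at least `threshold` times."""
    counts = {}
    for line in log_lines:
        m = TIMEOUT_RE.search(line)
        if m:
            counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return sorted(r for r, n in counts.items() if n >= threshold)
```

Run against the host's vmkernel log (for example, `stuck_rports(open("/var/log/vmkernel.log"))`); feeding it the excerpt above would flag rport e50900.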

The problem is that the ADISC will continue this behavior until the array controller completes the reboot and is back online:

2016-07-14T19:04:47.276Z cpu56:33451)<6>host2: disc: Received an RSCN event
 2016-07-14T19:04:47.276Z cpu56:33451)<6>host2: disc: Port address format for port (e50900)
 2016-07-14T19:04:47.276Z cpu56:33451)<6>host2: disc: RSCN received: not rediscovering. redisc 0 state 9 in_prog 0
 2016-07-14T19:04:47.277Z cpu20:33454)<6>host2: rport e50900: Login to port
 2016-07-14T19:04:47.277Z cpu20:33454)<6>host2: rport e50900: Port entered PLOGI state from LOGO state
 2016-07-14T19:04:47.278Z cpu57:33456)<6>host2: rport e50900: Received a PLOGI accept
 2016-07-14T19:04:47.278Z cpu57:33456)<6>host2: rport e50900: Port entered PRLI state from PLOGI state
 2016-07-14T19:04:47.278Z cpu52:33458)<6>host2: rport e50900: Received a PRLI accept
 2016-07-14T19:04:47.278Z cpu52:33458)<6>host2: rport e50900: PRLI spp_flags = 0x21
 2016-07-14T19:04:47.278Z cpu52:33458)<6>host2: rport e50900: Port entered RTV state from PRLI state
 2016-07-14T19:04:47.278Z cpu57:33452)<6>host2: rport e50900: Received a RTV reject
 2016-07-14T19:04:47.278Z cpu57:33452)<6>host2: rport e50900: Port is Ready

What is actually happening here is that the Cisco MDS switches are quick to receive the RSCN from the array controller and pass it along to the host HBAs; however, due to a timing issue, the entries for that array controller in the FCNS (Fibre Channel Name Server) database are still present when the host HBAs issue the GPN_ID, so the switches respond to that request instead of rejecting it. If you review the entry at http://lists.open-fcoe.org/pipermail/fcoe-devel/2009-June/002828.html, you will see that code was added to validate that the target is actually off the fabric rather than assuming it is based on the RSCN alone. There are various reasons to do this, but suffice it to say that it is better to be safe than sorry in this instance.
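To make the cause and effect concrete, here is a toy model (not the actual libfc source) of the post-RSCN decision the two log excerpts illustrate: a rejected GPN_ID leads straight to port removal and NMP failover, while a stale FCNS entry that answers the query sends the driver into the ADISC retry loop.

```python
# A toy model of the libfc post-RSCN control flow described above --
# not the real driver code, just the behavior the log excerpts show.

def handle_rscn(gpn_id_rejected):
    """Return the action libfc takes for a remote port after an RSCN.

    gpn_id_rejected: True when the switch's FCNS database no longer has
    an entry for the port (the healthy case), False when a stale entry
    makes the switch answer the GPN_ID query (the NX-OS 7.3 case).
    """
    if gpn_id_rejected:
        # Entry is gone from the name server: remove the rport so NMP
        # sees NO_CONNECT and fails over to the peer SP right away.
        return "remove_port_and_failover"
    # The name server still claims the port exists, so libfc tries to
    # re-verify it with ADISC, which times out every 20 seconds while
    # the controller is down -- and I/O never fails over.
    return "adisc_retry_loop"

print(handle_rscn(True))   # remove_port_and_failover
print(handle_rscn(False))  # adisc_retry_loop
```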

Unfortunately there is no fix for this at this time, which is why it is potentially so impactful to our customers: it means they are effectively unable to perform array maintenance without the risk of VMs crashing or even data corruption. Cisco is fixing this in NX-OS 7.3(1), which is due out in a few weeks.


Cheers,
Nathan Small
Technical Director
Global Support Services
VMware

Top 20 vCenter Server articles for July 2016

Here is our Top 20 vCenter articles list for July 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. Uploading diagnostic information for VMware using FTP
  2. Downloading, licensing, and using VMware products
  3. Licensing VMware vCenter Site Recovery Manager
  4. Collecting diagnostic information for VMware vCenter Server 4.x, 5.x and 6.0
  5. Using the VMware Knowledge Base
  6. Best practices for upgrading to vCenter Server 6.0
  7. ESXi hosts are no longer manageable after an upgrade
  8. Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x
  9. Consolidating snapshots in vSphere 5.x/6.0
  10. Diagnosing an ESXi/ESX host that is disconnected or not responding in VMware vCenter Server
  11. How to unlock and reset the vCenter SSO administrator password
  12. Resetting the VMware vCenter Server 5.x Inventory Service database
  13. Correlating build numbers and versions of VMware products
  14. Back up and restore vCenter Server Appliance/vCenter Server 6.0 vPostgres database
  15. Build numbers and versions of VMware vCenter Server
  16. Re-pointing and re-registering VMware vCenter Server 5.1 / 5.5 and components
  17. “Deprecated VMFS volume(s) found on the host” error in ESXi hosts
  18. vmware-dataservice-sca and vsphere-client status change from green to yellow
  19. Investigating virtual machine file locks on ESXi/ESX
  20. VMware End User License Agreements

Top 20 vCenter Server articles for June 2016

Here is our Top 20 vCenter articles list for June 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. Purging old data from the database used by VMware vCenter Server
  2. ESXi 5.5 Update 3b and later hosts are no longer manageable after upgrade
  3. Resetting the VMware vCenter Server and vCenter Server Appliance 6.0 Inventory Service database
  4. Unlocking and resetting the VMware vCenter Single Sign-On administrator password
  5. Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x
  6. Upgrading to vCenter Server 6.0 best practices
  7. Correlating build numbers and versions of VMware products
  8. Update sequence for vSphere 6.0 and its compatible VMware products
  9. Stopping, starting, or restarting VMware vCenter Server services
  10. In vCenter Server 6.0, the vmware-dataservice-sca and vsphere-client status change from green to yellow continually
  11. Enabling EVC on a cluster when vCenter Server is running in a virtual machine
  12. The vpxd process becomes unresponsive after upgrading to VMware vCenter Server 5.5
  13. Migrating the vCenter Server database from SQL Express to full SQL Server
  14. Reducing the size of the vCenter Server database when the rollup scripts take a long time to run
  15. Consolidating snapshots in vSphere 5.x/6.0
  16. Back up and restore vCenter Server Appliance/vCenter Server 6.0 vPostgres database
  17. Diagnosing an ESXi/ESX host that is disconnected or not responding in VMware vCenter Server
  18. Build numbers and versions of VMware vCenter Server
  19. Increasing the size of a virtual disk
  20. Determining where growth is occurring in the VMware vCenter Server database

Windows 2008+ incremental backups become full backups in ESXi 6.0 b3825889

VMware is actively working to address a recently discovered issue wherein an incremental backup becomes a full backup when backing up Windows 2008 (or later) virtual machines with a VSS-based application-quiesced snapshot.

This recent CBT (Changed Block Tracking) issue does not cause any data loss or data corruption.

This issue is well understood and VMware engineering is actively working on a fix.

For more details on this issue and latest status on resolution, please refer to KB article: After upgrading to ESXi 6.0 Build 3825889, incremental virtual machine backups effectively run as full backups when application consistent quiescing is enabled (2145895)

Subscribe to the RSS feed for the KB article to ensure you do not miss any updates.

Top 20 vCenter articles for May 2016

Here is our Top 20 vCenter articles list for May 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. Purging old data from the database used by VMware vCenter Server
  2. ESXi 5.5 Update 3b and later hosts are no longer manageable after upgrade
  3. Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x
  4. Upgrading to vCenter Server 6.0 best practices
  5. ESX/ESXi host keeps disconnecting and reconnecting when heartbeats are not received by vCenter Server
  6. Unlocking and resetting the VMware vCenter Single Sign-On administrator password
  7. Consolidating snapshots in vSphere 5.x/6.0
  8. Powering on a virtual machine fails after a storage outage with the error: could not open/create change tracking file
  9. Diagnosing an ESXi/ESX host that is disconnected or not responding in VMware vCenter Server
  10. VMware vSphere Web Client displays the error: Failed to verify the SSL certificate for one or more vCenter Server Systems
  11. Deprecated VMFS volume warning reported by ESXi hosts
  12. Resetting the VMware vCenter Server and vCenter Server Appliance 6.0 Inventory Service database
  13. Cannot take a quiesced snapshot of Windows 2008 R2 virtual machine
  14. vCenter Server 5.5 fails to start after reboot with the error: Unable to create SSO facade: Invalid response code: 404 Not Found
  15. Update sequence for vSphere 6.0 and its compatible VMware products
  16. Registering or adding a virtual machine to the Inventory in vCenter Server or in an ESX/ESXi host
  17. Back up and restore vCenter Server Appliance/vCenter Server 6.0 vPostgres database
  18. Updating rollup jobs after the error: Performance data is currently not available for this entity
  19. Configuring VMware vCenter Server to send alarms when virtual machines are running from snapshots
  20. Determining where growth is occurring in the VMware vCenter Server database

Top 20 ESXi articles for May 2016

Here is our Top 20 ESXi articles list for May 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. VMware ESXi 5.x host experiences a purple diagnostic screen mentioning E1000PollRxRing and E1000DevRx
  2. ESXi 5.5 Update 3b and later hosts are no longer manageable after upgrade
  3. Commands to monitor snapshot deletion in VMware ESXi/ESX
  4. Recreating a missing virtual machine disk descriptor file
  5. Determining Network/Storage firmware and driver version in ESXi/ESX 4.x, ESXi 5.x, and ESXi 6.x
  6. Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x
  7. Installing patches on an ESXi 5.x/6.x host from the command line
  8. Identifying and addressing Non-Maskable Interrupt events on an ESX/ESXi host
  9. Restarting the Management agents on an ESXi or ESX host
  10. Downloading and installing async drivers in VMware ESXi 5.x and ESXi 6.0.x
  11. Enabling or disabling VAAI ATS heartbeat
  12. ESXi 5.5 or 6.0 host disconnects from vCenter Server with the syslog.log error: Unable to allocate memory
  13. Powering off a virtual machine on an ESXi host
  14. Consolidating snapshots in vSphere 5.x/6.0
  15. Powering on a virtual machine fails after a storage outage with the error: could not open/create change tracking file
  16. Snapshot consolidation in VMware ESXi 5.5.x and ESXi 6.0.x fails with the error: maximum consolidate retries was exceeded for scsix:x
  17. Reverting to a previous version of ESXi
  18. Configuring a diagnostic coredump partition on an ESXi 5.x/6.0 host
  19. Diagnosing an ESXi/ESX host that is disconnected or not responding in VMware vCenter Server
  20. Enabling or disabling simultaneous write protection provided by VMFS using the multi-writer flag

Inventory objects fail to display in the vSphere Web Client

An issue that has recently been generating calls into our support lines occurs after installing or upgrading to vCenter Server 6.0. It also affects the vCenter Server Appliance, so it is not specific to Windows. The issue occurs in the vSphere Web Client due to caching within the VMware Inventory Service.

  • When browsing inventory under Hosts and Clusters, some or no objects are displayed
  • When using the Search function in the vSphere Web Client, the object is found
  • The issue does not occur in the vSphere Client
  • When browsing inventory under Related Objects at multiple levels, you see this message:
no object found

If you suspect you have run into this, head on over to KB: In the vSphere Web Client 6.0 inventory objects fail to display (2144934) for more details on what to check in your logs and what can be done about it.

 


Top 20 vCenter articles for April 2016

Here is our Top 20 vCenter articles list for April 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. Purging old data from the database used by VMware vCenter Server
  2. After upgrading an ESXi host to 5.5 Update 3b and later, the host is no longer manageable by vCenter Server
  3. Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x
  4. Unlocking and resetting the VMware vCenter Single Sign-On administrator password
  5. Consolidating snapshots in vSphere 5.x/6.0
  6. Deprecated VMFS volume warning reported by ESXi hosts
  7. Upgrading to vCenter Server 6.0 best practices
  8. Resetting the VMware vCenter Server and vCenter Server Appliance 6.0 Inventory Service database
  9. Linked Clone pool creation and recomposition fails with VMware Horizon View 6.1.x and older releases
  10. Replacing default certificates with CA signed SSL certificates in vSphere 6.0
  11. Update sequence for VMware vSphere 5.5 and its compatible VMware products
  12. ESX/ESXi host keeps disconnecting and reconnecting when heartbeats are not received by vCenter Server
  13. VMware vSphere Web Client displays the error: Failed to verify the SSL certificate for one or more vCenter Server Systems
  14. Updating vCenter Server 5.5 to Update 3 fails with the error: Warning 32014. A utility for phone home data collector couldn’t be executed successfully
  15. Reducing the size of the vCenter Server database when the rollup scripts take a long time to run
  16. vCenter Server fails immediately after starting with the error: Fault Module: ntdll.dll
  17. Replacing a vSphere 6.0 Machine SSL certificate with a Custom Certificate Authority Signed Certificate
  18. In the vSphere Web Client 6.0 inventory objects fail to display
  19. Migrating virtual machines with Raw Device Mappings (RDMs)
  20. Cannot take a quiesced snapshot of Windows 2008 R2 virtual machine

Top 20 ESXi articles for April 2016

Here is our Top 20 ESXi articles list for April 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. VMware ESXi 5.x host experiences a purple diagnostic screen mentioning E1000PollRxRing and E1000DevRx
  2. Commands to monitor snapshot deletion in VMware ESXi/ESX
  3. After upgrading an ESXi host to 5.5 Update 3b and later, the host is no longer manageable by vCenter Server
  4. Determining Network/Storage firmware and driver version in ESXi/ESX 4.x, ESXi 5.x, and ESXi 6.x
  5. Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x
  6. Powering off a virtual machine on an ESXi host
  7. Consolidating snapshots in vSphere 5.x/6.0
  8. Recreating a missing virtual machine disk descriptor file
  9. Reverting to a previous version of ESXi
  10. Installing patches on an ESXi 5.x/6.x host from the command line
  11. ESXi 6.0 Update 2 host fails with a purple diagnostic screen containing the error: Vmxnet3VMKDevRxWithLock and Vmxnet3VMKDevRx
  12. Cloning individual virtual machine disks via the ESX/ESXi host terminal
  13. Deprecated VMFS volume warning reported by ESXi hosts
  14. Restarting the Management agents on an ESXi or ESX host
  15. Enabling or disabling VAAI ATS heartbeat
  16. Downloading and installing async drivers in VMware ESXi 5.x and ESXi 6.0.x
  17. Configuring a diagnostic coredump partition on an ESXi 5.x/6.0 host
  18. Issuing a 0x85 SCSI Command from a VMware ESXi 6.0 host results in a PDL error
  19. Information about the error: state in doubt; requested fast path state update
  20. Committing snapshots when there are no snapshot entries in the Snapshot Manager

User account locked in vCenter Server Appliance

We’ve recently noticed a number of cases where vSphere administrators become locked out of their accounts or receive reports of incorrect passwords in the vCenter Server Appliance. If you find yourself in this position, here are two articles that address these issues:

KB 2034608
When attempting to log into the VMware vSphere 5.1, 5.5, or 6.0 Web Client, you observe the following symptom: “User account is locked. Please contact your administrator.” This often occurs when the wrong password has been entered multiple times. Waiting out the default 15-minute lockout period will allow you to attempt the login again. If you are still unsuccessful after multiple attempts, you may need to reset the password.

KB 2069041
When attempting to log into the vCenter Server 5.5 or 6.0 Appliance, you find that the root account is locked out. This often occurs because the vCenter Server Appliance has a default 90-day password expiration policy. The article provides steps on how to modify the password expiration policy and unlock the account.