
Category Archives: Knowledge Base


Path failover may not be successful when using Cisco MDS Switches on NX-OS 7.3 and FCoE based HBAs

I wanted to get this blog post out sooner rather than later, as it might affect a significant number of customers. In a nutshell, if you perform array maintenance that requires you to reboot a storage controller, the probability of a successful path failover is low. This is due to stale entries in the Fibre Channel name server on Cisco MDS switches running NX-OS 7.3, which is a rather new code release. As the title suggests, this only affects FCoE HBAs, specifically ones that rely on our libfc/libfcoe stack for FCoE connectivity. Such HBAs include Cisco fnic HBAs as well as a handful of Emulex FCoE HBAs and a couple of others.

Here is an example of a successful path failover after an RSCN (Registered State Change Notification) is received from the array controller when it is rebooted:

2016-07-07T17:36:34.230Z cpu17:33461)<6>host4: disc: Received an RSCN event
 2016-07-07T17:36:34.230Z cpu17:33461)<6>host4: disc: Port address format for port (e50800)
 2016-07-07T17:36:34.230Z cpu17:33461)<6>host4: disc: RSCN received: not rediscovering. redisc 0 state 9 in_prog 0
 2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: disc: GPN_ID rejected reason 9 exp 1
 2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: rport e50800: Remove port
 2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: rport e50800: Port entered LOGO state from Ready state
 2016-07-07T17:36:34.231Z cpu14:33474)<6>host4: rport e50800: Delete port
 2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: work event 3
 2016-07-07T17:36:34.231Z cpu54:33448)<7>fnic : 4 :: fnic_rport_exch_reset called portid 0xe50800
 2016-07-07T17:36:34.231Z cpu54:33448)<7>fnic : 4 :: fnic_rport_reset_exch: Issuing abts
 2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: Received a LOGO response closed
 2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: Received a LOGO response, but in state Delete
 2016-07-07T17:36:34.231Z cpu54:33448)<6>host4: rport e50800: work delete

Here is a breakdown of what you just read:

  1. An RSCN is received from the array controller
  2. The discovery operation is now in state = 9
  3. A GPN_ID (Get Port Name by ID) is issued to the switches but is rejected because the state is 9 (see http://lists.open-fcoe.org/pipermail/fcoe-devel/2009-June/002828.html)
  4. libfc begins to remove the port information on the host
  5. The port enters the LOGO (Logout) state from its previous state, which was Ready
  6. libfc deletes the port information

After this, the ESXi host will fail over to other available ports, which would be on the peer SP (storage processor):

2016-07-07T17:36:44.233Z cpu33:33459)<3> rport-4:0-1: blocked FC remote port time out: saving binding
 2016-07-07T17:36:44.233Z cpu55:33473)<7>fnic : 4 :: fnic_terminate_rport_io called wwpn 0x524a937aeb740513, wwnn0xffffffffffffffff, rport 0x0x4309b72f3c50, portid 0xffffffff
 2016-07-07T17:36:44.257Z cpu52:33320)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x2a (0x43a659d15bc0, 36277) to dev "naa.624a93704d1296f5972642ea0001101c" on path "vmhba3:C0:T0:L1" Failed: H:0x1 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:FAILOVER

A Host status of H:0x1 means NO_CONNECT, hence the failover.
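
For quick reference when reading NMP lines like the one above, here is a minimal Python sketch that pulls the H: value out of a "Failed:" line and names it. The status table is a hand-picked subset of the common SCSI host status codes and is illustrative rather than exhaustive:

import re

# Common host status (H:) codes seen in vmkernel NMP log lines.
# Illustrative subset only, not an exhaustive or authoritative list.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",  # path is gone; NMP attempts failover
    0x2: "BUS_BUSY",
    0x3: "TIME_OUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

def host_status(nmp_line):
    """Extract the H:0x? value from an NMP 'Failed:' line and name it."""
    match = re.search(r"H:0x([0-9a-fA-F]+)", nmp_line)
    if not match:
        return "no host status found"
    code = int(match.group(1), 16)
    return HOST_STATUS.get(code, "unknown (0x%x)" % code)

line = ('... on path "vmhba3:C0:T0:L1" Failed: H:0x1 D:0x0 P:0x0 '
        'Possible sense data: 0x0 0x0 0x0. Act:FAILOVER')
print(host_status(line))  # -> NO_CONNECT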

Now here is an example of the same operation on a Cisco MDS switch running NX-OS 7.3 when a storage controller on the array is rebooted:

2016-07-14T19:02:03.551Z cpu47:33448)<6>host2: disc: Received an RSCN event
 2016-07-14T19:02:03.551Z cpu47:33448)<6>host2: disc: Port address format for port (e50900)
 2016-07-14T19:02:03.551Z cpu47:33448)<6>host2: disc: RSCN received: not rediscovering. redisc 0 state 9 in_prog 0
 2016-07-14T19:02:03.557Z cpu47:33444)<6>host2: rport e50900: ADISC port
 2016-07-14T19:02:03.557Z cpu47:33444)<6>host2: rport e50900: sending ADISC from Ready state
 2016-07-14T19:02:23.558Z cpu47:33448)<6>host2: rport e50900: Received a ADISC response
 2016-07-14T19:02:23.558Z cpu47:33448)<6>host2: rport e50900: Error 1 in state ADISC, retries 0
 2016-07-14T19:02:23.558Z cpu47:33448)<6>host2: rport e50900: Port entered LOGO state from ADISC state
 2016-07-14T19:02:43.560Z cpu2:33442)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:02:43.560Z cpu2:33442)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:02:43.560Z cpu58:33446)<6>host2: rport e50900: Port entered LOGO state from LOGO state
 2016-07-14T19:03:03.563Z cpu54:33449)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:03:03.563Z cpu54:33449)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:03:03.563Z cpu2:33442)<6>host2: rport e50900: Port entered LOGO state from LOGO state
 2016-07-14T19:03:23.565Z cpu32:33447)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:03:23.565Z cpu32:33447)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:03:23.565Z cpu54:33449)<6>host2: rport e50900: Port entered LOGO state from LOGO state
 2016-07-14T19:03:43.567Z cpu50:33445)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:03:43.567Z cpu50:33445)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:03:43.567Z cpu32:33447)<6>host2: rport e50900: Port entered LOGO state from LOGO state
 2016-07-14T19:04:03.568Z cpu54:33443)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:04:03.568Z cpu54:33443)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:04:03.569Z cpu32:33472)<6>host2: rport e50900: Port entered LOGO state from LOGO state
 2016-07-14T19:04:43.573Z cpu20:33473)<6>host2: rport e50900: Received a LOGO response timeout
 2016-07-14T19:04:43.573Z cpu20:33473)<6>host2: rport e50900: Error -1 in state LOGO, retrying
 2016-07-14T19:04:43.573Z cpu54:33443)<6>host2: rport e50900: Port entered LOGO state from LOGO state

Notice the difference? Here is a breakdown of what happened this time:

  1. An RSCN is received from the array controller
  2. The discovery operation is now in state = 9
  3. A GPN_ID (Get Port Name by ID) is issued to the switches but this time it is NOT rejected
  4. Since the GPN_ID response is valid, libfc issues an Address Discovery (ADISC); a simplified sketch of this decision flow follows the list
  5. 20 seconds later the ADISC times out, and this continues to occur every 20 seconds
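
The difference between the two captures comes down to how libfc reacts to the GPN_ID response. Here is a simplified sketch of that decision flow in Python, purely as an illustration of the behavior described above; the real implementation is C code in the libfc/libfcoe stack, and the function and return strings here are made up:

def handle_rscn(gpn_id_accepted):
    """Illustrative rport decision flow after an RSCN arrives.

    gpn_id_accepted: True if the switch name server still holds an entry
    for the port and therefore answers GPN_ID instead of rejecting it.
    """
    if not gpn_id_accepted:
        # Healthy case: the name server no longer knows the port, so the
        # rport is removed and NMP can fail over right away.
        return "remove rport -> LOGO -> delete -> NMP failover"
    # NX-OS 7.3 case: a stale FCNS entry means GPN_ID is answered, so
    # libfc sends ADISC to a controller that is actually down and then
    # loops through LOGO retries until the controller comes back.
    return "ADISC -> timeout -> LOGO retry loop until controller returns"

print(handle_rscn(gpn_id_accepted=False))  # first capture (failover works)
print(handle_rscn(gpn_id_accepted=True))   # second capture (paths stuck)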

The problem is that the ADISC will continue this behavior until the array controller completes the reboot and is back online:

2016-07-14T19:04:47.276Z cpu56:33451)<6>host2: disc: Received an RSCN event
 2016-07-14T19:04:47.276Z cpu56:33451)<6>host2: disc: Port address format for port (e50900)
 2016-07-14T19:04:47.276Z cpu56:33451)<6>host2: disc: RSCN received: not rediscovering. redisc 0 state 9 in_prog 0
 2016-07-14T19:04:47.277Z cpu20:33454)<6>host2: rport e50900: Login to port
 2016-07-14T19:04:47.277Z cpu20:33454)<6>host2: rport e50900: Port entered PLOGI state from LOGO state
 2016-07-14T19:04:47.278Z cpu57:33456)<6>host2: rport e50900: Received a PLOGI accept
 2016-07-14T19:04:47.278Z cpu57:33456)<6>host2: rport e50900: Port entered PRLI state from PLOGI state
 2016-07-14T19:04:47.278Z cpu52:33458)<6>host2: rport e50900: Received a PRLI accept
 2016-07-14T19:04:47.278Z cpu52:33458)<6>host2: rport e50900: PRLI spp_flags = 0x21
 2016-07-14T19:04:47.278Z cpu52:33458)<6>host2: rport e50900: Port entered RTV state from PRLI state
 2016-07-14T19:04:47.278Z cpu57:33452)<6>host2: rport e50900: Received a RTV reject
 2016-07-14T19:04:47.278Z cpu57:33452)<6>host2: rport e50900: Port is Ready

What is actually happening here is that the Cisco MDS switches are quick to receive the RSCN from the array controller and pass it along to the host HBAs. However, due to a timing issue, the entries for that array controller in the FCNS (Fibre Channel Name Server) database are still present when the host HBAs issue the GPN_ID, so the switches respond to that request instead of rejecting it. If you review the entry at http://lists.open-fcoe.org/pipermail/fcoe-devel/2009-June/002828.html, you will see that code was added to validate that the target is actually off the fabric instead of assuming it is based on the RSCN alone. There are various reasons to do this, but suffice it to say that it is better to be safe than sorry in this instance.

Unfortunately there is no fix for this at this time, which is why it is potentially so impactful to our customers: it means they are effectively unable to perform array maintenance without the risk of VMs crashing or even data corruption. Cisco is fixing this in NX-OS 7.3(1), which is due out in a few weeks.
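
Until the fixed NX-OS release ships, one way to check whether a host went through this during a maintenance window is to scan its vmkernel log for the repeating LOGO timeout signature shown above. Here is a rough Python sketch, assuming log lines in the same format as the excerpts in this post; the threshold of three repeats is an arbitrary choice, not an official value:

import re
from collections import Counter

# Signature of the stuck loop: repeated "Received a LOGO response timeout"
# messages for the same rport, as in the second capture above.
LOGO_TIMEOUT = re.compile(r"rport (\w+): Received a LOGO response timeout")

def stuck_rports(vmkernel_lines, min_repeats=3):
    """Return rport IDs whose LOGO timeouts repeat at least min_repeats times."""
    counts = Counter()
    for line in vmkernel_lines:
        match = LOGO_TIMEOUT.search(line)
        if match:
            counts[match.group(1)] += 1
    return [rport for rport, count in counts.items() if count >= min_repeats]

# Example usage against a copy of the host's vmkernel log:
with open("vmkernel.log") as log:
    for rport in stuck_rports(log):
        print("rport %s looks stuck in the ADISC/LOGO retry loop" % rport)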

Here are a couple of references regarding this issue:

 

Cheers,
Nathan Small
Technical Director
Global Support Services
VMware

Top 20 vSAN articles for July 2016

Here is our Top 20 vSAN articles list for July 2016. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

  1. Performance Degradation of Hybrid disk groups on VSAN 6.2 Deployments
  2. vSphere 5.5 Virtual SAN requirements
  3. Requirements and considerations for the deployment of VMware Virtual SAN (VSAN)
  4. VMware Virtual SAN 6.1 fulfillment
  5. Considerations when using both Virtual SAN and non-Virtual SAN disks with the same storage controller
  6. “Host cannot communicate with all other nodes in virtual SAN enabled cluster” error
  7. Virtual SAN 6.2 on disk upgrade fails at 10%
  8. Enabling or disabling a Virtual SAN cluster
  9. Network interfaces used for Virtual SAN are not ready
  10. Adding a host back to a Virtual SAN cluster after an ESXi host rebuild
  11. Changing the multicast address used for a VMware Virtual SAN Cluster
  12. Creating or editing a virtual machine Storage Policy to correct a missing Virtual SAN (VSAN) VASA provider fails
  13. VSAN disk components are marked ABSENT after enabling CBT
  14. Cannot see or manually add VMware Virtual SAN (VSAN) Storage Providers in the VMware vSphere Web Client
  15. Creating new objects on a VMware Virtual SAN Datastore fails and reports the error: Failed to create directory VCENTER (Cannot Create File)
  16. Powering on virtual machines in VMware Virtual SAN 5.5 fails with error: Failed to create swap file
  17. Virtual SAN Health Service – Limits Health – After one additional host failure
  18. VMware recommended settings for RAID0 logical volumes on certain 6G LSI based RAID VSAN
  19. Upgrading the VMware Virtual SAN (VSAN) on-disk format version from 1 to 2
  20. Understanding Virtual SAN on-disk format versions

NSX for vSphere Field Advisory – June 2016 Edition

End of General Support for VMware NSX for vSphere 6.1.x has been extended by 3 months to January 15th, 2017. This is to allow customers to have time to upgrade from NSX for vSphere 6.1.7, which contains an important security patch improving input validation of the system, to the latest 6.2.x release. For recommended upgrade paths, refer to the latest NSX for vSphere 6.2 Release Notes and the VMware Interoperability Matrix.
Migration of Service VM (SVM) may cause ESXi host issues in VMware NSX for vSphere 6.x (2141410). See also the CAUTION statement in the 6.2.3 Administration Guide.

Do not migrate the Service VM (SVM) manually (vMotion/SvMotion) to another ESXi host in the cluster.
The latest versions of vSphere 5.5 and 6.0 inhibit vMotion migration. However, storage vMotion is not blocked, and such movement may lead to unpredictable results on the destination host.

vCenter Server 6.0 restart/reboot results in duplicate VTEPs on VXLAN prepared ESXi hosts (2144605). The NSX-side update to protect against this issue is available in 6.2.3. This issue will be resolved fully in a future version of vCenter.

Top Issues:

Important new and changed KBs with NSX for vSphere 6.2.3. For more information, see Troubleshooting VMware NSX for vSphere 6.x (2122691).

Important new and changed documentation with NSX for vSphere 6.2.3 – see the NSX Documentation Center

How to track the top field issues

Windows 2008+ incremental backups become full backups in ESXi 6.0 b3825889

VMware is actively working to address a recently discovered issue wherein an incremental backup becomes a full backup when backing up Windows 2008 (or higher) virtual machines with a VSS-based, application-quiesced snapshot.

This recent CBT (Changed Block Tracking) issue does not cause any data loss or data corruption.

This issue is well understood and VMware engineering is actively working on a fix.

For more details on this issue and latest status on resolution, please refer to KB article: After upgrading to ESXi 6.0 Build 3825889, incremental virtual machine backups effectively run as full backups when application consistent quiescing is enabled (2145895)

Subscribe to the RSS feed for the KB article to ensure you do not miss any updates.

New and updated KB articles for NSX for vSphere 6.2.3

We've just released the bits for NSX for vSphere 6.2.3 and thought all of you making the upgrade would want to be on top of all the ins and outs of this release.

Here is a list of new and/or updated articles in our Knowledgebase:
NSX for vSphere 6.2.3   |   Released 09 June 2016   |   Build 3979471

Of course, do not miss the release notes, which can be found here: NSX for vSphere 6.2.3 Release Notes

NSX for vSphere Field Advisory – May 2016 Edition

Our NSX support team would like all of our customers to know about important KB updates for NSX for vSphere issues.

Here's what's new and trending:

Important: Upgrades from NSX for vSphere 6.1.6 to 6.2.2 are not supported.
See KB 2145543 NSX Controller upgrade fails with the error: 409 (Conflict); invoking error handler.

vCloud Networking and Security will reach End of Availability and End of General Support on September 19, 2016.

NEW! See our first NSX KBTV YouTube video: https://youtu.be/5pSNfnk1_MA

vShield Endpoint and vCNS End Of Availability (EOA) – See:
KB 2105558 Support for partner integrations with VMware vShield Endpoint and VMware vCloud Networking and Security.
KB 2110078 Implementation of VMware vShield Endpoint beyond vCloud Networking and Security End of Availability (EOA).
Future releases of NSX for vSphere 6.2.x will enable customers to manage vShield Endpoint from NSX Manager. Customers who purchased vSphere with vShield Endpoint will be able to download NSX.

NSX for vSphere 6.1.x will reach End of General Support on October 15, 2016.

NEW! VMware has extended the End of General Support date to three years after GA for NSX for vSphere 6.2.x only. The VMware Lifecycle Product Matrix now reflects this change.

New and Important issues:

KB 2144551 Configuring a default gateway on the DLR in NSX fails
KB 2144605 vCenter 6.0 restart/reboot may result in duplicate VTEPs on VXLAN prepared ESX hosts.
KB 2143998 NSX Edge virtual machines do not failover during a vSphere HA event
KB 2145571 NSX Edge fails to power on when logging all ACCEPT firewall rules
KB 2145468 NSX Edge uplink interface does not process any traffic after it is disabled and re-enabled in ECMP environment
KB 2139067 Shutdown/Startup order of the NSX for vSphere 6.x environment after a maintenance window or a power outage.  Please refer to the updated sequence for a cross VC environment.
KB 2145447 NetX/Service Instance filter created in vCNS disappears after upgrading to NSX
KB 2145322 NSX Edge logs show Memory Overloaded warnings
KB 2144901 Unexpected TCP interruption on TCP sessions during Edge High Availability (HA) failover in VMware NSX for vSphere 6.2.x
KB 2145273 Troubleshooting DLR using NSX Central CLI
KB 2145359 Pings fail between two VMs on different hosts across a logical switch

How to track the top field issues:

Important NSX for vSphere KB Updates – March 2016

vCloud Networking and Security will reach end of availability and end of support on September 19, 2016.

  • KB 2144733 – End of Availability and End of Support Life for VMware vCloud Networking and Security 5.5.x
  • See the fully updated vCNS to NSX Upgrade Guide
  • See also KB 2144620 – VMware vCloud Networking and Security 5.5.x upgrade to NSX for vSphere 6.2.x Best Practices
  • Upgrade path from vCNS 5.x: Using the NSX upgrade bundle posted on or after 31 March 2016, you may upgrade directly from vCNS 5.1.x or vCNS 5.5.x to NSX 6.2.2. Please see the NSX 6.2.2 release notes.
  • Upgrades from NSX 6.1.5 to NSX 6.2.0 are not supported. Instead, you must upgrade from NSX 6.1.5 to NSX 6.2.1 or later to avoid a regression in functionality. Refer to KB 2129200

NSX for vSphere 6.1.x will reach end of availability and end of support on October 15, 2016

  • KB 2144769 – End of Availability and End of Support Life for VMware NSX for vSphere 6.1.x
  • The recommended release for NSX-V is 6.2.2. Refer to KB 2144295 – Recommended minimum version for VMware NSX for vSphere with Guest Introspection Driver, ESXi and vCenter Server.

New issues:

  • KB 2144726 – Service Composer fails to translate virtual machines into security-groups in VMware NSX for vSphere 6.x
  • KB 2140891 – Storage vMotion of Edge appliance disrupts VIX communication in VMware vCloud Networking and Security 5.5.x and NSX for vSphere 6.x
  • KB 2144476 – After reinstalling vCenter Server 6.0, EAM fails to push VIBs to ESXi host with the error: Host not covered by scope anymore
  • KB 2144456 – Importing draft firewall rules fails after existing firewall configuration is removed by a REST API request
  • KB 2144387 – After upgrading to VMware NSX for vSphere 6.2.2 there is no upgrade option available for NSX Guest Introspection and NSX Data Security and the services remain at version 6.2.1
  • KB 2144420 – Any changes to the Primary UDLR result in the vNic_0 being shut down on the Secondary UDLR in VMware Cross-vCenter NSX for vSphere 6.2.x
  • KB 2144236 – VMware Tools issue: Virtual machine performance issues after upgrading VMware Tools to version 10.0.x in NSX / VMware vCloud Networking and Security 5.5.x
  • KB 2144649 – IPv4 IP addresses do not get auto approved when SpoofGuard policy is set to Trust On First Use (TOFU) in VMware NSX for vSphere 6.2.x
  • KB 2144732 – In VMware NSX for vSphere 6.x, unpreparing a Stateless ESXi host fails with the error: Agent VIB module is not installed. Cause : 15 The installation transaction failed. The transaction is not supported
  • KB 2135956 – VMware ESXi 6.0 Update 1 host fails with a purple diagnostic screen and reports the error: PANIC bora/vmkernel/main/dlmalloc.c:4923 – Usage error in dlmalloc. Now resolved in vSphere 6.0 U2; see also the vSphere 6.0 U2 Release Notes
  • KB 2126275 – Publishing Distributed Firewall (DFW) rules fails after referenced object is deleted in VMware NSX for vSphere 6.1.x and 6.2.x

Tracking the top issues:

How to Hyperlink KB articles

There are many, many blogs, forums, and video channels out there that cover various VMware topics, and we commonly see our KB articles being linked from those sites.

We wanted to help those doing this with a couple of quick tips that will make your links easier to read and more resilient to potential changes in our CMS in the future.

When you open one of our KB articles and look at your URL bar, you'll see a long, parameterized URL. For example:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006427

As you can see, this URL includes unnecessary parameters for sharing and just looks long and ugly. Here’s a much cleaner format to use:

https://kb.vmware.com/kb/1006427

The beginning part is always the same; just substitute the actual KB number you wish to reference at the end. Also note that we are now using https rather than the older and less secure http protocol.
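
If you build these links programmatically (for release notes, internal wikis, and so on), normalizing the long form is a few lines of code. Here is a small sketch in Python, assuming the long URL always carries the article number in its externalId parameter as in the example above:

from urllib.parse import urlparse, parse_qs

def short_kb_url(long_url):
    """Convert a long kb.vmware.com URL into the short https form."""
    params = parse_qs(urlparse(long_url).query)
    kb_id = params["externalId"][0]
    return "https://kb.vmware.com/kb/" + kb_id

long_url = ("http://kb.vmware.com/selfservice/microsites/search.do"
            "?language=en_US&cmd=displayKC&externalId=1006427")
print(short_kb_url(long_url))  # -> https://kb.vmware.com/kb/1006427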

Hope this helps those of you sharing our content; you have our thanks!

Knowledgebase improvements

Over the weekend, the VMware Knowledgebase website was upgraded to improve the customer experience in several ways.

Main objectives of the upgrade were to:

  • Upgrade the Knowledgebase application software/hardware
  • Improve stability and security of the website
  • Implement Secure Sockets Layer (SSL)
  • Implement a Captcha system into the comments area

We will continue our efforts to make the Knowledgebase system as secure and robust as our customers expect.

New NSX KBs you need to know

Here is a list of currently trending KBs for NSX for vSphere

Netcpa issues in VMware NSX for vSphere 6.x (2137005)
Symptoms:

  • Virtual machines running on the same ESXi host fail to communicate with each other
  • Virtual machines fail to communicate with the NSX Edge Gateway (ESG)
  • Routing does not appear to be functioning despite having a defined route for the NSX Edge Gateway
  • Rebooting the NSX Edge does not resolve the issue
  • Running the esxcli network vswitch dvs vmware vxlan network list --vds-name=Name_VDS command on the ESXi host displays the VNIs as down

Network connectivity issues after upgrading or repreparing ESXi hosts in a VMware NSX/VCNS environment (2107951)
Symptoms:

  • Connecting virtual machines to a logical switch fails
  • Re-connecting the virtual machine to a logical switch fails

Oracle connections time out when forwarded through the VMware NSX for vSphere 6.1.x Edge (2126674)
Symptoms:

  • Oracle connections time out when forwarded through the VMware NSX for vSphere 6.1.x Edge
  • The NSX Edge drops packets and the client/server keeps retransmitting the same packets

MSRPC connections time out when forwarded through the VMware NSX for vSphere 6.1.x Edge (2137751)
Symptoms:

  • MSRPC connections time out when forwarded through the VMware NSX for vSphere 6.1.x Edge

Linux virtual machines with NFSv3 mounts experience an operating system hang after a more than 15-minute outage on the upstream datapath (2133815)
Symptoms:

  • Applications interacting with an NFSv3 mount experience a hang on hard-mounted NFSv3 mounts, or an error on soft-mounted NFSv3 mounts, after an upstream data path failure of more than 15 minutes
  • Virtual machine performance degrades the longer the virtual machine remains in the NFS hung state

VMware ESXi 6.0 Update 1 host fails with a purple diagnostic screen and reports the error: PANIC bora/vmkernel/main/dlmalloc.c:4923 – Usage error in dlmalloc (2135956)
Symptoms:

  • VMware ESXi 6.x host fails with a purple diagnostic screen
  • Netflow function is enabled on the ESXi host

Recommended minimum VMware Tools versions for Guest Introspection in VMware NSX for vSphere 6.x (2139740)
Known issues:

  • Windows virtual machines using the vShield Endpoint TDI Manager or NSX Network Introspection (vnetflt.sys) driver fail with a blue diagnostic screen

Understanding and troubleshooting Message Bus in VMware NSX for vSphere 6.x (2133897)
Symptoms:

  • Refer to this article when a communication issue between the NSX Manager and the ESXi hosts is suspected of causing symptoms such as a failure to publish configuration. Examples of such symptoms are:
    • Publishing firewall rules fails
    • Some of the ESXi hosts do not have the VDR/LIF information configured through the NSX Manager