
Tag Archives: Troubleshooting

VSAN Troubleshooting Video Series

VMware Technical Support University is very proud to present our Virtual SAN Troubleshooting Video Series, comprising 26 bite-sized videos covering not only troubleshooting but also design, requirements, compatibility, and upgrades. These were first presented in the vSAN troubleshooting webinar conducted on 27 October 2016. This video collection, presented by Francis Daly, is not to be missed!

  1. Introduction to vSAN Troubleshooting – [1:27]
  2. vSAN Compatibility: Introduction – [2:51]
  3. vSAN Compatibility: vSAN Hardware & Software Requirements – [2:16]
  4. vSAN Compatibility: vSAN Architectural Best Practices – [7:57]
  5. vSAN Compatibility: Is My SSD Supported in vSAN? – [9:20]
  6. vSAN Compatibility: Is My RAID Controller Supported? – [8:00]
  7. Storage Policies: Introduction – [1:28]
  8. Storage Policies: vSAN Storage Policies In-Depth – [2:23]
  9. Storage Policies: What Are Witnesses? – [3:29]
  10. Storage Policies: Component States – [4:59]
  11. Storage Policies: Policy & Object States – [3:22]
  12. Storage Policies: Component Layout FTT1 SW1 – [3:25]
  13. Storage Policies: Component Layout FTT1 SW2 – [3:52]
  14. Storage Policies: Component Layout FTT1 SW1 VMDK 400 – [3:12]
  15. Storage Policies: FTT2 SW2 VMDK100 – [1:58]
  16. Storage Policies: Summary – [3:31]
  17. Common Issues: The Upgrade Process – [2:15]
  18. Common Issues: Upgrade Best Practices – [7:25]
  19. Common Issues: Inaccessible Objects – [3:25]
  20. Common Issues: Creation, Modification of Disk Groups – [3:43]
  21. Common Issues: Capacity – [1:34]
  22. Common Issues: Summary – [1:39]
  23. vSAN Software Components – [7:31]
  24. vSAN Troubleshooting Tools (Part 1) – [7:46]
  25. vSAN Troubleshooting Tools (Part 2) – [9:28]
  26. Summary – vSAN Troubleshooting – [1:23]

Troubleshooting Virtual SAN Providers status: Disconnected

This video demonstrates how to troubleshoot Virtual SAN Providers that display a Disconnected status in the vSphere Web Client. This issue occurs when the SMS certificate for the vCenter Server has expired.

To resolve this issue, the expired certificate is removed and a new certificate is generated.

Troubleshooting Virtual SAN on-disk format upgrade to 3.0 failures

This video demonstrates how to troubleshoot Virtual SAN on-disk format upgrades to 3.0, which may fail in small Virtual SAN clusters or in ROBO/stretched clusters.

Attempting an on-disk upgrade in certain vSAN configurations may result in failure. Configurations that can cause these errors include:

  • A stretched vSAN cluster consists of two ESXi hosts and a witness node (ROBO configuration)
  • Each host in the stretched cluster contains a single vSAN disk group
  • A Virtual SAN cluster consists of three normal nodes, with one disk group per node
  • A Virtual SAN cluster is very full, preventing the “full data migration” disk-group decommission mode

To allow an upgrade to proceed in these configurations, a compromise on availability must be made. Data accessibility is maintained, but the redundant copy of the data is lost and rebuilt during the upgrade process. As a result, the data is exposed to faults; for example, the loss of a disk on another node may result in data loss. This exposure to additional failure risk is referred to as “reduced redundancy,” and it must be specified manually in the Ruby vSphere Console (RVC) to allow the upgrade to proceed. It is not possible to specify reduced redundancy when starting the upgrade from the vSphere Web Client.
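As a sketch only (the exact command name and options vary by vSAN release, and the cluster path here is illustrative), starting the upgrade with reduced redundancy from RVC looks along these lines:

```
/localhost/Datacenter/computers> vsan.ondisk_upgrade --allow-reduced-redundancy <cluster>
```

Without the reduced-redundancy flag, the upgrade in these configurations fails when the disk-group evacuation cannot find space for a full data migration.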

Caution: During upgrade, a single point of failure is exposed. Follow all VMware best practices, and your business practices, regarding the backup of important data and virtual machines.

Important KB updates for current NSX for vSphere users – May 2016 Edition

Our NSX support team would like all of our customers to know about important KB updates for current NSX for vSphere issues. Here’s what’s new and trending:

Please take note of key updates to the following important End of General Support and End of Availability events:

New and important issues:

NSX for Multi-Hypervisor:

New master playbook KBs:

How to track the top field issues:

 

Host disconnected from vCenter and VMs showing as inaccessible

Another deep-dive troubleshooting blog today from Nathan Small (twitter account: @vSphereStorage)
 
Description from customer:
 
Host is getting disconnected from vCenter and VMs are showing as inaccessible. Only one host is affected.
 
 
Analysis:
 
A quick review of the vmkernel log shows a log spew of H:0x7 errors to numerous LUNs. Here is a short snippet where you can see how frequently they are occurring (multiple times per second):
 
# cat /var/log/vmkernel.log
 
2016-01-13T18:54:42.994Z cpu68:8260)ScsiDeviceIO: 2326: Cmd(0x412540b96e80) 0x28, CmdSN 0x8000006b from world 11725 to dev “naa.600601601b703400a4f90c3d0668e311” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:43.027Z cpu68:8260)ScsiDeviceIO: 2326: Cmd(0x4125401b2580) 0x28, CmdSN 0x8000002e from world 11725 to dev “naa.600601601b70340064a24ada10fae211” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:43.030Z cpu68:8260)ScsiDeviceIO: 2326: Cmd(0x4125406d5380) 0x28, CmdSN 0x80000016 from world 11725 to dev “naa.600601601b7034000c70e4e610fae211” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:43.542Z cpu67:8259)ScsiDeviceIO: 2326: Cmd(0x412540748800) 0x28, CmdSN 0x80000045 from world 11725 to dev “naa.600601601b70340064a24ada10fae211” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:43.808Z cpu74:8266)ScsiDeviceIO: 2326: Cmd(0x412541229040) 0x28, CmdSN 0x8000003c from world 11725 to dev “naa.600601601b7034008e56670a11fae211” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:44.088Z cpu38:8230)ScsiDeviceIO: 2326: Cmd(0x4124c0ff4f80) 0x28, CmdSN 0x80000030 from world 11701 to dev “naa.600601601b703400220f77ab15fae211” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:44.180Z cpu74:8266)ScsiDeviceIO: 2326: Cmd(0x412540ccda80) 0x28, CmdSN 0x80000047 from world 11725 to dev “naa.600601601b70340042b582440668e311” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:44.741Z cpu61:8253)ScsiDeviceIO: 2326: Cmd(0x412540b94480) 0x28, CmdSN 0x80000051 from world 11725 to dev “naa.600601601b70340060918f5b0668e311” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:44.897Z cpu63:8255)ScsiDeviceIO: 2326: Cmd(0x412540ff3180) 0x28, CmdSN 0x8000007a from world 11725 to dev “naa.600601601b7034005c918f5b0668e311” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:45.355Z cpu78:8270)ScsiDeviceIO: 2326: Cmd(0x412540f3b2c0) 0x28, CmdSN 0x80000039 from world 11725 to dev “naa.600601601b70340060918f5b0668e311” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:45.522Z cpu70:8262)ScsiDeviceIO: 2326: Cmd(0x41254073d0c0) 0x28, CmdSN 0x8000002c from world 11725 to dev “naa.600601601b7034000e3e97350668e311” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:45.584Z cpu71:8263)ScsiDeviceIO: 2326: Cmd(0x412541021780) 0x28, CmdSN 0x80000067 from world 11725 to dev “naa.600601601b7034000e3e97350668e311” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:45.803Z cpu63:8255)ScsiDeviceIO: 2326: Cmd(0x412540d20480) 0x28, CmdSN 0x80000019 from world 11725 to dev “naa.600601601b703400d24fc7620668e311” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2016-01-13T18:54:46.253Z cpu74:8266)ScsiDeviceIO: 2326: Cmd(0x412540b96380) 0x28, CmdSN 0x8000006f from world 11725 to dev “naa.600601601b7034005e918f5b0668e311” failed H:0x7 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
 
The host-side error (H:0x7) literally translates to Storage Initiator Error, which makes it sound like something is physically wrong with the card. Understand, though, that this status is sent up the stack by the HBA driver, so it is really up to those who write the driver to decide which conditions produce it. As there are no accompanying errors from the HBA driver, which in this case is a Brocade HBA, this is all we have to work with without enabling verbose logging in the driver. Verbose logging requires a reboot, so it is not always an option when investigating root cause. The exception is when the issue is ongoing, in which case rebooting a host to capture this data is a viable option.
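To quantify which devices the spew is hitting, the H:0x7 lines can be tallied per NAA device. A minimal sketch, run here against an inline sample; on a live host you would point it at /var/log/vmkernel.log instead:

```shell
# Build a small sample log (stand-in for /var/log/vmkernel.log on a live host).
log=$(mktemp)
cat > "$log" <<'EOF'
2016-01-13T18:54:42.994Z cpu68:8260)ScsiDeviceIO: 2326: ... to dev "naa.600601601b703400a4f90c3d0668e311" failed H:0x7 D:0x0 P:0x0
2016-01-13T18:54:43.027Z cpu68:8260)ScsiDeviceIO: 2326: ... to dev "naa.600601601b70340064a24ada10fae211" failed H:0x7 D:0x0 P:0x0
2016-01-13T18:54:43.542Z cpu67:8259)ScsiDeviceIO: 2326: ... to dev "naa.600601601b70340064a24ada10fae211" failed H:0x7 D:0x0 P:0x0
EOF

# Tally H:0x7 failures per device, most affected first.
grep 'failed H:0x7' "$log" | grep -o 'naa\.[0-9a-f]*' | sort | uniq -c | sort -rn
```

A device that dominates the tally narrows the search; a spread across many devices, as in this case, points back at the adapter or fabric rather than a single LUN.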
 
Taking a LUN as an example from ‘esxcfg-mpath -b’ output to get a view of the paths and targets:
 
# esxcfg-mpath -b
 
naa.600601601b703400b6aa124c0668e311 : DGC Fibre Channel Disk (naa.600601601b703400b6aa124c0668e311)
   vmhba0:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9a WWPN: 20:01:74:86:7a:ae:1c:9a  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:63:47:20:7a:a8
   vmhba1:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba0:C0:T1:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9a WWPN: 20:01:74:86:7a:ae:1c:9a  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:6b:47:20:7a:a8
   vmhba1:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
   vmhba2:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:32 WWPN: 20:01:74:86:7a:ae:1c:32  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:63:47:20:7a:a8
   vmhba3:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba2:C0:T1:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:32 WWPN: 20:01:74:86:7a:ae:1c:32  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:6b:47:20:7a:a8
   vmhba3:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
 
Let’s look at the adapter statistics for all HBAs. I recommend always using localcli over esxcli when troubleshooting, as esxcli requires hostd to be functioning properly:
 
# localcli storage core adapter stats get
 
vmhba0:
   Successful Commands: 844542177
   Blocks Read: 243114868277
   Blocks Written: 25821448417
   Read Operations: 395494703
   Write Operations: 405753901
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 35403
   Failed Blocks Read: 57744
   Failed Blocks Written: 16843
   Failed Read Operations: 8224
   Failed Write Operations: 16450
   Failed Reserve Operations: 0
   Total Splits: 0
   PAE Commands: 0
 
vmhba1:
   Successful Commands: 502595840 <– Far less successful commands than the other adapters
   Blocks Read: 116436597821
   Blocks Written: 16509939615
   Read Operations: 216572537
   Write Operations: 245276523
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 10942696
   Failed Blocks Read: 12055379188 <– 12 billion failed blocks read! Other adapters are all less than 60,000
   Failed Blocks Written: 933809
   Failed Read Operations: 10895926
   Failed Write Operations: 25645
   Failed Reserve Operations: 0
   Total Splits: 0
   PAE Commands: 0
 
vmhba2:
   Successful Commands: 845976973
   Blocks Read: 244034940187
   Blocks Written: 26063852941
   Read Operations: 397564994
   Write Operations: 407538414
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 40468
   Failed Blocks Read: 44157
   Failed Blocks Written: 18676
   Failed Read Operations: 5506
   Failed Write Operations: 12152
   Failed Reserve Operations: 0
   Total Splits: 0
   PAE Commands: 0
 
vmhba3:
   Successful Commands: 866718515
   Blocks Read: 249837164491
   Blocks Written: 26492209531
   Read Operations: 406367844
   Write Operations: 416901703
   Reserve Operations: 0
   Reservation Conflicts: 0
   Failed Commands: 37723
   Failed Blocks Read: 23191
   Failed Blocks Written: 139380
   Failed Read Operations: 7372
   Failed Write Operations: 14878
   Failed Reserve Operations: 0
   Total Splits: 0
   PAE Commands: 0
 
 
Let’s see how often the vmkernel.log reports messages for that HBA:
 
# cat vmkernel.log |grep vmhba0|wc -l
112
 
# cat vmkernel.log |grep vmhba1|wc -l
8474 <– This HBA is mentioned over 8000 times! Not all of these are necessarily errors, of course, but given the log spew we already know is occurring, most of them likely are
 
# cat vmkernel.log |grep vmhba2|wc -l
222
 
# cat vmkernel.log |grep vmhba3|wc -l
335
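The four grep pipelines above can be collapsed into a single loop. A small sketch, again run against an inline sample; substitute /var/log/vmkernel.log on a live host:

```shell
# Sample log lines standing in for /var/log/vmkernel.log.
log=$(mktemp)
printf '%s\n' \
  'vmkernel: vmhba1 H:0x7 error' \
  'vmkernel: vmhba1 H:0x7 error' \
  'vmkernel: vmhba0 link event' > "$log"

# One mention count per HBA; a lopsided count flags the suspect adapter.
for hba in vmhba0 vmhba1 vmhba2 vmhba3; do
  printf '%s: %s\n' "$hba" "$(grep -c "$hba" "$log")"
done
```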
 
Now let’s take a look at the zoning to see whether multiple adapters are zoned to the exact same array targets (WWPNs), in an attempt to determine whether the issue is on the array side or the HBA side:
 
# esxcfg-mpath -b
 
naa.600601601b703400b6aa124c0668e311 : DGC Fibre Channel Disk (naa.600601601b703400b6aa124c0668e311)
   vmhba0:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9a WWPN: 20:01:74:86:7a:ae:1c:9a  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:63:47:20:7a:a8
   vmhba1:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba0:C0:T1:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9a WWPN: 20:01:74:86:7a:ae:1c:9a  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:6b:47:20:7a:a8
   vmhba1:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
   vmhba2:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:32 WWPN: 20:01:74:86:7a:ae:1c:32  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:63:47:20:7a:a8
   vmhba3:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba2:C0:T1:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:32 WWPN: 20:01:74:86:7a:ae:1c:32  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:6b:47:20:7a:a8
   vmhba3:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
 
Let’s isolate the HBAs so they are easier to visually compare the WWPN of the array targets:
 
vmhba1:
 
   vmhba1:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba1:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:9c WWPN: 20:01:74:86:7a:ae:1c:9c  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
 
vmhba3:
 
   vmhba3:C0:T3:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:60:47:24:7a:a8
   vmhba3:C0:T2:L20 LUN:20 state:active fc Adapter: WWNN: 20:00:74:86:7a:ae:1c:34 WWPN: 20:01:74:86:7a:ae:1c:34  Target: WWNN: 50:06:01:60:c7:20:7a:a8 WWPN: 50:06:01:68:47:24:7a:a8
 
vmhba1 and vmhba3 are zoned to the exact same array ports yet only vmhba1 is experiencing communication issues/errors.
 
 
Let’s look at the driver information under /proc/scsi/bfa/ by viewing (cat) the node information:
 
Chip Revision: Rev-E
Manufacturer: Brocade
Model Description: Brocade-1741
Instance Num: 0
Serial Num: xxxxxxxxx32
Firmware Version: 3.2.3.2
Hardware Version: Rev-E
Bios Version: 3.2.3.2
Optrom Version: 3.2.3.2
Port Count: 2
WWNN: 20:00:74:86:7a:ae:1c:9a
WWPN: 20:01:74:86:7a:ae:1c:9a
Instance num: 0
Target ID: 0 WWPN: 50:06:01:6b:47:20:7b:04
Target ID: 1 WWPN: 50:06:01:6b:47:20:7a:a8
Target ID: 2 WWPN: 50:06:01:63:47:20:7b:04
Target ID: 3 WWPN: 50:06:01:63:47:20:7a:a8
 
Chip Revision: Rev-E
Manufacturer: Brocade
Model Description: Brocade-1741
Instance Num: 1
Serial Num: xxxxxxxxx32
Firmware Version: 3.2.3.2
Hardware Version: Rev-E
Bios Version: 3.2.3.2
Optrom Version: 3.2.3.2
Port Count: 2
WWNN: 20:00:74:86:7a:ae:1c:9c
WWPN: 20:01:74:86:7a:ae:1c:9c
Instance num: 1
Target ID: 0 WWPN: 50:06:01:60:47:24:7b:04
Target ID: 1 WWPN: 50:06:01:68:47:24:7b:04
Target ID: 3 WWPN: 50:06:01:60:47:24:7a:a8
Target ID: 2 WWPN: 50:06:01:68:47:24:7a:a8
 
Chip Revision: Rev-E
Manufacturer: Brocade
Model Description: Brocade-1741
Instance Num: 2
Serial Num: xxxxxxxxx2E
Firmware Version: 3.2.3.2
Hardware Version: Rev-E
Bios Version: 3.2.3.2
Optrom Version: 3.2.3.2
Port Count: 2
WWNN: 20:00:74:86:7a:ae:1c:32
WWPN: 20:01:74:86:7a:ae:1c:32
Instance num: 2
Target ID: 0 WWPN: 50:06:01:6b:47:20:7b:04
Target ID: 1 WWPN: 50:06:01:6b:47:20:7a:a8
Target ID: 2 WWPN: 50:06:01:63:47:20:7b:04
Target ID: 3 WWPN: 50:06:01:63:47:20:7a:a8
 
Chip Revision: Rev-E
Manufacturer: Brocade
Model Description: Brocade-1741
Instance Num: 3
Serial Num: xxxxxxxxx2E
Firmware Version: 3.2.3.2
Hardware Version: Rev-E
Bios Version: 3.2.3.2
Optrom Version: 3.2.3.2
Port Count: 2
WWNN: 20:00:74:86:7a:ae:1c:34
WWPN: 20:01:74:86:7a:ae:1c:34
Instance num: 3
Target ID: 0 WWPN: 50:06:01:60:47:24:7b:04
Target ID: 1 WWPN: 50:06:01:68:47:24:7b:04
Target ID: 2 WWPN: 50:06:01:68:47:24:7a:a8
Target ID: 3 WWPN: 50:06:01:60:47:24:7a:a8
 
So all HBAs are on the same firmware, which is important from a consistency perspective. Had the firmware versions differed, there might have been something to go on, or at least a specific firmware level to check for known issues. They are obviously using the same driver as well, since only one is loaded in the kernel.
 
We can see, not only from the shared serial numbers above but also from the lspci output, that these are two-port physical cards:
 
# lspci
 
000:007:00.0 Serial bus controller: Brocade Communications Systems, Inc. Brocade-1010/1020/1007/1741 [vmhba0]
000:007:00.1 Serial bus controller: Brocade Communications Systems, Inc. Brocade-1010/1020/1007/1741 [vmhba1]
000:009:00.0 Serial bus controller: Brocade Communications Systems, Inc. Brocade-1010/1020/1007/1741 [vmhba2]
000:009:00.1 Serial bus controller: Brocade Communications Systems, Inc. Brocade-1010/1020/1007/1741 [vmhba3]
 
The first set of numbers is read as Domain:Bus:Slot.Function, so vmhba0 and vmhba1 are both on Domain 0, Bus 7, Slot 0, and Functions 0 and 1 respectively, which means it is a dual-port HBA.
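That grouping can be checked mechanically by bucketing the lspci entries on everything before the function digit. A sketch over sample lines mirroring the output above (on a live host you would pipe lspci itself):

```shell
# Group lspci-style entries by Domain:Bus:Slot; multiple functions landing in
# one bucket indicate ports that share a single physical card.
groups=$(printf '%s\n' \
  '000:007:00.0 Serial bus controller: Brocade-1741 [vmhba0]' \
  '000:007:00.1 Serial bus controller: Brocade-1741 [vmhba1]' \
  '000:009:00.0 Serial bus controller: Brocade-1741 [vmhba2]' \
  '000:009:00.1 Serial bus controller: Brocade-1741 [vmhba3]' |
  awk '{ split($1, a, "."); card[a[1]] = card[a[1]] " " $NF }
       END { for (c in card) print c ":" card[c] }' | sort)
printf '%s\n' "$groups"
```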
 
So vmhba0 and vmhba1 are on the same physical card, yet only vmhba1 is showing errors. The two chips on a dual-port HBA are mostly independent of each other, so at least this suggests there isn’t a problem with the board or circuitry they share. I say mostly because the physical ports are independent of each other, as is each HBA chip, but they do share the same physical board and the same connection to the motherboard.
 
This host is running EMC PowerPath/VE, so we know that, in general, I/O load is distributed evenly across all HBAs and paths. I say in general because PowerPath/VE is intelligent enough to use paths that exhibit more errors, or higher latency, less frequently than other paths.
 
I believe we may be looking at either a cable issue (loose, faulty, or a bad GBIC) between vmhba1 and the switch, or a problem with the switch port that vmhba1 is connected to. Here is why:
 
1. vmhba1 is seeing thousands upon thousands of errors while the other HBAs are very quiet
2. vmhba1 and vmhba3 are zoned to the exact same targets yet only vmhba1 is seeing errors
3. vmhba0 and vmhba1 are the same physical card yet only vmhba1 is seeing errors
 
My recommendation would be to check the physical switch port error counters and possibly replace the cable to see if the errors subside. It is standard practice to reset the switch counters and then monitor, so that step may be needed to validate whether CRC or other fabric errors are still occurring.
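On a Brocade FC switch, for example, that check-and-reset cycle looks roughly like this (Fabric OS commands; verify the equivalents for your switch vendor):

```
switch:admin> porterrshow    # per-port error counters (crc err, enc out, link fail, ...)
switch:admin> statsclear     # reset the counters, then re-check after a monitoring interval
```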
 
Cheers,
Nathan (twitter account: @vSphereStorage)

Issues creating Desktops in VDI, and what to do about it

We want to highlight some mitigation techniques and a handy KB article today for those of you managing Horizon View desktops. We’re talking about those occasions when no desktop can be created or recomposed in your VDI environment and no tasks submitted from Connection Brokers are acted upon by the vCenter Server.

Our Engineering team has been hard at work fixing many of the underlying causes, and the latest releases of View have all but eliminated these issues. However, if these issues show up in the latest View releases, we ask everyone to follow the specific steps in this KB: Pool settings are not saved, new pools cannot be created, and vCenter Server tasks are not processed in a Horizon View environment (2082413)

This KB contains several main steps, the first of which is collecting the log bundles from all connection brokers in the VDI environment and recording the time the issue was first observed. Steps 2 to 6 are basic steps that can potentially address the issue; if the issue persists, step 7 requests opening a support case and submitting the log bundles collected in step 1, along with the recorded time the issue was first observed. You might also include any other useful information, such as whether any recent changes were made to the environment.

When opening your support case, please note this KB article number in the case description. That helps us get right on point ASAP.

Step 8 is what should address this issue without any connection broker reboot: it stops all View services on all View connection brokers and then restarts them.

If step 8 does not resolve your issue, the last step (9) involves rebooting all connection servers, and in our experience this has always addressed the issue.

Troubleshooting File Level Recovery with vSphere Data Protection – KBTV Webinars

This video is the second in a new series of free Webinars that we are releasing in which our Technical Support staff members present on various topics across a wide range of VMware’s product portfolio.

The title for this presentation is Troubleshooting File Level Recovery with vSphere Data Protection and it dives into some real world examples of how you can troubleshoot file level recovery issues with vSphere Data Protection.

This presentation was originally broadcast live on Thursday 5th March 2015.

To see the details of upcoming webinars in this series, see the Support Insider Blog post at New Free Webinars.

NOTE: This video is 17 minutes in length so it would be worth blocking out some time to watch it!

50 articles that fix EVERYTHING in Horizon View!

Ok, our title may exhibit a teeny-tiny hint of hyperbole, but seriously, the following list of articles covers the majority of issues that you can solve yourself. We’ve posted lists like this before but limited them to twenty articles. Why do that? Your problem might be number twenty-one.

  1. Manually deleting linked clones or stale virtual desktop entries from the View Composer database in VMware View Manager and Horizon View (2015112)
  2. Correlating VMware products build numbers to update levels (1014508)
  3. Generating and importing a signed SSL certificate into VMware Horizon View 5.1/5.2/5.3/6.0 using Microsoft Certreq (2032400)
  4. Pool settings are not saved, new pools cannot be created, and vCenter Server tasks are not processed in a Horizon View environment (2082413)
  5. VMware Products and CVE-2014-3566 (POODLE) (2092133)
  6. VMware Horizon View Best Practices (1020305)
  7. Network connectivity requirements for VMware View Manager 4.5 and later (1027217)
  8. Manually deleting replica virtual machines in VMware Horizon View 5.x (1008704)
  9. Finding and removing unused replica virtual machines in the VMware Horizon View (2009844)
  10. Restart order of the View environment to clear ADLDS (ADAM) synchronization in View 4.5, 4.6, 5.0, and 5.1 (2068381)
  11. Collecting diagnostic information for VMware Horizon View (1017939)
  12. View Connection Server reports the error: [ws_TomcatService] STDOUT: java.lang.OutOfMemoryError: Java heap space (2009877)
  13. Legacy applications fail to start with the VMware View 6.0 or 6.0.1 agent installed (2091845)
  14. Provisioning VMware Horizon View desktops fails with error: View Composer Agent initialization error (16): Failed to activate software license. (1026556)
  15. Cannot detach a Persistent Disk in View Manager 4.5 and later (2007076)
  16. Provisioning View desktops fails due to customization timeout errors (2007319)
  17. The View virtual machine is not accessible and the View Administration console shows the virtual machine status as “Already Used” (1000590)
  18. Forcing replication between ADAM databases (1021805)
  19. Administration dashboard in VMware Horizon View 5.1/5.2/5.3 reports the error: Server’s certificate cannot be checked (2000063)
  20. Troubleshooting SSL certificate issues in VMware Horizon View 5.1 and later (2082408)
  21. Removing a standard (replica) connection server or a security server from a cluster of connection/security servers (1010153)
  22. Troubleshooting Persona Management (2008457)
  23. View Persona Management features do not function when Windows Client-Side Caching is in effect. (2016416)
  24. Resolving licensing errors when deploying virtual Office to a system with Office installed natively. (2107369)
  25. Manually deleting linked clones or stale virtual desktop entries from VMware View Manager (1008658)
  26. Troubleshooting a black screen when logging into a Horizon View virtual desktop using PCoIP (1028332)
  27. The PCoIP server log reports the error: Error attaching to SVGADevTap, error 4000: EscapeFailed (1029706)
  28. Connecting to the View ADAM Database (2012377)
  29. Disabling SSLv3 connections over HTTPS to View Security Server and View Connection Server (2094442)
  30. The Event database performance in VMware View 6.0.x is extremely slow (2094580)
  31. Intermittent provisioning issues and generic errors when Composer and vCenter Server are co-installed (2105261)
  32. Performing an end-to-end backup and restore for VMware View Manager (1008046)
  33. Installing VMware View Agent or View Composer fails with the error: The system must be rebooted before installation can continue (1029288)
  34. View Connection Server fails to replicate (2014488)
  35. Provisioning View desktops fail with the error: View Composer Fault: VC operation exceeded the task timeout limit set by View Composer (2030047)
  36. Calculating datastore selection for linked-clone desktops in Horizon View 5.2 or later releases (2047492)
  37. VMware Horizon View Admin dashboard for vCenter Server 5.1 displays the message: VC service is not working properly (2050369)
  38. Generating a Horizon View SSL certificate request using the Microsoft Management Console (MMC) Certificates snap-in (2068666)
  39. Reconnecting to the VDI desktop with PCoIP displays a black screen (2073945)
  40. PCI Scan indicates that TCP Port 4172 PCoIP Secure Gateway is vulnerable to POODLE (CVE-2014-3566) (2099458)
  41. Troubleshooting USB redirection problems in VMware View Manager (1026991)
  42. Location of VMware View log files (1027744)
  43. Migrating linked clone pools to a different or new datastore (1028754)
  44. Error during provisioning: Unable to find folder (1038077)
  45. Unable to connect to a VMware View Manager desktop via the Security Server from outside the firewall (1039021)
  46. VMware View Agent fails to uninstall (2000017)
  47. View Manager Admin console displays the error: Error during provisioning: Unexpected VC fault from View Composer (Unknown) (2014321)
  48. Consolidating disks associated with a backup snapshot fails with the error: The file is already in use (2040846)
  49. Troubleshooting VMware Horizon View HTML Access (2046427)
  50. USB redirection may not work on cloned images after upgrading master image from VMware Horizon View 5.1 to 5.2 and 5.3 (2051801)

Horizon View PCoIP issues?

Here’s our latest top list of KB articles you should know about when encountering issues with PCoIP in Horizon View. It can be tricky to configure and troubleshoot even for the best of us, so here are some golden nuggets to help you on your way.

Troubleshooting Composer for VMware Horizon View

***UPDATE: We just published a great troubleshooting KB article for Composer that covers a lot of the common issues customers encounter here.

At some point, every View administrator who uses a linked clone pool is going to need to do some troubleshooting. Most linked clone troubleshooting involves a component called Composer.
What is Composer, and what does it do?
Composer is an add-on for VMware Horizon View and is used to build linked clone desktops. Details about linked clones and Composer operations can be found in my previous posts, What is a linked clone? and part II of that topic.

Today we will focus on troubleshooting Composer when it breaks.

We are in the process of compiling a KB which will serve as the go-to article for Composer. This will contain links to important KBs, common issues, and procedures for repair. In the meantime, I thought this tactic would be good to share.

Compatibility

Compatibility is more important than many admins realize. VMware builds, tests, certifies and supports components that are built to work together.

The Connection Server talks to:

  • Composer
  • View Agent
  • vCenter
  • Security Server
  • and the clients

Composer, in turn, talks to:

  • vCenter
  • Connection Server
  • The hosts
  • Active Directory
  • The guest OS

As you can imagine, it’s easy for problems to balloon out of control if a component doesn’t talk properly to another. So, you need to ensure that every component is compatible and is designed to work with every other.

Here’s how:

Identifying where a problem exists is the first step to solving it. Composer can be nonfunctional because of factors entirely outside of Composer itself. For example, View doesn’t build desktops; that step is done by vCenter through API calls. If you are trying to build desktops and nothing is happening, it doesn’t necessarily mean View or Composer is at fault. vCenter needs to be functioning properly for View to be able to provision desktops. Along the same lines, Composer needs to be able to talk to all of the hosts in a cluster, plus your Active Directory, to be able to customize VMs, so if you have a dead host, Composer will fail.

Is Composer at fault, then, if it doesn’t work? Well, what about the guest VM? Does it get an IP address? Does it boot? If the answer to any of these is no, then Composer can’t do its job.

One of the tactics I take when Composer fails is to manually step through all of the processes involved.

  • Can I clone the base image?
  • Can I customize it?
  • Can I activate it?
  • Is it network accessible?

If the answer to any of these questions is no, the problem is outside of Composer. Understanding where the linked clone process fails is the key to resolving problems.