
Troubleshooting VMs that power-off randomly

By Nathan Small

A customer opened a case with us because they were experiencing an issue where their VMs would simply power off at random, with vCenter reporting the following message:

“Cannot power on vmserver0424 on esx-esx128.server.company.com. Could not open/create change tracking file”

This was not isolated to a group or type of VM; it didn’t matter which VM or which cluster. If they attempted to power on the VM immediately after the power-off event, sometimes it would power up successfully; other times it would fail to do so until they waited a bit longer.

In this blog post I will show you how to troubleshoot this type of problem, starting with the logs of the virtual machine, moving through the ESX host logs, and going all the way to vCenter and beyond.

The first thing to do is review the log file of a virtual machine that powers off in this fashion to try to understand what is causing this to occur. For this example we will take the VM named in the error message we received from vCenter, vmserver0424:

esx-esx128.server.company.com-2014-10-15–18.56/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmware-6.log

2014-10-05T01:27:02.354Z| vcpu-0| I120: AIOMGR: AIOMgr_OpenWithRetry: Descriptor file ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ locked (try 0)
2014-10-05T01:27:06.661Z| vcpu-0| I120: AIOMGR: AIOMgr_OpenWithRetry: Descriptor file ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ locked (try 1)
2014-10-05T01:27:10.973Z| vcpu-0| I120: AIOMGR: AIOMgr_OpenWithRetry: Descriptor file ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ locked (try 2)
2014-10-05T01:27:15.283Z| vcpu-0| I120: AIOMGR: AIOMgr_OpenWithRetry: Descriptor file ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ locked (try 3)
2014-10-05T01:27:19.693Z| vcpu-0| I120: AIOMGR: AIOMgr_OpenWithRetry: Descriptor file ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ locked (try 4)
2014-10-05T01:27:24.002Z| vcpu-0| I120: OBJLIB-FILEBE : FileBEOpen: can’t open ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ : Failed to lock the file (262146).
2014-10-05T01:27:24.002Z| vcpu-0| I120: DISKLIB-DSCPTR: DescriptorDetermineType: failed to open ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’:
2014-10-05T01:27:24.002Z| vcpu-0| I120: DISKLIB-LINK : “/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk” : failed to open (Failed to lock the file).
2014-10-05T01:27:24.002Z| vcpu-0| I120: DISKLIB-CHAIN : “/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk” : failed to open (Failed to lock the file).
2014-10-05T01:27:24.002Z| vcpu-0| I120: DISKLIB-LIB : Failed to open ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ with flags 0xa Failed to lock the file (16392).
2014-10-05T01:27:24.002Z| vcpu-0| I120: DISK: Cannot open disk “/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk”: Failed to lock the file (16392).
2014-10-05T01:27:24.002Z| vcpu-0| I120: Msg_Post: Error
2014-10-05T01:27:24.002Z| vcpu-0| I120: [msg.fileio.lock] Failed to lock the file
2014-10-05T01:27:24.002Z| vcpu-0| I120: [msg.disk.noBackEnd] Cannot open the disk ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ or one of the snapshot disks it depends on.
2014-10-05T01:27:24.002Z| vcpu-0| I120: —————————————-
2014-10-05T01:27:24.015Z| vcpu-0| I120: VMXNET3 user: Ethernet0 Driver Info: version = 833450 gosBits = 2 gosType = 2, gosVer = 24848, gosMisc = 212
2014-10-05T01:27:24.032Z| vcpu-0| I120: Msg_Post: Error
2014-10-05T01:27:24.032Z| vcpu-0| I120: [msg.checkpoint.continuesync.error] An operation required the virtual machine to quiesce and the virtual machine was unable to continue running.
2014-10-05T01:27:24.032Z| vcpu-0| I120: —————————————-
2014-10-05T01:27:24.039Z| vcpu-0| I120: Msg_Post: Error
2014-10-05T01:27:24.039Z| vcpu-0| I120: [msg.poweroff.commitOn] Performing disk cleanup. Cannot power off.
2014-10-05T01:27:24.039Z| vcpu-0| I120: —————————————-
2014-10-05T01:27:24.042Z| vcpu-0| I120: SnapshotVMXConsolidateOnlineCB: nextState = 2 uid 598
2014-10-05T01:27:24.042Z| vcpu-0| I120: Vix: [15205934 mainDispatch.c:3870]: VMAutomation_ReportPowerOpFinished: statevar=3, newAppState=1881, success=1 additionalError=0
2014-10-05T01:27:24.042Z| vcpu-0| I120: Msg_Post: Error
2014-10-05T01:27:24.042Z| vcpu-0| I120: [msg.snapshotVMX.consolidate.cptError] An error occurred while consolidating disks. The virtual machine will be powered off.
2014-10-05T01:27:24.042Z| vcpu-0| I120: —————————————-
2014-10-05T01:27:24.045Z| vcpu-0| I120: Vix: [15205934 vigorCommands.c:548]: VigorSnapshotManagerConsolidateCallback: snapshotErr = Unable to save snapshot file (13:0)
2014-10-05T01:27:24.045Z| vcpu-0| I120: SnapshotVMXConsolidateOnlineCB: Destroying thread 12
2014-10-05T01:27:24.045Z| vcpu-0| I120: Turning off snapshot info cache.
2014-10-05T01:27:24.047Z| vcpu-0| I120: Turning off snapshot disk cache.
2014-10-05T01:27:24.047Z| vcpu-0| I120: SnapshotVMXConsolidateOnlineCB: Done with consolidate
2014-10-05T01:27:24.047Z| vmx| I120: Stopping VCPU threads…

While numerous errors are reported here, they all stem from the fact that after 4 retries the VMM (Virtual Machine Monitor) is unable to open a file it previously had access to. Since the VMM process is the only process that should be accessing this file, this behavior is odd: it appears that something else holds a lock on the file.
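
If you want to gauge how widespread this is, a quick sweep of the VM logs in the collected support bundles lists every virtual machine that hit the same lock failure. The path pattern below is an assumption based on the bundle layout shown above; adjust it to your environment:

grep -l "Failed to lock the file" esx-*/vmfs/volumes/*/*/vmware*.log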

The next power-on attempt fails immediately:

2014-10-05T01:49:08.752Z| Worker#1| I120: AIOMGR: AIOMgr_OpenWithRetry: Descriptor file ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ locked (try 0)
2014-10-05T01:49:13.061Z| Worker#1| I120: AIOMGR: AIOMgr_OpenWithRetry: Descriptor file ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ locked (try 1)
2014-10-05T01:49:17.371Z| Worker#1| I120: AIOMGR: AIOMgr_OpenWithRetry: Descriptor file ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ locked (try 2)
2014-10-05T01:49:21.680Z| Worker#1| I120: AIOMGR: AIOMgr_OpenWithRetry: Descriptor file ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ locked (try 3)
2014-10-05T01:49:26.017Z| Worker#1| I120: AIOMGR: AIOMgr_OpenWithRetry: Descriptor file ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ locked (try 4)
2014-10-05T01:49:30.330Z| Worker#1| I120: OBJLIB-FILEBE : FileBEOpen: can’t open ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ : Failed to lock the file (262146).
2014-10-05T01:49:30.330Z| Worker#1| I120: DISKLIB-DSCPTR: DescriptorDetermineType: failed to open ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’: Failed to lock the file (40002)
2014-10-05T01:49:30.330Z| Worker#1| I120: DISKLIB-LINK : “/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk” : failed to open (Failed to lock the file).
2014-10-05T01:49:30.330Z| Worker#1| I120: DISKLIB-CHAIN : “/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk” : failed to open (Failed to lock the file).
2014-10-05T01:49:30.330Z| Worker#1| I120: DISKLIB-LIB : Failed to open ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ with flags 0xa Failed to lock the file (16392).
2014-10-05T01:49:30.330Z| Worker#1| I120: DISK: Cannot open disk “/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk”: Failed to lock the file (16392).
2014-10-05T01:49:30.330Z| Worker#1| I120: Msg_Post: Error
2014-10-05T01:49:30.330Z| Worker#1| I120: [msg.fileio.lock] Failed to lock the file
2014-10-05T01:49:30.330Z| Worker#1| I120: [msg.disk.noBackEnd] Cannot open the disk ‘/vmfs/volumes/529e2483-9ab715a8-afb5-0017a477208a/vmserver0424/vmserver0424_1-000002.vmdk’ or one of the snapshot disks it depends on.

Let’s take another VM as an example and find out why the VMM process is unable to open the file:

esx-esx134.server.company.com-2014-10-15–16.45/vmfs/volumes/52d076c8-5d86489a-880a-0017a477208a/vmserver0420/vmware-5.log

“2014-09-30T05:58:52.233Z| vcpu-0| I120: DISKLIB-LIB : Failed to open ‘/vmfs/volumes/52d076c8-5d86489a-880a-0017a477208a/vmserver0420/vmserver0420_1-000002.vmdk’ with flags 0xa Failed to lock the file (16392).”

When an attempt to access a file fails, we may see corresponding events in /var/log/vmkernel.log. The VM above was unable to open the file at 05:58:52 UTC:

/var/log/vmkernel.log:

2014-09-30T05:58:48.229Z cpu4:6413913)DLX: 3661: vol ‘Production-B-Datastore-14′, lock at 216551424: [Req mode 2] Checking liveness:
2014-09-30T05:58:48.229Z cpu4:6413913)[type 10c00001 offset 216551424 v 10376, hb offset 3284992
gen 5, mode 1, owner 52cb3769-8777cf73-e99e-0017a4772890 mtime 71328484
num 0 gblnum 0 gblgen 0 gblbrk 0]
2014-09-30T05:58:52.233Z cpu4:6413913)DLX: 4185: vol ‘Production-B-Datastore-14′, lock at 216551424: [Req mode: 2] Not free:
2014-09-30T05:58:52.233Z cpu4:6413913)[type 10c00001 offset 216551424 v 10376, hb offset 3284992
gen 5, mode 1, owner 52cb3769-8777cf73-e99e-0017a4772890 mtime 71328484
num 0 gblnum 0 gblgen 0 gblbrk 0]
2014-09-30T05:58:53.109Z cpu22:9723)DLX: 3661: vol ‘Production-B-Datastore-14′, lock at 216551424: [Req mode 1] Checking liveness:
2014-09-30T05:58:53.109Z cpu22:9723)[type 10c00001 offset 216551424 v 10376, hb offset 3284992
gen 5, mode 1, owner 52cb3769-8777cf73-e99e-0017a4772890 mtime 71328484
num 0 gblnum 0 gblgen 0 gblbrk 0]

These messages indicate that we are checking the liveness of a file lock on volume ‘Production-B-Datastore-14’ (mode 1 is an exclusive lock) to see if it is stale and can be freed. Four seconds later we can see that the lock still has not been freed.

If we take the lock address “216551424” and look it up in the VMFS metadata, we learn that the file in question is ‘vmserver0420_1-000002.vmdk’. More importantly, we can see the UUID of the ESX host that holds the lock: owner ‘52cb3769-8777cf73-e99e-0017a4772890’. To determine which physical ESX server this is, take the last portion of the UUID, which is the MAC address of the first VMkernel port (vmk0). We can search for it across the ESX host log bundles with the following command:

grep "00:17:a4:77:28:90" esx-esx*/commands/nicinfo*
esx-esx130.server.company.com-2014-10-15–16.50/commands/nicinfo.sh.txt:vmnic0 0000:004:00.0 be2net Up 10000 Full 00:17:a4:77:28:90 9000 Emulex Corporation HP FlexFabric 10Gb 2-port 554FLB Adapter
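
If you have shell access to an ESX host that can see the datastore, you can also query the lock owner directly rather than working backwards through the log bundles. This is the approach described in KB 10051; a sketch, using the disk from this example:

vmkfstools -D /vmfs/volumes/52d076c8-5d86489a-880a-0017a477208a/vmserver0420/vmserver0420_1-000002.vmdk

The output includes the same lock fields seen in vmkernel.log (mode, owner, mtime), and the last six bytes of the owner UUID are again the vmk0 MAC address of the host holding the lock.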

Unfortunately, the vmkernel logs on this host do not go back far enough to cover the timeframe in question; however, a very strange behavior is observed in /var/log/vmkernel.log:

2014-10-12T10:09:56.885Z cpu15:23596544)FS3: 1250: vol ‘PCI-Migration’, lock at 47935488: [Req mode: 1] Checking liveness:
2014-10-12T10:09:56.885Z cpu15:23596544)[type 10c00001 offset 47935488 v 1704, hb offset 3284992
gen 42893, mode 2, owner 00000000-00000000-0000-000000000000 mtime 662559
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-12T10:09:56.903Z cpu23:9880)FS3: 1250: vol ‘PCI-Migration’, lock at 55998464: [Req mode: 1] Checking liveness:
2014-10-12T10:09:56.903Z cpu23:9880)[type 10c00001 offset 55998464 v 2941, hb offset 3284992
gen 42893, mode 2, owner 00000000-00000000-0000-000000000000 mtime 1860122
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-12T10:09:56.915Z cpu25:23303194)DLX: 3661: vol ‘Production-B-Datastore-14′, lock at 194387968: [Req mode 1] Checking liveness:
2014-10-12T10:09:56.915Z cpu25:23303194)[type 10c00001 offset 194387968 v 283, hb offset 3284992
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 3146551
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-12T10:09:57.677Z cpu6:9875)FS3: 1393: vol ‘PCI-Migration’, lock at 47556608: [Req mode: 1] Not free:
2014-10-12T10:09:57.677Z cpu6:9875)[type 10c00001 offset 47556608 v 1880, hb offset 3284992
gen 42893, mode 2, owner 00000000-00000000-0000-000000000000 mtime 661885
num 1 gblnum 0 gblgen 0 gblbrk 0]

The vmkernel logs are completely filled with these lock holder messages, which is why the log file rotated so quickly:

2014-10-12T09:46:28.305Z cpu2:9878)DLX: 3661: vol ‘Production-B-Datastore-8′, lock at 232120320: [Req mode 1] Checking liveness:
2014-10-12T09:46:28.305Z cpu2:9878)[type 10c00001 offset 232120320 v 9052, hb offset 3284992
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 2863190
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-12T09:46:28.514Z cpu2:9868)DLX: 4185: vol ‘Production-B-Datastore-14′, lock at 193937408: [Req mode: 1] Not free:
2014-10-12T09:46:28.514Z cpu2:9868)[type 10c00001 offset 193937408 v 1488, hb offset 3284992
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 3125883
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-12T09:46:28.527Z cpu2:9868)DLX: 3661: vol ‘Production-B-Datastore-14′, lock at 194019328: [Req mode 1] Checking liveness:
2014-10-12T09:46:28.527Z cpu2:9868)[type 10c00001 offset 194019328 v 1769, hb offset 3428352
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 3124506
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-12T09:46:28.540Z cpu2:24477994)DLX: 4185: vol ‘Production-B-Datastore-14′, lock at 194785280: [Req mode: 1] Not free:
2014-10-12T09:46:28.540Z cpu2:24477994)[type 10c00001 offset 194785280 v 1901, hb offset 3284992
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 3124798
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-12T09:46:28.628Z cpu2:24477994)DLX: 3661: vol ‘Production-B-Datastore-14′, lock at 194791424: [Req mode 1] Checking liveness:
2014-10-12T09:46:28.628Z cpu2:24477994)[type 10c00001 offset 194791424 v 1909, hb offset 3284992
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 3125256
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-12T09:46:28.864Z cpu4:9879)DLX: 4185: vol ‘Production-B-Datastore-14′, lock at 194242560: [Req mode: 1] Not free:
2014-10-12T09:46:28.864Z cpu4:9879)[type 10c00001 offset 194242560 v 1168, hb offset 3284992
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 3124979
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-12T09:46:28.983Z cpu4:9879)DLX: 3661: vol ‘Production-B-Datastore-14′, lock at 194248704: [Req mode 1] Checking liveness:
2014-10-12T09:46:28.983Z cpu4:9879)[type 10c00001 offset 194248704 v 1186, hb offset 3284992
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 3125436
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-12T09:46:29.240Z cpu15:23764342)FS3: 1393: vol ‘PCI-Migration’, lock at 46905344: [Req mode: 1] Not free:
2014-10-12T09:46:29.240Z cpu15:23764342)[type 10c00001 offset 46905344 v 2160, hb offset 3284992
gen 42893, mode 2, owner 00000000-00000000-0000-000000000000 mtime 662629
num 1 gblnum 0 gblgen 0 gblbrk 0]

Here we can see numerous lock liveness checks against different datastores and file lock addresses, all within the same second. It appears that this host is aggressively checking file lock liveness, which is certainly unusual behavior.
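
To get a sense of just how noisy this is, you can count the liveness checks, and break them down per datastore, straight from the vmkernel log. A sketch, run against the log on the host or in the bundle, assuming your grep supports -o:

grep -c "Checking liveness" /var/log/vmkernel.log
grep "Checking liveness" /var/log/vmkernel.log | grep -o "vol '[^']*'" | sort | uniq -c | sort -rn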

Let’s review another VM that powered off in another cluster to see if the behavior is the same across clusters:

esx-esx0040.server.company.com-2014-10-21–20.20/vmfs/volumes/53694ec4-999f3bd0-156a-9c8e9920e2b0:
VM: vmserver0260

2014-10-17T13:59:27.457Z| vmx| I120: DISKLIB-LIB : Resuming change tracking.
2014-10-17T13:59:27.461Z| vmx| I120: DISKLIB-CBT : Initializing ESX kernel change tracking for fid 337992562.
2014-10-17T13:59:27.462Z| vmx| I120: DISKLIB-CBT : Successfuly created cbt node 18605b78-cbt.
2014-10-17T13:59:27.462Z| vmx| I120: DISKLIB-CBT : Opening cbt node /vmfs/devices/cbt/18605b78-cbt
2014-10-17T13:59:27.462Z| vmx| I120: DISKLIB-LIB : Opened “/vmfs/volumes/53694ec4-999f3bd0-156a-9c8e9920e2b0/vmserver0260/vmserver0260_5-000002.vmdk” (flags 0xa, type vmfs).
2014-10-17T13:59:27.462Z| vmx| I120: DISK: Disk ‘/vmfs/volumes/53694ec4-999f3bd0-156a-9c8e9920e2b0/vmserver0260/vmserver0260_5-000002.vmdk’ has UUID ’60 00 c2 97 27 69 c9 e3-e4 0f 02 13 a1 6d 4c 6d’
2014-10-17T13:59:27.462Z| vmx| I120: DISK: OPEN ‘/vmfs/volumes/53694ec4-999f3bd0-156a-9c8e9920e2b0/vmserver0260/vmserver0260_5-000002.vmdk’ Geo (1697/255/63) BIOS Geo (0/0/0)
2014-10-17T13:59:31.456Z| Worker#0| I120: DISKLIB-CTK : Could not open tracking file. File open returned IO error 4.
2014-10-17T13:59:31.456Z| Worker#0| I120: DISKLIB-CTK : Could not open change tracking file “/vmfs/volumes/53694ec4-999f3bd0-156a-9c8e9920e2b0/vmserver0260/vmserver0260-000002-ctk.vmdk”: Could not open/create change tracking file.
2014-10-17T13:59:31.458Z| Worker#0| I120: DISKLIB-LIB : Could not open change tracker /vmfs/volumes/53694ec4-999f3bd0-156a-9c8e9920e2b0/vmserver0260/vmserver0260-000002-ctk.vmdk: Could not open/create change tracking file.
2014-10-17T13:59:31.461Z| Worker#0| I120: DISKLIB-VMFS : “/vmfs/volumes/53694ec4-999f3bd0-156a-9c8e9920e2b0/vmserver0260/vmserver0260-000002-delta.vmdk” : closed.
2014-10-17T13:59:31.462Z| Worker#0| I120: DISKLIB-VMFS : “/vmfs/volumes/53694ec4-999f3bd0-156a-9c8e9920e2b0/vmserver0260/vmserver0260-000001-delta.vmdk” : closed.
2014-10-17T13:59:31.464Z| Worker#0| I120: DISKLIB-VMFS : “/vmfs/volumes/53694ec4-999f3bd0-156a-9c8e9920e2b0/vmserver0260/vmserver0260-flat.vmdk” : closed.
2014-10-17T13:59:31.464Z| Worker#0| I120: DISKLIB-LIB : Failed to open ‘/vmfs/volumes/53694ec4-999f3bd0-156a-9c8e9920e2b0/vmserver0260/vmserver0260-000002.vmdk’ with flags 0xa Could not open/create change tracking file (2108).
2014-10-17T13:59:31.464Z| Worker#0| I120: DISK: Cannot open disk “/vmfs/volumes/53694ec4-999f3bd0-156a-9c8e9920e2b0/vmserver0260/vmserver0260-000002.vmdk”: Could not open/create change tracking file (2108).
2014-10-17T13:59:31.467Z| vmx| I120: DISK: Opening disks took 4051 ms.

This error is slightly different: this time the file in question is the change block tracking (ctk) file rather than the delta snapshot file. Let’s see if the vmkernel.log shows the same events:

vmkernel.log:

2014-10-17T13:59:27.454Z cpu18:16729934)DLX: 3733: vol ‘Non-Production-A-Datastore-VMFS5-4′, lock at 71542784: [Req mode 1] Checking liveness:
2014-10-17T13:59:27.454Z cpu18:16729934)[type 10c00001 offset 71542784 v 37061, hb offset 3190784
gen 3, mode 1, owner 53e0181c-c8a6435e-55b0-0017a4770c24 mtime 28048957
num 0 gblnum 0 gblgen 0 gblbrk 0]
2014-10-17T13:59:27.462Z cpu12:16729808)CBT: 2214: Created device 18605b78-cbt for cbt driver with filehandle 408968056
2014-10-17T13:59:27.462Z cpu12:16729808)WARNING: CBT: 2039: Unsupported ioctl 60
2014-10-17T13:59:27.462Z cpu12:16729808)WARNING: CBT: 2039: Unsupported ioctl 59
2014-10-17T13:59:31.456Z cpu18:16729934)DLX: 4262: vol ‘Non-Production-A-Datastore-VMFS5-4′, lock at 71542784: [Req mode: 1] Not free:
2014-10-17T13:59:31.456Z cpu18:16729934)[type 10c00001 offset 71542784 v 37061, hb offset 3190784
gen 3, mode 1, owner 53e0181c-c8a6435e-55b0-0017a4770c24 mtime 28048957
num 0 gblnum 0 gblgen 0 gblbrk 0]

When cross-checking the lock at address “71542784” on volume ‘Non-Production-A-Datastore-VMFS5-4’, we see that the file in question is ‘vmserver0260-000002-ctk.vmdk’.

Now we need to determine which ESX host owns the lock that will not free. See KB 10051 for more information: http://kb.vmware.com/kb/10051

grep "00:17:a4:77:0c:24" esx-esx0*/commands/nicinfo*
esx-esx0044.server.company.com-2014-10-21–17.31/commands/nicinfo.sh.txt:vmnic0 0000:002:00.0 elxnet Up 10000 Full 00:17:a4:77:0c:24 1500 Emulex Corporation HP NC553i Dual Port
FlexFabric 10Gb Converged Network Adapter
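
For completeness: if you are on a live host rather than working from bundles, you can read the vmk0 MAC address directly instead of grepping the nicinfo output. Either of these standard ESXi commands should show it:

esxcfg-vmknic -l
esxcli network ip interface list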

When we review the /var/log/vmkernel.log of host ‘esx0044.server.company.com’ we see a flood of lock holder checks on this host too:

2014-10-21T17:26:35.283Z cpu14:35211)DLX: 4262: vol ‘Non-Production-A-Datastore-VMFS5-17′, lock at 71235584: [Req mode: 1] Not free:
2014-10-21T17:26:35.283Z cpu14:35211)[type 10c00001 offset 71235584 v 301, hb offset 3907584
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 13817937
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-21T17:26:38.265Z cpu4:135822)DLX: 4262: vol ‘Non-Production-A-Datastore-VMFS5-15′, lock at 262238208: [Req mode: 1] Not free:
2014-10-21T17:26:38.265Z cpu4:135822)[type 10c00001 offset 262238208 v 9643, hb offset 3674112
gen 5, mode 1, owner 53e11584-d51063de-48a9-9c8e99227ee8 mtime 7918228
num 0 gblnum 0 gblgen 0 gblbrk 0]
2014-10-21T17:26:38.314Z cpu13:7072422)DLX: 4262: vol ‘Non-Production-A-Datastore-VMFS5-16′, lock at 130293760: [Req mode: 1] Not free:
2014-10-21T17:26:38.314Z cpu13:7072422)[type 10c00001 offset 130293760 v 10072, hb offset 3989504

2014-10-21T17:26:39.361Z cpu14:35211)DLX: 4262: vol 'Non-Production-A-Datastore-VMFS5-17', lock at 71241728: [Req mode: 1] Not free:
2014-10-21T17:26:39.361Z cpu14:35211)[type 10c00001 offset 71241728 v 363, hb offset 3907584
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 13817934
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-21T17:26:40.070Z cpu20:81812)World: 14296: VC opID hostd-4320 maps to vmkernel opID f2f9cb96
2014-10-21T17:26:42.283Z cpu4:135822)DLX: 4262: vol ‘Non-Production-A-Datastore-VMFS5-15′, lock at 262311936: [Req mode: 1] Not free:
2014-10-21T17:26:42.283Z cpu4:135822)[type 10c00001 offset 262311936 v 7749, hb offset 3674112
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 7917330
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-21T17:26:42.350Z cpu14:7072422)DLX: 4262: vol ‘Non-Production-A-Datastore-VMFS5-16′, lock at 130297856: [Req mode: 1] Not free:
2014-10-21T17:26:42.350Z cpu14:7072422)[type 10c00001 offset 130297856 v 10081, hb offset 3989504
gen 3, mode 1, owner 53e3b83d-2b1da026-14d6-0017a4770826 mtime 10085761
num 0 gblnum 0 gblgen 0 gblbrk 0]
2014-10-21T17:26:43.442Z cpu20:35211)DLX: 4262: vol ‘Non-Production-A-Datastore-VMFS5-17′, lock at 71247872: [Req mode: 1] Not free:
2014-10-21T17:26:43.442Z cpu20:35211)[type 10c00001 offset 71247872 v 330, hb offset 3190784
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 13817948
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-21T17:26:46.322Z cpu0:135822)DLX: 4262: vol ‘Non-Production-A-Datastore-VMFS5-15′, lock at 262320128: [Req mode: 1] Not free:
2014-10-21T17:26:46.322Z cpu0:135822)[type 10c00001 offset 262320128 v 7757, hb offset 3674112
gen 5, mode 2, owner 00000000-00000000-0000-000000000000 mtime 7917344
num 1 gblnum 0 gblgen 0 gblbrk 0]
2014-10-21T17:26:46.357Z cpu17:7072422)DLX: 4262: vol ‘Non-Production-A-Datastore-VMFS5-16′, lock at 129902592: [Req mode: 1] Not free:
2014-10-21T17:26:46.357Z cpu17:7072422)[type 10c00001 offset 129902592 v 10080, hb offset 3989504
gen 3, mode 1, owner 53e3b83d-2b1da026-14d6-0017a4770826 mtime 10085764
num 0 gblnum 0 gblgen 0 gblbrk 0]

Now we need to determine what process, or ‘world’, on this host is checking whether these locks are stale. When we check world ‘35211’ we learn that it is the ‘vpxa-worker’ process. Why is vpxa-worker trying to open files? What is vpxa-worker trying to do in general? Let’s take a look at the /var/log/vpxa.log file for more information:

2014-10-21T17:34:40.090Z [FFFD4B70 info 'commonvpxLro' opID=190a874-bf] [VpxLRO] — BEGIN task-internal-375725 — – vpxapi.VpxaService.searchDatastore — 04fe2fb8-71f9-c0a4-d42c-2812b96c7fb6
2014-10-21T17:34:40.090Z [FFFD4B70 verbose 'vpxavpxaMoService' opID=190a874-bf] SearchDatastore: ds:///vmfs/volumes/53f257d5-11293746-04b5-0017a477084e/ true [N3Vim4Host16DatastoreBrowser10SearchSpecE:0x1f3d72f8]
2014-10-21T17:34:40.090Z [FFFD4B70 verbose 'vpxavpxaDatastoreContext' opID=190a874-bf] [VpxaDatastoreContext] Resolved URL ds:///vmfs/volumes/53f257d5-11293746-04b5-0017a477084e/ to localPath /vmfs/volumes/53f257d5-11293746-04b5-0017a477084e/
2014-10-21T17:34:40.124Z [FFFD4B70 info 'vpxavpxaVmprov' opID=190a874-bf] [VpxaVmprovUtil::IsTopLevelObjectStorePath] path /vmfs/volumes/53f257d5-11293746-04b5-0017a477084e/ has fstype: 62768
2014-10-21T17:34:40.124Z [FFFD4B70 verbose 'vpxavpxaVmprovUtil' opID=190a874-bf] [VpxaVmprovUtil] UrlToDatastoreUrl conversion: ds:///vmfs/volumes/53f257d5-11293746-04b5-0017a477084e/ -> ds:///vmfs/volumes/53f257d5-11293746-04b5-0017a477084e/
2014-10-21T17:34:40.728Z [24983B70 info 'Libs' opID=1faf940d-66] OBJLIB-FILEBE : FileBEOpen: can’t open ‘/vmfs/volumes/53f25ab0-c16dcbc9-b38e-0017a477084e/fldcvisla0007/fldcvisla0007_1-flat.vmdk’ : Failed to lock the file (262146).
2014-10-21T17:34:40.728Z [24983B70 info 'DiskLib' opID=1faf940d-66] DISKLIB-DSCPTR: DescriptorOpenInt: failed to open ‘/vmfs/volumes/53f25ab0-c16dcbc9-b38e-0017a477084e/fldcvisla0007/fldcvisla0007_1-flat.vmdk’: Failed to lock the file (40002)
2014-10-21T17:34:40.728Z [24983B70 info 'DiskLib' opID=1faf940d-66] DISKLIB-LINK : “/vmfs/volumes/53f25ab0-c16dcbc9-b38e-0017a477084e/fldcvisla0007/fldcvisla0007_1-flat.vmdk” : failed to open (Failed to lock the file).
2014-10-21T17:34:40.731Z [24983B70 verbose 'Default' opID=1faf940d-66] [Search::GetInfo] Successfully opened disk: /vmfs/volumes/53f25ab0-c16dcbc9-b38e-0017a477084e/fldcvisla0007/fldcvisla0007_1.vmdk
2014-10-21T17:34:40.803Z [24983B70 info 'DiskLib' opID=1faf940d-66] DISKLIB-LINK : “/vmfs/volumes/53f25ab0-c16dcbc9-b38e-0017a477084e/fldcvisla0007/fldcvisla0007_1-ctk.vmdk” : failed to open (The file specified is not a virtual disk).

Why is the vpxa process kicking off ‘vpxapi.VpxaService.searchDatastore’ API requests? It is attempting to open a lot of different vmdk files, and those opens fail because the files are locked, most likely because the virtual machines are running on other hosts. What is also interesting is that it attempts to open change block tracking files as well, and it tries to open them as virtual machine disks even though they are not virtual disks at all, merely ctk files. Yet again, very odd behavior. At this point I suspect that these API calls are coming from a third-party piece of software and not from anything native to our product. Let’s check the vCenter logs for more information:

2014-05-20T20:45:40.137-04:00 [08928 warning 'VpxProfiler' opID=1ea7d388] ClientAdapterBase::InvokeOnSoap: (esx130.server.company.com, vpxapi.VpxaService.searchDatastore) [SoapRpcTime] took 89590 ms
2014-05-20T20:45:40.137-04:00 [08928 info 'vpxdvpxdVmomi' opID=1ea7d388] [ClientAdapterBase::InvokeOnSoap] Invoke done (esx130.server.company.com, vpxapi.VpxaService.searchDatastore)
2014-05-20T20:45:44.583-04:00 [09416 warning 'VpxProfiler' opID=43475cf] ClientAdapterBase::InvokeOnSoap: (esx130.server.company.com, vpxapi.VpxaService.searchDatastore) [SoapRpcTime] took 163645 ms
2014-05-20T20:45:44.583-04:00 [09416 info 'vpxdvpxdVmomi' opID=43475cf] [ClientAdapterBase::InvokeOnSoap] Invoke done (esx130.server.company.com, vpxapi.VpxaService.searchDatastore)
2014-05-20T20:45:48.982-04:00 [09424 warning 'VpxProfiler' opID=eba32ae4] ClientAdapterBase::InvokeOnSoap: (esx098.server.company.com, vpxapi.VpxaService.searchDatastore) [SoapRpcTime] took 129044 ms
2014-05-20T20:45:48.982-04:00 [09424 info 'vpxdvpxdVmomi' opID=eba32ae4] [ClientAdapterBase::InvokeOnSoap] Invoke done (esx098.server.company.com, vpxapi.VpxaService.searchDatastore)
2014-05-20T20:46:12.958-04:00 [07708 warning 'VpxProfiler' opID=632083eb] ClientAdapterBase::InvokeOnSoap: (esx098.server.company.com, vpxapi.VpxaService.searchDatastore) [SoapRpcTime] took 182271 ms
2014-05-20T20:46:12.958-04:00 [07708 info 'vpxdvpxdVmomi' opID=632083eb] [ClientAdapterBase::InvokeOnSoap] Invoke done (esx098.server.company.com, vpxapi.VpxaService.searchDatastore)
2014-05-20T20:46:18.465-04:00 [09412 warning 'VpxProfiler' opID=6d6d6500] ClientAdapterBase::InvokeOnSoap: (esx098.server.company.com, vpxapi.VpxaService.searchDatastore) [SoapRpcTime] took 193924 ms
2014-05-20T20:46:18.465-04:00 [09412 info 'vpxdvpxdVmomi' opID=6d6d6500] [ClientAdapterBase::InvokeOnSoap] Invoke done (esx098.server.company.com, vpxapi.VpxaService.searchDatastore)
2014-05-20T20:47:41.128-04:00 [08808 warning 'VpxProfiler' opID=ecaec22f] ClientAdapterBase::InvokeOnSoap: (esx130.server.company.com, vpxapi.VpxaService.searchDatastore) [SoapRpcTime] took 252143 ms
2014-05-20T20:47:41.128-04:00 [08808 info 'vpxdvpxdVmomi' opID=ecaec22f] [ClientAdapterBase::InvokeOnSoap] Invoke done (esx130.server.company.com, vpxapi.VpxaService.searchDatastore)
2014-05-20T20:48:28.925-04:00 [04168 warning 'VpxProfiler' opID=a5dad99f] ClientAdapterBase::InvokeOnSoap: (esx098.server.company.com, vpxapi.VpxaService.searchDatastore) [SoapRpcTime] took 251114 ms

We can see the runtimes of these API calls: 90, 163, 129, 182, 193, 252, and 251 seconds respectively. That is a really long time for a search, and if these searches hold vmdk or ctk files open for that long, the retry window of the VMM process that runs the VM expires and the VM powers off. Now we need to determine who is issuing these API calls.
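
Rather than eyeballing individual entries, you can pull every slow searchDatastore call out of the vCenter vpxd logs in one pass. The log file names below are an assumption; adjust them to where your vpxd logs live:

grep "SoapRpcTime" vpxd-*.log | grep "searchDatastore"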

Let’s review the vpxd-profiler logs on vCenter to figure out where these searches are originating from:

–> ThreadState/ThreadId/1668/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]52e5e352-3b0f-9f06-96b2-bc8a7d49d548::datastore-3262::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6
974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx098.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/3292/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]52b7e9ee-a68d-f05a-569b-469c8f782287::datastore-2840::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6
974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx130.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/3988/State/WAITING_VPX_MUTEX::EventManager::EventBatchAppender
–> ThreadState/ThreadId/5932/State/WAITING_VPX_MUTEX::QuickStatsResultQueue
–> ThreadState/ThreadId/7528/State/WAITING_VPX_MUTEX::QuickStatsResultQueue
–> ThreadState/ThreadId/7676/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]5213f5de-1ade-5f03-1dff-c052e1128069::datastore-2896::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx098.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/7712/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]52f10aba-19f4-effe-e45a-b01b7a46920a::datastore-2837::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx098.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/8296/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]521f2165-f9a1-5da3-3aee-602d3252451c::datastore-2883::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx098.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/8572/State/TASK::task-35026::::vim.DiagnosticManager.generateLogBundles::371e3640-395c-c5a7-595b-dc21b56ca1b7(52a30ca3-828e-2d13-9b67-46f2b645b194)
–> ThreadState/ThreadId/8588/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]529e290c-c0da-e02f-4c3c-73ecaa630531::datastore-2838::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx098.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/8680/State/WAITING_VPX_MUTEX::QuickStatsResultQueue
–> ThreadState/ThreadId/8688/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]526bfb1a-e489-050e-8b25-302accd06a9c::datastore-2897::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx130.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/8720/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]52289e72-a956-c1a3-24ac-c1e39193918e::datastore-2843::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx130.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/8760/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]52aebf6b-afda-4b80-e748-f2584b83c047::datastore-2839::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx130.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/8828/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]529e2bdf-686e-bf2e-5f38-abb50a9ffd6b::datastore-2841::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx130.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/8856/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]52cb5829-f64f-4a8e-4d22-202682b8e2ab::datastore-2844::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx130.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/9072/State/TASK::task-internal-2::::VpxdDrmMainLoop::/State/WAITING_VPX_MUTEX::VpxdDrmMainLoopLRO
–> ThreadState/ThreadId/9096/State/TASK::task-internal-1::::VpxdDrmMainLoop::/State/WAITING_VPX_MUTEX::VpxdDrmMainLoopLRO
–> ThreadState/ThreadId/9168/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]52987018-8a32-89de-8b86-44af61a2ab94::datastore-2842::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx130.server.company.com::vpxapi.VpxaService.searchDatastore
–> ThreadState/ThreadId/9204/State/WAITING_VPX_MUTEX::MapListLRU
–> ThreadState/ThreadId/9420/State/TASK::task-internal-622550::host-552::VpxdInvtHostSyncHostLRO.Synchronize::/State/WAITING_VPX_MUTEX::TaskInfoPublisherLock
–> ThreadState/ThreadId/9432/State/TASK::session[ca010e84-bd1f-0493-018c-65a5389452c8]52a3861b-23b2-4c92-0126-8faac627b359::datastore-2922::vim.host.DatastoreBrowser.searchSubFolders::ca010e84-bd1f-0493-018c-65a5389452c8(520b34e4-8a51-6974-c44b-a0c67f1913a8)/State/RPC::vpxapi.VpxaService:vpxa::esx098.server.company.com::vpxapi.VpxaService.searchDatastore

These searches are originating from session ID “ca010e84-bd1f-0493-018c-65a5389452c8”:

–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeFilterUpdatesTime/max 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeFilterUpdatesTime/mean 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeFilterUpdatesTime/min 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeFilterUpdatesTime/numSamples 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeFilterUpdatesTime/total 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeGUReqTime/max 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeGUReqTime/mean 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeGUReqTime/min 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeGUReqTime/numSamples 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeGUReqTime/total 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeProvidersTime/max 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeProvidersTime/mean 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeProvidersTime/min 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeProvidersTime/numSamples 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/ComputeProvidersTime/total 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/SetDoneTime/max 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/SetDoneTime/mean 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/SetDoneTime/min 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/SetDoneTime/numSamples 0
–> /SessionStats/SessionPool/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/PropertyCollector/SetDoneTime/total 0

–> /SessionStats/SessionPool/Session/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/ClientIP=’10.1.1.4′/HttpSessionObject/Hidden/total 0
–> /SessionStats/SessionPool/Session/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/ClientIP=’10.1.1.4′/PropertyCollector/FilterCount/total 0
–> /SessionStats/SessionPool/Session/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/ClientIP=’10.1.1.4′/PropertyCollector/FullScans/total 0
–> /SessionStats/SessionPool/Session/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/ClientIP=’10.1.1.4′/PropertyCollector/IncrementalScans/total 0
–> /SessionStats/SessionPool/Session/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/ClientIP=’10.1.1.4′/PropertyCollector/LastVersionNum/total 0
–> /SessionStats/SessionPool/Session/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/ClientIP=’10.1.1.4′/PropertyCollector/NullCollectorCount/total 0
–> /SessionStats/SessionPool/Session/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/ClientIP=’10.1.1.4′/PropertyCollector/QueuedOpsCount/total 0
–> /SessionStats/SessionPool/Session/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/ClientIP=’10.1.1.4′/PropertyCollector/ReadLocked/total 0
–> /SessionStats/SessionPool/Session/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/ClientIP=’10.1.1.4′/PropertyCollector/TotalObjectCount/total 0
–> /SessionStats/SessionPool/Session/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/ClientIP=’10.1.1.4′/PropertyCollector/TriggeredFiltersCount/total 0
–> /SessionStats/SessionPool/Session/Id=’ca010e84-bd1f-0493-018c-65a5389452c8′/Username=’WDW\f-hpsitescope’/ClientIP=’10.1.1.4′/PropertyCollector/TriggeredProcessGUReqs/total 0

These searches originate from user ‘f-hpsitescope’ at IP 10.1.1.4, which is effectively hammering vCenter with constant vpxapi.VpxaService.searchDatastore API requests. The next step would be to locate the server with this IP address and determine which piece of software is generating so many requests. The username suggests that HP SiteScope is behind them. This is not necessarily a problem with that product itself; it may simply be how the product is configured. A useful troubleshooting step would be to disable it temporarily and see if the behavior stops.
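
If you are not sure which session is the noisy one in your own environment, a quick tally of the session statistics in the vpxd-profiler logs will usually surface it. A sketch, with assumed file names, run on the vCenter server or against the collected bundle:

grep -o "Username='[^']*'/ClientIP='[^']*'" vpxd-profiler-*.log | sort | uniq -c | sort -rn | head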

I hope this demonstration gives you an end-to-end approach for investigating these kinds of problems on your own.

Downgrading the hardware version of a VMware Fusion virtual machine

In this new video tutorial we demonstrate how to downgrade the virtual hardware version of a VMware Fusion virtual machine.

This process is needed if, for example, you were evaluating VMware Fusion 7.x and, after the evaluation period expired, wanted to go back to VMware Fusion 6.

For additional information see VMware Knowledge Base article Downgrading VMware Fusion 7.x to 6.x (2082112).

PCoIP issues with Horizon View Desktops

There are various reasons a user might experience a blank (black) screen when using the PCoIP protocol with Horizon View and our Support Engineers get calls on this every day. These include, but are not limited to:

  • Misconfiguration of connection server settings.
  • vRAM shortage on the View virtual machine.
  • Incorrect video driver version installed on the View virtual machine.

Note: Troubleshoot your PCoIP issues in the proper order

It is important that the steps outlined in KB article Troubleshooting a black screen when logging into a Horizon View virtual desktop using PCoIP (1028332) be followed in the order provided to quickly isolate and identify the proper resolution. They are also ordered in the most appropriate sequence to minimize data loss. The KB walks you through a series of checks, verifying that various requirements are met and that your systems are configured properly. You will see many articles referenced as you work through each step. Do not skip steps. A slow, methodical approach works best here.

If you still see a problem after working through the body of the KB article, there are also a number of related articles listed in the ‘See Also’ section. If you want to be notified when the article is updated, there is an RSS link you can subscribe to.

Announcement!

Many of you who read this blog also follow our social media accounts @vmwarekb, @vmwarecares, Facebook, and others. When we set out to engage with our customers online we wanted to be both proactive and reactive about it — that is, we would push out content like newly published KB articles, and we would answer customer questions. This has worked really well for us over the years, and we thank all of you for helping make the ecosystem richer and a better place to work.

On the flip side, the number of channels tended to grow as new social networks came online. And grow. Eventually we ended up confusing our customers as to which channel was for which purpose. We’re fixing that.

To stop this channel creep, over the next few weeks we will be consolidating some channels as follows:

Twitter:

Change: @vmwarekb is being retired. @vmwarecares will be the single voice going forward. We’re going to do this by renaming the vmwarekb account (as it has far more followers) to vmwarecares.

Action: If you currently follow vmwarekb you don’t need to do anything. You will still be subscribed to us, but the name will now be vmwarecares. If you follow vmwarecares currently you will need to re-follow us.

Facebook:

Change: VMware Knowledge Base is being retired and will go dormant. VMware Cares will be the single page going forward.

Action: Please ‘Like’ our VMware Cares page.

By consolidating channels we hope to simplify and clarify the contact options to serve you better.

Top 20 Articles for September 2014

Here is our Top 20 KB list for September 2014. This list is ranked by the number of times a VMware Support Request was resolved by following the steps in a published Knowledge Base article.

1. VMware ESXi 5.x host experiences a purple diagnostic screen mentioning E1000PollRxRing and E1000DevRx (2059053)

2. VMware assessment of Bash Code Injection Vulnerability via Specially Crafted Environment Variables (CVE-2014-6271, CVE-2014-7169, CVE-2014-7186, CVE-2014-7187, CVE-2014-6277, CVE-2014-6278, aka “Shellshock”) (2090740)

3. Installing async drivers on VMware ESXi 5.0, 5.1, and 5.5 (2005205)

4. An ESXi 5.x host running on HP server fails with a purple diagnostic screen and the error: hpsa_update_scsi_devices or detect_controller_lockup_thread (2075978)

5. Purging old data from the database used by VMware vCenter Server 4.x and 5.x (1025914)

6. Manually deleting linked clones or stale virtual desktop entries from the View Composer database in VMware View Manager and Horizon View (2015112)

7. ESXi host cannot initiate vMotion or enable services and reports the error: Heap globalCartel-1 already at its maximum size.Cannot expand (2085618)

8. Re-pointing and re-registering VMware vCenter Server 5.1 / 5.5 and components (2033620)

9. Upgrading to vCenter Server 5.5 best practices (2053132)

10. Unmounting a LUN or detaching a datastore/storage device from multiple VMware ESXi 5.x hosts (2004605)

11. Determining Network/Storage firmware and driver version in ESXi/ESX 4.x and ESXi 5.x (1027206)

12. Installing Windows in a virtual machine using VMware Fusion Easy Install (1011677)

13. Investigating virtual machine file locks on ESXi/ESX (10051)

14. Restarting the Management agents on an ESXi or ESX host (1003490)

15. Troubleshooting Fusion virtual machine performance issues (1015676)

16. Installing VMware Tools in a Fusion virtual machine running Windows (1003417)

17. Resetting the VMware vCenter Server 5.x Inventory Service database (2042200)

18. The Host IPMI System Event Log Status alarm is triggered repeatedly in VMware vCenter Server 4.x and 5.x (1033725)

19. Installing or upgrading to ESXi 5.5 best practices (2052329)

20. Installing vCenter Server 5.5 best practices (2052334)

ALERT: Bash Code Injection Vulnerability aka Shellshock

On Sept 24, 2014, a critical vulnerability in Bash (CVE-2014-6271, CVE-2014-7169) was published that may allow for remote code execution. The VMware Security Engineering, Communications, and Response group (vSECR) has been actively investigating the impact this vulnerability may have on our products.

For further information and updates on this vulnerability, refer to KB article VMware assessment of Bash Code Injection Vulnerability via Specially Crafted Environment Variables (CVE-2014-6271, CVE-2014-7169, aka “Shellshock”) (2090740).

Note: For information regarding VMware customer portals and web sites, see Impact of bash code injection vulnerability on VMware Customer Portals and web sites (CVE-2014-6271 and CVE-2014-7169, aka “shellshock”) (2090817).

Generating and Troubleshooting SSL certificates in View

Next up in our series of VMware View topics, we’re going to talk about security. I spoke with a couple of our top support engineers about View security, and they identified three Knowledge Base articles that solve more support requests than any others in the area of security, namely SSL certificates. They recommend customers use:

In View 5.1 and later, you configure certificates for View by importing the certificates into the Windows local computer certificate store on the View server host. By default, clients are presented with this certificate when they visit a secure page such as View Administrator. You can use the default certificate for lab environments, and one could even make the argument that it is OK for firewalled environments, but otherwise you should replace it with your own certificate from a trusted CA (VeriSign, GoDaddy, others) as soon as possible. They also told me you should use an SSL certificate from a trusted CA when setting up a Security Server for your environment if the Security Server will be accessed from outside your firewall (the Internet) to reach View desktops inside your firewall.

My engineers stressed to me the importance of following each step in these KBs one at a time when you are filling out the forms on those sites to obtain your certificate. It is easy to make a mistake, and you might not receive a certificate that will work for you.

Note: The default certificate is not signed by a commercial Certificate Authority (CA). Use of noncertified certificates can allow untrusted parties to intercept traffic by masquerading as your server.


Three Cool Features in VMware Fusion 7 Pro

Today we have three brand new videos which should be of particular interest to our VMware Fusion customers. These tutorials showcase some of the new features of VMware Fusion 7 Pro, and are based on the following Knowledge Base articles:

Visit these links to see the full step-by-step instructions and try out the new features yourselves!

When Linked Clones Go Stale

One of the biggest call drivers within our VMware View support centers revolves around linked clone pools. Some of your users may be calling you to report that their desktop is not available. You begin to check your vCenter and View Administrator portal and discover some of the following symptoms:

  • You cannot provision or recompose a linked clone desktop pool
  • You see the error:
    Desktop Composer Fault: Virtual Machine with Input Specification already exists
  • Provisioning a linked clone desktop pool fails with the error:
    Virtual machine with Input Specification already exists
  • The Connection Server shows that linked clone virtual machines are stuck in a deleting state
  • You cannot delete a pool from the View Administrator page
  • You are unable to delete linked clone virtual machines
  • When viewing a pool’s Inventory tab, the status of one or more virtual machines may be shown as missing

There are a number of reasons this might happen, and KB 2015112, Manually deleting linked clones or stale virtual desktop entries from the View Composer database in VMware View Manager and Horizon View, covers the resolution comprehensively, but let’s discuss a bit of the background around these issues.

When a linked clone pool is created or modified, several backend databases are updated with configuration data. First there is the SQL database supporting vCenter Server, next there is the View Composer database, and thirdly the ADAM database. Let’s also throw in Active Directory for good measure. With all of these pieces each playing a vital role in the environment, it becomes apparent that should things go wrong, there may be an inconsistency created between these databases. These inconsistencies can present themselves with the above symptoms.

Recently a new Fling was created to address these inconsistencies. If you’re not acquainted with Flings, they’re tools our engineers build to help you explore and manipulate your systems. However, it’s important to remember they come with a disclaimer:

“I have read and agree to the Technical Preview Agreement. I also understand that Flings are experimental and should not be run on production systems.”

If you’re just in your lab environment though, they are an excellent way to learn and understand the workings of your systems at a deeper level. Here is the Fling: ViewDbChk. For production systems we recommend following the tried and true procedures documented in KB 2015112. The KB includes embedded videos to help walk you through the steps.

How to upgrade from VMware Fusion 6.x to VMware Fusion 7.x

Today we have a new video tutorial which shows how to upgrade from VMware Fusion 6.x to VMware Fusion 7.x.

Note: Some things to be aware of before you attempt your upgrade

  • Ensure you shut down any virtual machine before performing the upgrade. If any virtual machines are in a suspended state, resume them from the suspended state and perform a shutdown operation from the guest operating system. Ensure that their status reads Powered Off.
  • Upgrading VMware Fusion 6.x to 7.x does not affect the contents of your virtual machines. All virtual machines and the data contained in them are safely retained.
  • Installing VMware Fusion 7.x automatically removes VMware Fusion 6.x from your Mac.

Complete upgrading steps can be found in VMware Knowledge Base article: Upgrading from VMware Fusion 6.x to 7.x (2081993).