
No More Crying Wolf: Troubleshooting Performance Issues in a VMware Horizon View Virtual Desktop Implementation

By Cindy Heyer, Contract Technical Writer, Technical Marketing, End-User Computing at VMware; with significant contributions from David Wooten, Product Manager for vCenter Operations Manager for Horizon View, VMware

Nothing is more deadly to a network administrator than a boy who cries wolf. He’s the kid who gets bored tending the family sheep and thinks it’s funny to yell, “Wolf! Wolf!” Then he gets his kicks watching people fall all over themselves to save a flock of sheep that isn’t in any danger. Crying wolf is a kind of “false positive,” similar to a car alarm that the slightest gust of wind can set off. False positives are bad news because when people hear an alarm too often, they learn to dismiss it. Then, when a real wolf shows up, they ignore the cries for help.


False positives are one of the biggest concerns for network administrators and help desk personnel who monitor connected-user performance in large-scale Horizon View deployments. What these administrators need is a monitoring tool that sounds the alarm when there’s a serious problem but doesn’t cry wolf. The thing is, lots of tools capture raw data and alert on events large and small. VMware vCenter Operations Manager for Horizon View (vC Ops for View), by contrast, squeezes every last bit of operational value out of that data and tells you which important problems need your attention. vC Ops for View is an extension of Enterprise vCenter Operations Manager and is uniquely equipped to address performance issues in Horizon View virtual desktop deployments.

The use of dynamic thresholds is one feature that sets VMware vCenter Operations Manager apart. Dynamic thresholds are derived from baseline behavior. vCenter Operations Manager continuously monitors its resources, tracks any behavior that deviates from these baselines, and applies self-learning dynamic thresholds that adapt to changing conditions without triggering unnecessary alarms. Most traditional monitoring tools rely on hard thresholds and react to every spike or transient by firing an alert, which the administrator then has to chase down. Dynamic thresholds can significantly reduce the number of these false-positive alerts, and fewer false positives means far less time spent troubleshooting problems that don’t exist.
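
To make the difference concrete, here is a minimal sketch comparing a hard threshold with a baseline-derived one. The vC Ops analytics engine is proprietary and considerably more sophisticated (it learns patterns over time), so the simple “mean plus or minus three standard deviations” rule below is an assumed stand-in, and the CPU samples are invented for illustration.

```python
from statistics import mean, stdev

def static_alerts(samples, limit):
    """Indices of samples that exceed a fixed (hard) threshold."""
    return [i for i, v in enumerate(samples) if v > limit]

def dynamic_alerts(history, samples, k=3.0):
    """Indices of samples outside mean +/- k*stdev of a learned baseline.

    Assumption: a plain statistical band stands in for vC Ops' own
    (non-public) dynamic-threshold analytics.
    """
    mu, sigma = mean(history), stdev(history)
    low, high = mu - k * sigma, mu + k * sigma
    return [i for i, v in enumerate(samples) if v < low or v > high]

# Hypothetical desktop whose CPU normally hovers around 85 percent.
history = [84, 86, 85, 87, 83, 85, 86, 84, 85, 86]
today = [85, 86, 84, 99, 85]

print(static_alerts(today, limit=80))  # [0, 1, 2, 3, 4]
print(dynamic_alerts(history, today))  # [3]
```

The hard 80 percent limit fires on every sample from a desktop that routinely runs hot, which is exactly the crying-wolf pattern; the learned band flags only the one reading that departs from normal behavior.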

Administrators can take a proactive approach and correct issues before they degrade end-user performance. Root-cause ranking provides details on exceptions to the expected ranges of operation. These expected ranges, or dynamic thresholds, are created by the vCenter Operations Manager advanced analytics engine and are adjusted continuously.
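
The idea behind ranking exceptions can be sketched as ordering out-of-range metrics by how far they stray from their expected range. This is an illustrative assumption, not the vC Ops ranking algorithm: the `MetricReading` class, the normalization by range width, and all of the readings are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MetricReading:
    name: str
    value: float
    expected_low: float   # lower edge of the learned (dynamic) range
    expected_high: float  # upper edge of the learned (dynamic) range

    def deviation(self) -> float:
        """Distance outside the expected range, in range-widths (0 if inside)."""
        # Guard against a zero-width range to avoid dividing by zero.
        width = max(self.expected_high - self.expected_low, 1e-9)
        if self.value > self.expected_high:
            return (self.value - self.expected_high) / width
        if self.value < self.expected_low:
            return (self.expected_low - self.value) / width
        return 0.0

def rank_root_causes(readings):
    """Keep only out-of-range metrics, largest deviation first (hypothetical scoring)."""
    anomalies = [r for r in readings if r.deviation() > 0]
    return sorted(anomalies, key=lambda r: r.deviation(), reverse=True)

readings = [
    MetricReading("PCoIP Transmit Packet Loss Percent", 6.0, 0.0, 1.0),
    MetricReading("CPU Usage Percent", 70.0, 40.0, 80.0),
    MetricReading("Memory Usage Percent", 95.0, 50.0, 85.0),
]
for r in rank_root_causes(readings):
    print(f"{r.name}: {r.deviation():.2f}")
# PCoIP Transmit Packet Loss Percent: 5.00
# Memory Usage Percent: 0.29
```

Normalizing by range width lets metrics with very different units (percent loss versus percent memory) be compared on one scale; a metric inside its range, like CPU here, drops out of the ranking entirely.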

VMware vCenter Operations Manager for Horizon View lets administrators quickly diagnose problems before they affect the end-user Horizon View experience. Take a look at the following screenshot, where multiple issues are bombarding Windows 7 desktop users connected to Horizon View.

[Screenshot: vC Ops for View heat map of Horizon View desktops]

In this window, the administrator selects a Horizon View desktop from the heat map, which reveals that the VM Overall and VM Memory object metrics in the Object Metrics section are at 100 percent.

[Screenshot: Object Metrics section showing VM Overall and VM Memory at 100 percent]

Fortunately, the Parent Resources, which include the ESXi host and the Horizon View pool where the desktop is hosted, are clearly performing within normal limits.

[Screenshot: Parent Resources (ESXi host and Horizon View pool) within normal limits]

Looking at the Child Resources, the administrator notices that a datastore is not performing within normal limits. Further analysis could reveal that the virtual machine workload requires more memory and possibly more CPU. Alternatively, the datastore hosting these virtual desktops could be oversubscribed, which would cause the virtual machines to exhibit a higher-than-normal workload.
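
The drill-down above (check the desktop’s parents, then its children) can be sketched as a walk over related objects. The `Resource` class and every object name below are hypothetical; vC Ops builds these relationships from its own inventory, and this sketch only mirrors the walkthrough’s outcome of healthy parents and an out-of-range datastore.

```python
from dataclasses import dataclass, field

@dataclass
class Resource:
    """A monitored object and whether it is within its normal (dynamic) range."""
    name: str
    within_normal: bool
    parents: list = field(default_factory=list)   # e.g. ESXi host, desktop pool
    children: list = field(default_factory=list)  # e.g. datastore

def related_problems(obj: Resource) -> dict:
    """List the parent and child resources of obj that are outside normal limits."""
    return {
        "parents_out_of_range": [p.name for p in obj.parents if not p.within_normal],
        "children_out_of_range": [c.name for c in obj.children if not c.within_normal],
    }

# Hypothetical inventory mirroring the walkthrough.
desktop = Resource(
    "win7-desktop-042",
    within_normal=False,
    parents=[Resource("esxi-host-01", True), Resource("win7-pool", True)],
    children=[Resource("ds-vdi-01", False)],
)
print(related_problems(desktop))
# {'parents_out_of_range': [], 'children_out_of_range': ['ds-vdi-01']}
```

Ruling out the parents first narrows the search: if the host and pool are healthy, attention shifts to the datastore and the desktop’s own sizing.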

[Screenshot: Child Resources, with a datastore outside normal limits]

The administrator returns to the View Main dashboard and, in the View Alerts area, sees that alerts have fired for several connected Horizon View users. (Some metrics can also be collected for users who are logged in but disconnected, but this example covers connected sessions only.)

[Screenshot: View Main dashboard with the View Alerts area]

Clicking the alert displays the reason for the alert, as well as the resource exhibiting the problem. Mousing over the More balloon reveals details about the resource, including the type of adapter collecting the data, the collection interval, and the type of machine.

[Screenshot: alert details with the More balloon]

The Impact section of the Alert Summary window displays the current health and time. The health score for this connection is currently 25 percent, which indicates that the connected user is experiencing degraded performance.

[Screenshot: Alert Summary window, Impact section]

The health score is a high-level indicator, similar to the Check Engine light in your car. It acts as a green or red flag: at a glance, you can tell whether all is well or whether it is time to dig deeper.
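
As a rough illustration of reading a health score as a flag, the function below maps a 0 to 100 score onto traffic-light colors. The band boundaries are illustrative assumptions chosen for this sketch, not values confirmed by this post; check the documentation for your vC Ops version before relying on specific cutoffs.

```python
def health_flag(score: int) -> str:
    """Map a 0-100 health score to a traffic-light flag.

    Assumption: the 75/50/25 cutoffs are illustrative, not product-confirmed.
    """
    if not 0 <= score <= 100:
        raise ValueError("health score must be between 0 and 100")
    if score > 75:
        return "green"   # all is well
    if score > 50:
        return "yellow"  # worth keeping an eye on
    if score > 25:
        return "orange"  # degraded
    return "red"         # time to dig deeper

print(health_flag(25))  # red
```

Under these assumed cutoffs, the 25 percent connection from the Alert Summary lands squarely in the red band.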

Clicking the Troubleshoot button displays the Resource Detail window. In this example, the root-cause ranking for the connected user indicates that transmit packet loss is occurring in the Horizon View PCoIP network.

Root-cause analysis is a method of problem resolution that starts by identifying the underlying factors that contributed to the problem. It rests on the assumption that correcting root causes is usually more effective than treating symptoms. You can try to bring down a person’s fever, for example, but it will keep coming back until you clear up the infection causing it.

One root cause of a low connection health score can be high latency, otherwise known as noise, which can be caused by high packet loss and retry rates. Another root cause can be high resource use combined with high wait times. A high CPU usage rate that is within dynamic thresholds does not necessarily spell a problem. However, high CPU usage combined with high resource wait times, such as waits for CPU or I/O, may indicate an overloaded resource.
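
The “usage plus wait” reasoning can be expressed as a simple rule. This is a hedged sketch: the function name and the 80 percent and 10 ms limits are hypothetical placeholders, and in practice you would compare against each metric’s dynamic threshold rather than fixed numbers.

```python
def likely_overloaded(usage_pct: float, wait_ms: float,
                      usage_limit: float = 80.0, wait_limit_ms: float = 10.0) -> bool:
    """Flag a resource only when high usage coincides with high wait time.

    High usage alone can be normal; high usage plus long waits for CPU or
    I/O suggests the resource cannot keep up with demand.
    Assumption: both limits are illustrative, not vC Ops defaults.
    """
    return usage_pct >= usage_limit and wait_ms >= wait_limit_ms

print(likely_overloaded(92.0, 3.0))   # False: busy, but work is not queuing
print(likely_overloaded(92.0, 25.0))  # True: busy and work is waiting
```

Requiring both conditions is what keeps a merely busy desktop from crying wolf.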

[Screenshot: Resource Detail window with root-cause ranking]

Next, the administrator turns to the Metric Selector. The yellow masking indicates an anomaly in the PCoIP metrics. Expanding the metric shows that Transmit Packet Loss Percent is outside its expected range of behavior.

Once you see how effectively vCenter Operations Manager for Horizon View troubleshoots connected-user performance in a VMware Horizon View deployment, you’ll want to know more. See the vC Ops for View Deployment Guide for details about the product and its main components, as well as additional use cases where the tool helps with monitoring and troubleshooting.

Note: The new vC Ops for View Deployment Guide describes scenarios you typically encounter as you maintain and monitor your Horizon View deployment. It shares helpful tips for troubleshooting your Horizon View instance, getting the most out of vC Ops for View, and using it to monitor and troubleshoot your network infrastructure. Future versions of the guide will cover additional use cases. If you want to suggest topics, comment on this blog post.


To tweet about this blog post, contact the VMware End-User Computing Solutions Management and Technical Marketing team at twitter.com/vmwareeucsmtm.