Virtualizing XenApp on XenServer 5.0 and ESX 3.5

There has always been interest in running Citrix XenApp (formerly Citrix Presentation Server) workloads on the VMware Virtual Infrastructure platform. With the advent of multi-core systems, purchasing decisions are driven towards systems with 4-16 cores. However, using this hardware effectively is difficult due to limited scaling of the XenApp application environment. In addition to the usual benefits of virtualization, these scaling issues make running XenApp environments on ESX even more compelling.

We recently ran some performance tests to understand what can be expected in terms of performance for a virtualized XenApp workload. The results show that ESX runs common desktop applications on XenApp with reasonable overhead compared to a native installation, and with significantly better performance than XenServer. We hope this data will help provide guidance when XenApp environments are transitioned from physical hardware to a virtualized environment.

Together with partners, we have been developing a desktop workload for over a year. The workload has been tested extensively on virtual desktop infrastructure (VDI) environments with one user per virtual machine (VM). VDI results have been presented and published in numerous locations (e.g., http://www.vmware.com/resources/techresources/1085, VMworld 2008 presentation VD2505 with Dell-EqualLogic). Great attention was paid to selecting the most relevant applications as well as to specifying the right types and amount of work each should do. Many other Terminal Services-style benchmarks fail to be representative of actual desktop users. Porting the workload from a VDI environment to the XenApp environment was straightforward.

XenApp was run in a single 2-vCPU VM with 14 GB of memory, booted with Windows Server 2003 x64. The hypervisors used were ESX 3.5 U3 and XenServer 5. The VMs for both had the appropriate tools/drivers installed. The XenServer VM had the Citrix XenApp optimization enabled. For comparison, the tests were also run natively with the OS restricted to the same hardware resources. The hardware was an HP DL585 with 4 quad-core 2210 MHz “Barcelona” processors and 64 GB of memory. Rapid Virtualization Indexing (RVI) was enabled.

The test consists of 22 operations, always executed in the following order:

IE_OPEN_2	Open Internet Explorer
IE_ALBUM	Browse photos in IE
EXCEL_OPEN_2	Open Excel file
EXCEL_FORMULA	Evaluate formula in Excel
EXCEL_SAVE_2	Save Excel file
FIREFOX_OPEN	Open Firefox
FIREFOX_CLOSE	Close Firefox
ACROBAT_OPEN_1	Open PDF file
ACROBAT_BROWSE_1	Browse PDF file
PPT_OPEN	Open PowerPoint file
PPT_SLIDESHOW	Slideshow in PowerPoint
PPT_EDIT	Edit PowerPoint file
PPT_APPEND	Append to PowerPoint file
PPT_SAVE	Save PowerPoint file
WORD_OPEN_1	Open Word file
WORD_MODIFY_1	Modify Word file
WORD_SAVE_1	Save Word file
IE_OPEN_1	Open Internet Explorer
IE_APACHE	Browse Apache doc in IE
EXCEL_OPEN_1	Open Excel file
EXCEL_SORT	Sort column in Excel
EXCEL_SAVE_1	Save Excel file

A “sleep” of random length is inserted between operations to simulate user think time. One execution of the whole set of operations is called an “iteration” and takes about 57 minutes. Several of these operations consist of many sub-operations. For instance, the PPT_SLIDESHOW operation consists of 10 sub-operations, each of which displays a slide in a PowerPoint document and then pauses. Only the latency to display the slide is timed, not the time spent sleeping. The latencies of the sub-operations are summed to give the operation latency, and all the operation latencies within one iteration of one user are summed to yield the “total latency”.

AutoIt3, a freely available scripting language, is used on the server side to automate the operations. CSTK Client Launcher (a utility that allows the tester to create and launch multiple ICA client sessions) is used on a client machine to start the users (sessions). Users are started in a staggered fashion so that the last user is starting when the first user is close to finishing its first iteration. This strategy avoids synchronizing the execution of any operation across users. Each user runs six iterations. The “average total latency” is determined by averaging the total latencies across the middle four iterations (i.e., the ones where all users are running at steady state) and across all users.

Note that it is important to time many different kinds of desktop applications: timing just a few operations (or even just one, as has been done in other publications) can give a very distorted view of overall performance. With a similar philosophy, we gather CPU data over nearly four hours of steady state to ensure the utilization statistics are solid. The first figure shows the average total latency as a function of the number of users for XenServer, ESX, and Native.
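The latency metric described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the harness used for the tests: the data layout (one dictionary per iteration, mapping an operation name to its list of sub-operation latencies), the function names, and the sample numbers are all assumptions made for the example.

```python
from statistics import mean

def total_latency(iteration):
    """Sum sub-operation latencies per operation, then sum across operations."""
    return sum(sum(sub_latencies) for sub_latencies in iteration.values())

def average_total_latency(users, steady=slice(1, 5)):
    """Average the total latency over the steady-state iterations of all users.

    `users` holds one list of iterations per user. With six iterations per
    user, slice(1, 5) selects the middle four (the second through fifth).
    """
    totals = [total_latency(it) for iterations in users for it in iterations[steady]]
    return mean(totals)

# Hypothetical latencies in seconds, for illustration only.
steady_iter = {"PPT_SLIDESHOW": [1.0] * 10, "WORD_OPEN_1": [2.0]}
ramp_iter = {"PPT_SLIDESHOW": [3.0] * 10, "WORD_OPEN_1": [5.0]}
one_user = [ramp_iter] + [steady_iter] * 4 + [ramp_iter]

print(average_total_latency([one_user, one_user]))
```

Because the first and last iterations are excluded, the slower ramp-up and ramp-down values in this made-up data do not affect the result.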

The two horizontal lines labeled “QoS” denote the Native latency for 35 and 38 users. Either may be considered a reasonable maximum latency for acceptable Quality of Service. They correspond to somewhat less than and somewhat more than half of the available CPU resources, respectively (see the CPU figure below), which is a commonly used target for XenApp. At higher utilizations, not only does the latency increase rapidly, but operations may start to fail. We required that all operations succeed (just like a real user expects!) for a test to be deemed successful. The points where the QoS lines cross the ESX and XenServer curves give the number of users that can be supported at the same total latency. Normalizing by the number of Native users (35 or 38) gives the fraction of Native users each virtualization product can support at the given total latency:
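One way to compute where a QoS line crosses a latency curve is to interpolate between the measured points. The sketch below assumes linear interpolation between adjacent measurements and uses made-up numbers; the function name, the data points, and the native user count are hypothetical and are not the measured results.

```python
def users_at_latency(points, target):
    """Return the interpolated user count at which a latency curve reaches `target`.

    `points` is a list of (users, average_total_latency) pairs sorted by user
    count. Linear interpolation between neighbors is an assumption here.
    """
    for (u0, l0), (u1, l1) in zip(points, points[1:]):
        if l0 <= target <= l1 and l1 != l0:
            return u0 + (target - l0) * (u1 - u0) / (l1 - l0)
    raise ValueError("target latency is outside the measured range")

# Hypothetical curve for one hypervisor: (users, latency in seconds).
curve = [(20, 40.0), (30, 50.0), (40, 70.0)]
qos_latency = 60.0   # hypothetical Native latency at the QoS point
native_users = 40    # hypothetical Native user count at that latency

supported = users_at_latency(curve, qos_latency)
fraction_of_native = supported / native_users
print(supported, fraction_of_native)
```

Dividing the interpolated user count by the Native user count at the same latency gives the normalized fraction described above.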

ESX consistently supports about 86% of the native number of users, while XenServer supports about 77%. Shown below is the average CPU utilization during the second through fifth iterations of the last user, given as a percentage of a single core. Perfmon was used for Native, esxtop for ESX, and xentop for XenServer. ESX uses less CPU than XenServer no matter how the comparison is made, whether for a given number of users or for a given total latency:

XenApp and other products that virtualize applications are prime candidates to be run in a VM. These results show that ESX can do so efficiently compared to using a physical machine. This was shown with a benchmark that represents a real desktop workload, uses a metric that includes the latencies of all operations, and requires that all operations complete successfully. Furthermore, ESX supports about 13% more users than XenServer at a given latency while using less CPU.