
Scale-out of XenApp on ESX 3.5

In an earlier posting (Virtualizing XenApp on XenServer 5.0 and ESX 3.5) we looked at the performance of virtualizing a Citrix XenApp workload in a 2-vCPU VM in comparison to the native OS booted with two cores. This provided valuable data about the single-VM performance of XenApp running on ESX 3.5. In our next set of experiments we used the same workload, and the same hardware, but scaled out to 8 VMs. This is compared to the native OS booted with all 16 cores. We found that ESX has near-linear scaling as the number of VMs is increased, and that aggregate performance with 8 VMs is much better than native.

We expected the earlier single-VM approach to produce representative results because of the excellent scale-out performance of ESX 3.5. This is especially true on NUMA architectures, where VMs running on different nodes are nearly independent in terms of CPU and memory resources. However, the same cannot be said for the scale-up (SMP scaling) performance of a single native machine or a single VM. As with many other applications, virtualizing many relatively small XenApp servers on a single machine can overcome the inherent SMP performance limitations of XenApp running directly on that machine.

In the current experiments, each VM is configured the same as before, except the allocated memory is set to 6700 MB (the amount needed to run 30 users). Windows 2003 x64 was used both in the VMs and natively. See the above posting for more workload and configuration details. Shown below is the average aggregate latency as a function of the total number of users. Every data point shown is a separate run with about 4 hours of steady-state execution. Each user performs six iterations, and a complete set of 22 workload operations is performed during each iteration. The latencies of these operations are summed to get the aggregate latency, which is then averaged over the middle four iterations, all the users, and all the VMs.
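To make the metric concrete, here is a minimal sketch of that aggregation, assuming the per-operation latencies have already been collected into a nested list (VM → user → iteration → operation). The data layout and function name are ours for illustration, not part of the benchmark harness.

```python
import numpy as np

# Hypothetical layout: latencies[vm][user][iteration][op] holds the measured
# latency (in seconds) of each of the 22 workload operations.
def aggregate_latency(latencies):
    samples = []
    for vm in latencies:
        for user in vm:
            # Sum the 22 operation latencies within each iteration...
            per_iteration = [sum(ops) for ops in user]
            # ...and keep only the middle four of the six iterations.
            samples.extend(per_iteration[1:5])
    # Average over the middle iterations, all users, and all VMs.
    return np.mean(samples)
```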

[Figure: average aggregate latency vs. total number of users, ESX (8 VMs) and Native]

In both the Native and ESX cases all 16 cores are being used (although at much less than 100% utilization). At very low load Native has somewhat better total latency, but beyond 80 users the latency degrades quickly. Starting at 140 users, some of the sessions begin to fail, so 120 users is really the upper limit for running this workload on Native. With 8 VMs on ESX, 20 users per VM (160 total) was not a problem at all, so we pushed the load up to 240 total users. At this point the latency is getting high, but there were no failures and all of the desktop applications were still usable. The load has to be increased to more than 200 users on ESX before the latency exceeds that of 120 users on Native. That is, for a Quality-of-Service standard of 39 seconds aggregate latency, ESX supports 67% more users than Native. Like many commonly deployed applications, XenApp has limited SMP scalability: roughly speaking, it scales better than common web apps but not as well as well-tuned databases. When running this workload, XenApp scales well to 4 CPUs, but 8 CPUs is marginal and 16 CPUs is clearly too many. Dividing the load among smaller VMs avoids these SMP scaling issues and allows the full capabilities of the hardware to be utilized.
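As a sanity check on that comparison, the load supported at a given Quality-of-Service level can be read off the latency curves mechanically. The sketch below is purely illustrative, not part of the benchmark tooling: it picks the largest tested user count whose measured aggregate latency stays under the threshold.

```python
def users_at_qos(measurements, threshold_s=39.0):
    """Largest tested user count whose aggregate latency stays under the
    QoS threshold. `measurements` is a list of (total_users, latency_s)
    pairs, one per run, taken from a curve like the one above."""
    under = [users for users, latency in measurements if latency <= threshold_s]
    return max(under) if under else None
```

Applied to the Native and ESX curves above, this yields roughly 120 and just over 200 users respectively, which is where the 67% figure comes from.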

Some would say that even 200 XenApp users are not very many for such a powerful machine. In any benchmark of this kind many decisions have to be made with regard to the choice of applications, operations within each application, and amount of “user think time” between operations. As we pointed out earlier, we strove to make realistic choices when designing the VDI workload. However, one may choose to model users performing less resource-intensive operations and thus be able to support more of them.

The scale-out performance of ESX is quantified in the second chart, which shows the total latency as a function of the number of VMs with each VM running a low (10 users), medium (20 users), or high (30 users) load. Flat lines would indicate perfect scalability, and the curves are in fact nearly flat for each of the load cases up to 4 VMs. The latency increases noticeably only for 8 VMs, and then only at the higher loads. This indicates that the increased application latency is mostly due to the increased memory latency caused by running 2 VMs per NUMA node (as opposed to at most a single VM per node when running four or fewer VMs).
[Figure: total latency vs. number of VMs at low, medium, and high per-VM loads]
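One simple way to quantify how close those curves come to being flat is to normalize each point against the single-VM case at the same per-VM load. The helper below is a hypothetical post-processing step of that kind, not something taken from the benchmark harness.

```python
def scale_out_degradation(latency_by_vm_count):
    """Latency relative to the 1-VM baseline at a fixed per-VM load.
    `latency_by_vm_count` maps the number of VMs (1, 2, 4, 8) to the
    measured aggregate latency. Values near 1.0 mean near-perfect
    scale-out; larger values show the degradation described above."""
    baseline = latency_by_vm_count[1]
    return {n: lat / baseline for n, lat in sorted(latency_by_vm_count.items())}
```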

While our first blog showed how low the overhead is for running XenApp in a single 2-vCPU VM on ESX compared to a native OS booted with two CPUs, the current results, which fully utilize the 16-core machine, are even more compelling. They show the excellent scale-out performance of ESX on a modern NUMA machine, and that the aggregate performance of several VMs can far exceed the capabilities of a single native OS.