Home > Blogs > VMware VROOM! Blog > Author Archives: Jeffrey Buell

Author Archives: Jeffrey Buell

Virtualized Hadoop Performance with vSphere 6

A recently published whitepaper shows that not only can vSphere 6 keep up with newer high-performance servers, it thrives on their capabilities.

Two years ago, Hadoop benchmarks were run with vSphere 5.1 on a cluster of 32 dual-socket, quad-core servers. Very good performance was demonstrated, with the optimal virtualized configuration shown to be actually 2% faster than native for TeraSort (see the previous whitepaper).

These benchmarks were recently run on a cluster of the same size, but with ten-core processors, more disks and memory, dual 10GbE networking, and vSphere 6. The maximum dataset size was almost quadrupled to 30TB, to ensure that it is much bigger than the total memory in the cluster (hence qualifying the test as Big Data, by one definition).

The results, summarized in the chart below, show that the optimal virtualized configuration now delivers 12% better performance than native for TeraSort. The primary reason for this excellent performance is the ability of vSphere to map physical hardware resources to virtual hardware that is optimized for scale-out applications. The observed trend, as well as theory based on processor characteristics, indicates that the importance of being able to do this mapping correctly increases as processors become more powerful. The sub-optimal performance of one of the tests is due to the combination of very small VMs and how Hadoop does replication during data creation. On the other hand, small VMs are very advantageous for read-dominated applications, which are typically more common. Taken together with other best practices discussed in the paper, this information can be used to configure Hadoop clusters for the highest levels of performance. Despite all the hardware and software changes over the past two years, the optimal configuration was still found to be four VMs per dual-socket host.

elapsed_time_ratioPlease take a look at the whitepaper for more details on how these benchmarks were run and for analyses on why certain virtual configurations perform so well.

Virtualized Hadoop Performance with vSphere 5.1

In an earlier paper on a small seven-host cluster it was shown that Hadoop can be virtualized with little overhead, and that better-than-native performance can be achieved with the right configuration. However, the reasons for the observed performance behavior were not well understood. Recently, this work was refreshed with a larger cluster of 32 high-performance hosts running VMware vSphere® 5.1. The performance of native and several virtual configurations was compared for three applications. The apples-to-apples case of a single virtual machine per host shows performance close to that of native. Improvements in elapsed time of up to 13% for the most important application (TeraSort) can be achieved by partitioning each host into two or four virtual machines, resulting in competitive or even better than native performance as shown in the figure below (number of VMs is per host, and a lower ratio is better). Details of the results are in a new whitepaper: “Virtualized Hadoop Performance with VMware vSphere 5.1“. The paper also discusses the use of several performance tools and models to gain a better understanding of both the sources of virtualization overhead and the reasons why configuring multiple smaller virtual machines per host can enhance performance. Based on this, recommendations for optimal hardware and software configuration are also given.

Protecting Hadoop with VMware vSphere 5 Fault Tolerance

Hadoop provides a platform for building distributed systems for massive data storage and analysis. It has internal mechanisms to replicate user data and to tolerate many kinds of hardware and software failures. However, like many other distributed systems, Hadoop has a small number of Single Points of Failure (SPoFs). These include the NameNode (which manages the Hadoop Distributed Filesystem namespace and keeps track of storage blocks), and the JobTracker (which schedules jobs and the map and reduce tasks that make up each job). VMware vSphere Fault Tolerance (FT) can be used to protect virtual machines that run these vulnerable components of a Hadoop cluster. Recently a cluster of 24 hosts was used to run three different Hadoop applications to show that such protection has only a small impact on application performance. Various Hadoop configurations were employed to artificially create greater load on the NameNode and JobTracker daemons. With conservative extrapolation, these tests show that uniprocessor virtual machines with FT enabled are sufficient to run the master daemons for clusters of more than 200 hosts.

A new white paper, “Protecting Hadoop with VMware vSphere 5 Fault Tolerance,” is now available in which these tests are described in detail. CPU and network utilization of the protected VMs are given to enable comparisons with other distributed applications. In addition, several best practices are suggested to maximize the size of the Hadoop cluster that can be protected with FT.

Virtualized Hadoop Performance on vSphere 5

In recent years the amount of data stored worldwide has exploded. This has led to the birth of the term 'Big Data'. While the scale of data brings with it complexity associated with storing and handling it, these large datasets are known to have business information buried in them that is critical to continued growth and success.  The last few years have seen the birth of several new tools to manage and analyze such large datasets in a timely way (where traditional tools have had limitations). A natural question to ask is how these tools perform on vSphere. As the start of an ongoing effort to qauntify the performance of big data tools on vSphere, we've chosen to test one of the more popular tools – Hadoop.

Hadoop has emerged as a popular platform for the distributed processing of data. It scales to thousands of nodes while maintaining resiliency to disk, node, or even rack failure. It can use any storage, but is most often used with local disks. A whitepaper giving an overview of Hadoop and the details of tests on commondity hardware with local storage is available here.  One of the findings in the paper is that running 2 or 4 smaller VMs per physical machine usually resulted in better performance, often exceeding native performance.

As we continue our performance testing, stay tuned for results on a larger cluster with bigger data, with other Big Data tools, and on shared storage.

HPC Application Performance on ESX 4.1: Memory Virtualization

This is the third part in an ongoing series on exploring performance issues of virtualizing HPC applications. In the first part, we described the setup and considered pure memory bandwidth using Stream. The second part considered the effect of network latency in a scientific application (NAMD) that ran across several virtual machines.  Here we look at two of the tests in the HPC Challenge (HPCC) suite:  StarRandomAccess and HPL. While certainly not spanning all possible memory access patterns found in HPC apps, these two tests are very different from each other and should help to give bounds on virtualization overhead related to these patterns.

Virtualization adds indirection to memory page table mappings: in addition to the logical-to-physical mappings maintained by the OS (either native or in a VM), the hypervisor must maintain guest physical-to-machine mappings. A straightforward implementation of both mappings in software would result in enormous overhead. Prior to the introduction of hardware MMU features in Intel (EPT) and AMD (RVI) processors, the performance problem was solved through the use of “shadow” page tables. These collapsed the two mappings to one so that the processor TLB cache could be used efficiently; however, updating shadow page tables is expensive. With EPT and RVI, both mappings are cached in the TLB, eliminating the need for shadow page tables. The trade-off is that a TLB miss can be expensive: the cost is not just double the cost of a miss in a conventional TLB; it is the square of the number of steps in the TLB page walk. This cost can be reduced by using large memory pages (2MB in x86_64) which typically need four steps in the TLB, rather than small pages (4KB) which need five. This overview is highly simplified; see the performance RVI and EPT whitepapers for much more detail about MMU virtualization, as well as results from several benchmarks representing enterprise applications. Here we extend the EPT paper to HPC apps running on a current version of vSphere.

Although there are certainly exceptions, two memory characteristics are common to HPC applications: a general lack of page table manipulation, and heavy use of memory itself. Memory is allocated once (along with the associated page tables) and used for a long time. This use can either be dominated by sequential accesses (running through an array), or by random accesses. The latter will put more stress on the TLB. Common enterprise apps are often the opposite: much heavier page table activity but lighter memory usage. Thus HPC apps do not benefit much from the elimination of shadow page tables (this alone made many enterprise apps run close to native performance as shown in the above papers), but may be sensitive to the costs of TLB misses.

These points are illustrated by two tests from the HPCC suite. StarRandomAccess is a relatively simple microbenchmark that continuously accesses random memory addresses. HPL is a standard floating-point linear algebra benchmark that accesses memory more sequentially. For these tests, version 1.4.1 of HPCC was used on RHEL 5.5 x86_64. Hyper-threading was disabled in the BIOS and all work was limited to a single socket (automatically in the virtual cases and forced with numactl for native). In this way, the effects of differences between native and virtual in how HT and NUMA are treated were eliminated. For virtual, a 4-vCPU VM with 22GB was used on a lab version (build 294208) of ESX 4.1.  The relevant HPCC parameters are N=40000, NB=100, P=Q=2, and np=4. These values ensure all the CPU resources and nearly all the available memory of one socket was consumed, thereby minimizing memory cache effects. The hardware is the same as in the first part of this series. In particular, Xeon X5570 processors with EPT are used.

Throughput results for StarRandomAccess are shown in Table 1. The metric GUP/s is billions of updates per second, a measure of memory bandwidth. Small/large pages refers to memory allocation in the OS and application. For virtual, ESX always backs guest memory with large pages, if possible (as it is here). The default case (EPT enabled, small pages in the guest) achieves only about 85% of native throughput.  For an application with essentially no I/O or privileged instructions that require special handling by the hypervisor, this is surprisingly poor at first glance. However, this is a direct result of the hardware architecture needed to avoid shadow page tables. Disabling EPT results in near-native performance because, now, the TLB costs are essentially the same as for native and the software MMU costs are minimal. TLB costs are still substantial as seen by the effect of using large pages in native and the guest OS: more than doubling of the performance. The virtualization overhead is reduced to manageable levels, although there is still a 2% benefit from disabling EPT.

Table 1.  StarRandomAccess throughput, GUP/s (ratio to native)

  Native Virtual
EPT on EPT off
Small pages 0.01842 0.01561 (0.848) 0.01811 (0.983)
Large pages 0.03956 0.03805 (0.962) 0.03900 (0.986)

Table 2 shows throughput results for HPL. The metric Gflops/s is billions of floating point operations per second. Memory is largely accessed sequentially, greatly reducing the stress on the TLB and the effect of large pages. Large pages improve virtual performance by 4%, but improve native performance by less than 2%. Disabling EPT improves virtual performance by only 0.5%. It is not clear why virtual is slightly faster than native in the large pages case; this will be investigated further.

Table 2.  HPL throughput, Gflop/s (ratio to native)

  Native Virtual
EPT on EPT off
Small pages 37.04 36.04 (0.973) 36.22 (0.978)
Large pages 37.74 38.24 (1.013) 38.42 (1.018)

While hardware MMU virtualization with Intel EPT and AMD RVI has been a huge benefit for many applications, these test results support the expectation that the benefit for HPC apps is smaller, and can even increase overhead in some cases. However, the example shown here where the latter is true is a microbenchmark that focuses on the worst case for this technology. Most HPC apps will not have so many random memory accesses, so the effect of EPT is likely to be small.


HPC Application Performance on ESX 4.1: NAMD

This is the second part in an on-going series on exploring performance issues of virtualizing HPC applications. In the first part we described the setup and considered memory bandwidth. Here we look at network latency in the context of a single application. Evaluating the effect of network latency in general is far more difficult since HPC apps range from ones needing micro-second latency to embarrassingly-parallel apps that work well on slow networks. NAMD is a molecular dynamics code that is definitely not embarrassingly parallel but is known to run fine over 1 GbE TCP/IP, at least for small clusters. As such it represents the network requirements of an important class of HPC apps.

NAMD is a molecular dynamics application used to investigate the properties of large molecules. NAMD supports both shared memory parallelism and multiple-machine parallelism using TCP/IP. The native results use up to 16 processes on a single machine (“local” mode). Future work will use multiple machines, but some idea of the performance issues involved can be obtained by running multiple VMs in various configurations on the same physical host. The benchmark consists of running 500 steps of the Satellite Tobacco Mosaic Virus. “STMV” consists of slightly over 1 million atoms, which is large enough to enable good scaling on fairly large clusters. Shown below are elapsed time measurements for various configurations. Each is an average of 3 runs and the repeatability is good. The virtual NIC is e1000 for all the virtualized cases.

An apples-to-apples comparison between native and virtual is obtained by disabling HT and using a single 8-vCPU VM. The VM is configured with 12GB and default ESX parameters are used. With no networking, the virtual overhead is just 1% as shown in Table 1.

Table 1. NAMD elapsed time, STMV molecule, HT disabled

Total Processes Native Virtual
4 1748 1768
8 915 926

The effect of splitting the application across multiple machines and using different network configurations can be tested in a virtual environment. For these tests HT is enabled to get the full performance of the machine. The single VM case is configured as above. The 2-VM cases are configured with 12GB, 8 vCPUs, and preferHT=1 (so each VM can be scheduled on a NUMA node). The 4-VM cases have 6GB, 4 vCPUs, and preferHT=0. When multiple VMs communicate using the same vSwitch, ESX handles all the traffic in memory. For the multiple vSwitch cases, each vSwitch is associated with a physical NIC which is connected to a physical switch. Since all networking traffic must go through this switch, this configuration will be the same as using multiple hosts in terms of inter-VM communication latencies. An overview of vSwitches and networking in ESX is available here.

Table 2. NAMD elapsed time, STMV molecule, HT enabled

Total Processes 4 8 16
Native 1761 1020 796
1 VM 1766 923 N/A
2 VMs, 1 vSwitch 1779 928 787
2 VMs, 2 vSwitches 1800 965 806
4 VMs, 1 vSwitch 1774 940 810
4 VMs, 4 vSwitches 1885 1113 903

The single VM case shows that HT has little effect on ESX performance when the extra logical processors are not used. However, HT does slow down the native 8 process case significantly. This appears to be due to Linux not scheduling one process per core when it has the opportunity, which the ESX scheduler does by default. Scalability from 4 to 8 processes for the single vSwitch cases is close to 1.9X, and from 8 to 16 processes (using the same number of cores, but taking advantage of HT) it is 1.17X. This is excellent scaling. Networking over the switch reduces the performance somewhat, especially for four vSwitches. Scaling for native is hurt because the application does not manage NUMA resources itself, and Linux is limited by how well it can do this. This allows one of the 16-process virtualized cases to be slightly faster than native, despite the virtualization and multiple-machine overheads. The 16-process cases have the best absolute performance, and therefore correspond to how NAMD would actually be configured in practice. Here, the performance of all the virtualized cases is very close to native, except for the 4-vSwitch case where the extra overhead of networking has a significant effect. This is expected and should not be compared to the native case since the virtual case models four hosts. We plan to investigate multiple-host scaling soon to enable a direct comparison. A useful simulation needs up to 10 million steps, which would only be practical on a large cluster and only if all the software components scale very well.

These tests show that a commonly-used molecular dynamics application can be virtualized on a single host with little or no overhead. This particular app is representative of HPC workloads with moderate networking requirements. Simulating four separate hosts by forcing networking to go outside the box causes a slowdown of about 12%, but it is likely the corresponding native test will see some slowdown as well. We plan to expand the testing to multiple hosts and to continue to search for workloads that test the boundaries of what is possible in a virtualized environment.

Next: Memory

HPC Application Performance on ESX 4.1: Stream

Recently VMware has seen increased interest in migrating High Performance Computing (HPC) applications to virtualized environments. This is due to the many advantages virtualization brings to HPC, including consolidation, support for heterogeneous OSes, ease of application development, security, job migration, and cloud computing (all described here). Currently some subset of HPC applications virtualize well from a performance perspective. Our long-term goal is to extend this to all HPC apps, realizing that large-scale apps with the lowest latency and highest bandwidth requirements will be the most challenging. Users who run HPC apps are traditionally very sensitive to performance overhead, so it is important to quantify the performance cost of virtualization and properly weigh it against the advantages. Compared to commercial apps (databases, web servers, and so on), which are VMware’s bread-and-butter, HPC apps place their own set of requirements on the platform (OS/hypervisor/hardware) in order to execute well. Two common ones are low-latency networking (since a single app is often spread across a cluster of machines) and high memory bandwidth. This article is the first in a series that will explore these and other aspects of HPC performance. Our goal will always be to determine what works, what doesn’t, and how to get more of the former. The benchmark reported on here is Stream, which is a standard tool designed to measure memory bandwidth. It is a “worst case” micro-benchmark; real applications will not achieve higher memory bandwidth.


All tests were performed on an HP DL380 with two Intel X5570 processors, 48 GB memory (12 × 4 GB DIMMs), and four 1-GbE NICs (Intel Pro/1000 PT Quad Port Server Adapter) connected to a switch. Guest and native OS is RHEL 5.5 x86_64. Hyper-threading is enabled in the BIOS, so 16 logical processors are available. Processors and memory are split between two NUMA nodes. A pre-GA lab version of ESX 4.1 was used, build 254859.

Test Results

The OpenMP version of Stream is used. It is built using a compiler switch as follows:

gcc -O2 -fopenmp stream.c -o stream

The number of simultaneous threads is controlled by an environment variable:


The array size (N) and number of iterations (NTIMES) are hard-wired in the code as N=108 (for a single machine) and NTIMES=40. The large array size ensures that the processor cache provides little or no benefit. Stream reports maximum memory bandwidth performance in MB/sec for four tests: copy, scale, add, and triad (see the above link for descriptions of these). M stands for 1 million, not 220. Here are the native results, as a function of the number of threads:

Table 1. Native memory bandwidth, MB/s

Threads 1 2 4 8 16
Copy 6388 12163 20473 26957 26312
Scalar 5231 10068 17208 25932 26530
Add 7070 13274 21481 29081 29622
Triad 6617 12505 21058 29328 29889

Note that the scaling starts to fall off after two threads and the memory links are essentially saturated at 8 threads. This is one reason why HPC apps often do not see much benefit from enabling Hyper-Threading. To achieve the maximum aggregate memory bandwidth in a virtualized environment, two virtual machines (VMs) with 8 vCPUs each were used. This is appropriate only for modeling apps that can be split across multiple machines. One instance of stream with N=5×107 was run in each VM simultaneously so the total amount of memory accessed was the same as in the native test. The advanced configuration option preferHT=1 is used (see below). Bandwidths reported by the VMs are summed to get the total. The results are shown in Table 2: just slightly greater bandwidth than for the corresponding native case.

Table 2. Virtualized total memory bandwidth, MB/s, 2 VMs, preferHT=1

Total Threads 2 4 8 16
Copy 12535 22536 27606 27104
Scalar 10294 18824 26781 26537
Add 13578 24182 30676 30537
Triad 13070 23476 30449 30010

It is apparent that the Linux “first-touch” scheduling algorithm together with the simplicity of the Stream algorithm are enough to ensure that nearly all memory accesses in the native tests are “local” (that is, the processor each thread runs on and the memory it accesses both belong to the same NUMA node). In ESX 4.1 NUMA information is not passed to the guest OS and (by default) 8-vCPU VMs are scheduled across NUMA nodes in order to take advantage of more physical cores. This means that about half of memory accesses will be “remote” and that in the default configuration one or two VMs must produce significantly less bandwidth than the native tests. Setting preferHT=1 tells the ESX scheduler to count logical processors (hardware threads) instead of cores when determining if a given VM can fit on a NUMA node. In this case that forces both memory and CPU of an 8-vCPU VM to be scheduled on a single NUMA node. This guarantees all memory accesses are local and the aggregate bandwidth of two VMs can equal or exceed native bandwidth. Note that a single VM cannot match this bandwidth. It will get either half of it (because it’s using the resources of only one NUMA node), or about 70% (because half the memory accesses are remote). In both native and virtual environments, the maximum bandwidth of purely remote memory accesses is about half that of purely local. On machines with more NUMA nodes, remote memory bandwidth may be less and the importance of memory locality even greater.


In both native and virtualized environments, equivalent maximum memory bandwidth can be achieved as long as the application is written or configured to use only local memory. For native this means relying on the Linux “first-touch” scheduling algorithm (for simple apps) or implementing explicit mechanisms in the code (usually difficult if the code wasn’t designed for NUMA). For virtual a different mindset is needed: the application needs to be able to run across multiple machines, with each VM sized to fit on a NUMA node. On machines with hyper-threading enabled, preferHT=1 needs to be set for the larger VMs. If these requirements can be met, then a valuable feature of virtualization is that the app needs to have no NUMA awareness at all; NUMA scheduling is taken care of by the hypervisor (for all apps, not just for those where Linux is able to align threads and memory on the same NUMA node). For those apps where these requirements can’t be met (ones that need a large single instance OS), current development focus is on relaxing these requirements so they are more like native, while retaining the above advantage for small VMs.

Next: NAMD

Scale-out of XenApp on ESX 3.5

In an earlier posting (Virtualizing XenApp on XenServer 5.0 and ESX 3.5) we looked at the performance of virtualizing a Citrix XenApp workload in a 2-vCPU VM in comparison to the native OS booted with two cores. This provided valuable data about the single-VM performance of XenApp running on ESX 3.5. In our next set of experiments we used the same workload, and the same hardware, but scaled out to 8 VMs. This is compared to the native OS booted with all 16 cores. We found that ESX has near-linear scaling as the number of VMs is increased, and that aggregate performance with 8 VMs is much better than native.

We expected the earlier single-VM approach to produce representative results because of the excellent scale-out performance of ESX 3.5. This is especially true on NUMA architectures where VMs running on different nodes are nearly independent in terms of CPU and memory resources. However, the same cannot be said for the scale-up performance (SMP scaling) of a single native machine, or a single VM. As for many other applications, virtualizing many relatively small XenApp servers on a single machine can overcome the inherent SMP performance limitations of XenApp on the same machine.

In the current experiments, each VM is the same as before, except the allocated memory is set to 6700 MB (the amount needed to run 30 users). Windows 2003 x64 was used in both the VMs and natively. See the above posting for more workload and configuration details. Shown below is the average aggregate latency as a function of the total number of users. Every data point shown is a separate run with about 4 hours of steady state execution. Each user performs six iterations where a complete set of 22 workload operations is performed during each iteration. The latency of these operations is summed to get the aggregate latency. The average is over the middle four iterations, all the users, and all the VMs.


In both the Native and ESX cases all 16 cores are being used (although with much less than 100% utilization). At very low load Native has somewhat better total latency, but beyond 80 users the latency quickly degrades. Starting at 140 users some of the sessions start to fail. 120 users is really the upper limit for running this workload on Native. With 8 VMs on ESX, 20 users per VM (160 total) was not a problem at all, so we pushed the load up to 240 total users. At this point the latency is getting high, but there were no failures and all of the desktop applications were still usable. The load has to be increased to more than 200 users on ESX before the latency exceeds that from 120 users on Native. That is, for a Quality-of-Service standard of 39 seconds aggregate latency, ESX supports 67% more users than Native. Like many commonly-deployed applications, XenApp has limited SMP scalability. Roughly speaking, its scalability is better than common web apps but not as good as well-tuned databases. When running this workload, XenApp scales well to 4 CPUs, but 8 CPUs is marginal and 16 CPUs is clearly too many. Dividing the load among smaller VMs avoids SMP scaling issues and allows the full capabilities of the hardware to be utilized.

Some would say that even 200 XenApp users are not very many for such a powerful machine. In any benchmark of this kind many decisions have to be made with regard to the choice of applications, operations within each application, and amount of “user think time” between operations. As we pointed out earlier, we strove to make realistic choices when designing the VDI workload. However, one may choose to model users performing less resource-intensive operations and thus be able to support more of them.

The scale-out performance of ESX is quantified in the second chart, which shows the total latency as a function of the number of VMs with all VMs running either a low (10 users), medium (20), or high (30) load each. Flat lines would indicate perfect scalability. They are actually nearly so for the each of the load cases up to 4 VMs. The latency increases noticeably only for 8 VMs, and then only for higher loads. This indicates that the increased application latency is mostly due to the increased memory latency caused by running 2 VMs per NUMA node (as opposed to at most a single VM per node for four or fewer VMs).

While our first blog showed how low the overhead is for running XenApp on a single 2-vCPU VM on ESX compared to a native OS booted with two CPUs, the current results fully utilizing the 16 core machine are even more compelling. These show the excellent scale-out performance of ESX on a modern NUMA machine, and that the aggregate performance of several VMs can far exceed the capabilities of a single native OS.

Virtualizing XenApp on XenServer 5.0 and ESX 3.5

There has always been interest in running Citrix XenApp (formerly Citrix Presentation Server) workloads on the VMware Virtual Infrastructure platform. With the advent of multi-core systems, purchasing decisions are driven towards systems with 4-16 cores. However, using this hardware effectively is difficult due to limited scaling of the XenApp application environment. In addition to the usual benefits of virtualization, these scaling issues make running XenApp environments on ESX even more compelling.

recently ran some performance tests to understand what can be expected
in terms of performance for a virtualized XenApp workload. The results
show that ESX runs common desktop applications on XenApp with
reasonable overhead compared to a native installation, and with
significantly better performance than XenServer. We hope this data will
help provide guidance when
XenApp environments are transitioned from physical hardware to a virtualized environment.

Together with partners, we have been developing a desktop workload for over a year. The workload has been tested extensively on virtual desktop infrastructure (VDI) environments with one user per virtual machine (VM). VDI results have been presented and published in numerous locations (e.g., http://www.vmware.com/resources/techresources/1085, VMworld 2008 presentation VD2505 with Dell-EqualLogic). Great attention was paid to selecting the most relevant applications as well as to specifying the right types and amount of work each should do. Many other Terminal Services-style benchmarks fail to be representative of actual desktop users. Porting the workload from a VDI environment to the XenApp environment was straightforward.

XenApp was run in a single 14 GB 2-vCPU Virtual Machine (VM) booted with Windows Server 2003 x64. The hypervisors used were ESX 3.5 U3 and XenServer 5. The VMs for both had the appropriate tools/drivers installed. The XenServer VM had the Citrix XenApp optimization enabled. For comparison, the tests were run natively with the OS restricted to the same hardware resources. The hardware is a HP DL585 with 4 quad-core 2210 MHz “Barcelona” processors and 64 GB memory. Rapid Virtualization Indexing (RVI) was enabled.

The test consists of 22 operations, always executed in the following order:


Open Internet Explorer


Browse photos in IE


Open Excel file


Evaluate formula in Excel


Save Excel file


Open Firefox


Close Firefox


Open PDF file


Browse PDF file


Open PowerPoint file


Slideshow in PowerPoint


Edit PowerPoint file


Append to PowerPoint file


Save PowerPoint file


Open Word file


Modify Word file


Save Word file


Open Internet Explorer


Browse Apache doc in IE


Open Excel file


Sort column in Excel


Save Excel file


A “sleep” of random length is inserted between each operation to simulate user think time. One execution of the whole set of operations is called an “iteration” and takes about 57 minutes. Several of these operations consist of many sub-operations. For instance, the PPT_SLIDESHOW operation consists of 10 sub-operations where each displays a slide in a PowerPoint document followed by a pause. Only the latency to display the slide is timed, and not the time spent sleeping. The latencies of the sub-operations are summed to give the operation latency, and all the operation latencies within one iteration of one user are summed to yield the “total latency”. AutoIt3, an open-source scripting language, is used on the server side to automate the operations. CSTK Client Launcher (a utility that allows the tester to create and launch multiple ICA client sessions) is used on a client machine to start the users (sessions). Each user is started in a staggered fashion so that the last user is starting when the first user is close to finishing its first iteration. This strategy avoids synchronizing the execution of any operation across users. Each user runs six iterations. The “average total latency” is determined by averaging all the total latencies across the middle four iterations (i.e., the ones where all users are running at steady state), and across all users. Note that it is important to time many different kinds of desktop applications: timing just a few operations (or even just one as has been done in other publications) can give a very distorted view of overall performance. With a similar philosophy we gather CPU data over nearly four hours of steady state to ensure the utilization statistics are solid. The first figure shows the average total latency as a function of the number of users for XenServer, ESX, and Native.


The two horizontal lines labeled “QoS” denote the Native latency for 35 and 38 users. Either of these may be considered as a reasonable maximum Quality of Service for latency. They correspond to somewhat less or more, respectively, of half of the available CPU resources (see the CPU figure below), which is a commonly used target for XenApp. At higher utilizations not only does the latency increase rapidly but operations may start to fail. We required that all operations succeed (just like a real user expects!) for a test to be deemed successful. The points where the QoS lines cross the ESX  and XenServer curves gives the number of users that can be supported with the same total latency. Normalizing with the number of Native users (35 or 38) gives the fraction of Native users each virtualization product can support at the given total latency:


ESX consistently supports about 86% of the native number of users, while XenServer supports about 77%. Shown below is the average CPU utilization during the second to fifth iteration of the last user, given as a percentage of a single core. Perfmon was used for Native, esxtop for ESX, and xentop for XenServer. ESX uses less CPU than XenServer no matter how the comparison is made: for a given number of users, or for a given total latency:


XenApp and other products that virtualize applications are prime candidates to be run in a VM. These results show that ESX can do so efficiently compared to using a physical machine. This was shown with a benchmark that: represents a real desktop workload, uses a metric that includes latencies of all operations, and requires that all operations complete successfully. Furthermore, ESX supports about 13% more users than XenServer at a given latency while using less CPU.

ESX Runs Java Virtual Machines with Near-Native Performance

Java workloads are becoming increasingly common-place in the datacenter, and consequently Java benchmarks are among the most frequently run server performance tests. In virtualized environments, Single-VM Java performance tests are common but there are a number of reasons why they are not interesting or particularly relevant to production systems. Primary among these is system administrators nearly always run multiple VMs to make better use of their multi-core systems. Benchmarking should reflect this. While ESX performs very well on single-VM tests, we show here that ESX also achieves very close to native performance for configurations that are both realistic and well-tuned.

There are several subtleties in comparing Java performance on native and virtualized platforms:

  • Java applications are often split into multiple Java Virtual Machines (JVMs). In particular, nearly all the published SPECjbb2005 (the most widely-used Java benchmark) tests are run this way, usually with 2 CPU cores per JVM. Since this is done in order to achieve best performance, the appropriate virtualized analog is to run each JVM in a separate virtual machine (VM), with 2 virtual CPUs (vCPUs) per VM. This also allows all the advantages of virtualization to be gained (scheduling flexibility, resource allocation, high availability, etc.).
  • A native machine is often booted with a reduced number of CPUs in order to compare with a small VM. However, such experiments are usually irrelevant because different subsets of the hardware are used for the various cases, or simply because fewer resources are used than is normal. It is better to fully utilize all CPUs for all tests.
  • Disabling cache line prefetching in the BIOS and enabling large pages in the OS typically help Java application performance greatly. These are recommended for both benchmarking and general production use. On the other hand, CPU affinity can improve benchmark performance significantly but is not used here (nor is it recommended for production) due to its inflexibility.

We performed a number of experiments to study the performance of SPECjbb2005 running natively and on VMware Infrastructure 3. We first ran SPECjbb2005 natively in multi-JVM mode on a 2 socket machine equipped with 3 GHz Clovertown quad-core processors (8 cores total). The OS is Windows Server 2008 x64 booted with all 32 GB and 8 CPUs of the machine. Four JVMs were used, each with a 3700 MB Java heap. Details of the software and hardware configurations are given below. The reported throughput is 236,789 bops, which compares well with published native scores on similar hardware of around 252,000 bops (the difference is mostly accounted for by the use of CPU affinity for the published results). See, for example, http://www.spec.org/jbb2005/results/res2007q4/. We focused on out-of-the-box performance so little effort was spent on tuning; however reasonable performance is necessary to make the results credible.

For the virtualized case we ran conforming single-JVM tests in each of 4 identical VMs on ESX 3.5 U1. Each VM was booted with 2 vCPUs and 5 GB memory but otherwise the software and virtual hardware configurations were the same as the native setup. As with the native multi-JVM case, subtests with 1 through 8 warehouses were run and performance for warehouses 2 – 4 were averaged for each JVM/VM (as per the run rules). Unlike the multi-JVM case, the synchronization of the tests across the 4 VMs is never perfect; however it was never worse than 5 seconds out of the 12 minute measurement interval (4 minutes per subtest). Plus, lack of synchronization cannot give a performance advantage to any one VM, since all VMs continue to run (warehouses 5 – 8) after the measurement interval. In the first figure, we present the individual scores of the VMs and compare them to the individual scores of the native JVMs plus their average (reported throughput divided by the number of JVMs).


Each of the VMs is just 2.2 – 2.5% slower than the average of the native JVMs. Furthermore, the ESX scheduler is able to automatically ensure that each VM gets essentially the same share of the computational resources; fairness is nearly perfect. The native scheduler has a hard time doing this with JVMs: the performance of individual JVMs ranges from 11% slower to 6% faster than the average. Using CPU affinity in the native case completely fixes the fairness issue as well as increases the performance by 5%.

So why not use CPU affinity all the time? While it is often helpful for benchmarks it is usually impractical to use it effectively in real production systems. Instead of 4 VMs/JVMs which fit very nicely on our 8 core machine, what if a different number is required for other reasons (isolation, etc.)? Then affinity makes no sense at all. A little “reality” is introduced here by repeating the above tests with 5 JVMs for Native and 5 VMs for ESX. No other changes were needed except that expected_peak_warehouse had to be manually set to 2 for the native case. The reported throughput for Native was 233,073 bops, which is only a 1.6% drop from the 4 JVM test.


The VMs running on ESX are 1.3 – 3.4% slower than the average of the native JVMs. The fairness of ESX is excellent. The performance of the native JVMs ranges from 27% faster to 30% slower than the same average. This poor fairness means the performance of individual JVMs is unpredictable, even though the overall performance is good.

These results show that ESX has very little overhead for this CPU-intensive Java application, in both fully-committed and over-committed scenarios. In addition, the ESX scheduler ensures excellent fairness across the VMs. This is important for predictability of the performance of individual VMs and allows more precise resource management.

Benchmark configuration

Application:             SPECjbb2005 version 1.07

4 or 5 JVMs for Native, single JVM per VM for ESX

expected_peak_warehouse = 2

Subtests: 1 – 8 warehouses, average throughput for 2 – 4 warehouses

Java:                       BEA JRockit R27.5 64 bit for Windows

Java options:          -Xmx3700m -Xms3700m -Xns3000m -XXaggressive -Xgc:genpar

-XXgcthreads=2 -XXthroughputCompaction –XxlazyUnlocking

-XXtlaSize:min=4k,preferred=512k -XXcallProfiling

OS:                           Windows Server 2008 x64 Enterprise Edition

Large pages enabled with ‘Lock pages in memory’ in Local Security Settings

Hardware:              HP DL380 G5, 2 sockets, Xeon X5365 (3 GHz Clovertown), 32 GB

BIOS:                       Disabled ‘Hardware Prefetcher’ and ‘Adjacent Cache Line Prefetch’

Hypervisor:            ESX Server 3.5.0 U1 build 90092

Virtual hardware:  2 CPUs, 5 GB memory