ESX Runs Java Virtual Machines with Near-Native Performance

Java workloads are becoming increasingly commonplace in the datacenter, and consequently Java benchmarks are among the most frequently run server performance tests. In virtualized environments, single-VM Java performance tests are common, but there are a number of reasons why they are neither interesting nor particularly relevant to production systems. Primary among these is that system administrators nearly always run multiple VMs to make better use of their multi-core systems, and benchmarking should reflect this. While ESX performs very well on single-VM tests, we show here that ESX also achieves very close to native performance for configurations that are both realistic and well-tuned.

There are several subtleties in comparing Java performance on native and virtualized platforms:

- Java applications are often split into multiple Java Virtual Machines (JVMs). In particular, nearly all the published tests for SPECjbb2005 (the most widely used Java benchmark) are run this way, usually with 2 CPU cores per JVM. Since this is done to achieve the best performance, the appropriate virtualized analog is to run each JVM in a separate virtual machine (VM), with 2 virtual CPUs (vCPUs) per VM. This also allows all the advantages of virtualization to be gained (scheduling flexibility, resource allocation, high availability, etc.).
- A native machine is often booted with a reduced number of CPUs in order to compare with a small VM. However, such experiments are usually irrelevant, either because different subsets of the hardware are used for the various cases or simply because fewer resources are used than is normal. It is better to fully utilize all CPUs for all tests.
- Turning off cache line prefetching in the BIOS and enabling large pages in the OS typically help Java application performance greatly. Both are recommended for benchmarking and for general production use. CPU affinity, on the other hand, can improve benchmark performance significantly, but it is not used here (nor is it recommended for production) because of its inflexibility.

We performed a number of experiments to study the performance of SPECjbb2005 running natively and on VMware Infrastructure 3. We first ran SPECjbb2005 natively in multi-JVM mode on a 2-socket machine equipped with 3 GHz Clovertown quad-core processors (8 cores total). The OS was Windows Server 2008 x64, booted with all 32 GB and 8 CPUs of the machine. Four JVMs were used, each with a 3700 MB Java heap. Details of the software and hardware configurations are given below. The reported throughput was 236,789 bops, which compares well with published native scores on similar hardware of around 252,000 bops (the difference is mostly accounted for by the use of CPU affinity for the published results). See, for example, http://www.spec.org/jbb2005/results/res2007q4/. We focused on out-of-the-box performance, so little effort was spent on tuning; however, reasonable performance is necessary to make the results credible.
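As a quick check on the comparison above, the gap between the measured native score and the published scores can be computed directly. This is only an illustrative sketch: the function name `percent_below` is made up for this example, and the published figure is the approximate "around 252,000 bops" quoted in the text rather than any specific submission.

```python
def percent_below(measured: float, reference: float) -> float:
    """Return how far `measured` falls below `reference`, in percent."""
    return (reference - measured) * 100.0 / reference


# Figures quoted in the text (bops = business operations per second).
native_4jvm_bops = 236_789
published_bops = 252_000  # approximate value from the text, not an exact result

gap = percent_below(native_4jvm_bops, published_bops)
print(f"Native run is about {gap:.1f}% below the published scores")
```

The gap comes out at roughly 6%, which is in line with the statement that most of the difference is explained by CPU affinity.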

For the virtualized case we ran conforming single-JVM tests in each of 4 identical VMs on ESX 3.5 U1. Each VM was booted with 2 vCPUs and 5 GB memory, but otherwise the software and virtual hardware configurations were the same as the native setup. As with the native multi-JVM case, subtests with 1 through 8 warehouses were run, and performance for warehouses 2 – 4 was averaged for each JVM/VM (as per the run rules). Unlike the multi-JVM case, the synchronization of the tests across the 4 VMs is never perfect; however, it was never worse than 5 seconds out of the 12-minute measurement interval (4 minutes per subtest). Moreover, lack of synchronization cannot give a performance advantage to any one VM, since all VMs continue to run (warehouses 5 – 8) after the measurement interval. In the first figure, we present the individual scores of the VMs and compare them to the individual scores of the native JVMs plus their average (reported throughput divided by the number of JVMs).
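The scoring described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own reporting code: the function names are invented here, and the per-warehouse throughputs in `example_vm` are hypothetical placeholders, not measured values. Only the 236,789 bops figure and the 4-JVM count come from the text.

```python
from statistics import mean
from typing import Mapping


def jvm_score(throughput_by_warehouse: Mapping[int, float],
              first: int = 2, last: int = 4) -> float:
    """Average the subtest throughputs for warehouses first..last inclusive."""
    return mean(throughput_by_warehouse[w] for w in range(first, last + 1))


def native_average(reported_bops: float, jvm_count: int) -> float:
    """Per-JVM average: reported throughput divided by the number of JVMs."""
    return reported_bops / jvm_count


# HYPOTHETICAL per-warehouse throughputs for one VM (illustration only).
example_vm = {1: 30_000, 2: 57_300, 3: 57_800, 4: 58_300,
              5: 56_000, 6: 55_500, 7: 55_000, 8: 54_500}

score = jvm_score(example_vm)            # only warehouses 2, 3 and 4 count
baseline = native_average(236_789, 4)    # figures from the text
slowdown = (baseline - score) * 100.0 / baseline
print(f"VM score {score:.0f} bops, {slowdown:.1f}% below the native per-JVM average")
```

Warehouses 1 and 5 through 8 are run but do not enter the score, which is why a few seconds of skew between VMs has little effect on the comparison.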

Each of the VMs is just 2.2 – 2.5% slower than the average of the native JVMs. Furthermore, the ESX scheduler automatically ensures that each VM gets essentially the same share of the computational resources; fairness is nearly perfect. The native scheduler has a hard time doing this with JVMs: the performance of individual JVMs ranges from 11% slower to 6% faster than the average. Using CPU affinity in the native case completely fixes the fairness issue and also increases performance by 5%.

So why not use CPU affinity all the time? While it is often helpful for benchmarks, it is usually impractical to use it effectively in real production systems. Instead of 4 VMs/JVMs, which fit very nicely on our 8-core machine, what if a different number is required for other reasons (isolation, etc.)? Then affinity makes no sense at all. A little “reality” is introduced here by repeating the above tests with 5 JVMs for Native and 5 VMs for ESX. No other changes were needed, except that expected_peak_warehouse had to be manually set to 2 for the native case. The reported throughput for Native was 233,073 bops, which is only a 1.6% drop from the 4 JVM test.

The VMs running on ESX are 1.3 – 3.4% slower than the average of the native JVMs. The fairness of ESX is excellent. The performance of the native JVMs ranges from 27% faster to 30% slower than the same average. This poor fairness means the performance of individual JVMs is unpredictable, even though the overall performance is good.
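One simple way to express the fairness figures quoted above is the range of per-instance deviations from a common reference, here the average of the native JVMs. The sketch below is an assumption about how such a spread could be computed, not the method used to produce the figures. The normalized scores are hypothetical values chosen only so that the output matches the ranges stated in the text; they are not the measured results.

```python
from statistics import mean
from typing import Sequence, Tuple


def spread_vs_reference(scores: Sequence[float],
                        reference: float) -> Tuple[float, float]:
    """Return (lowest, highest) deviation from `reference`, in percent."""
    deviations = [(s - reference) * 100.0 / reference for s in scores]
    return min(deviations), max(deviations)


# HYPOTHETICAL scores, normalized so that the native average is 100.
native_jvms = [127, 70, 98, 105, 100]
esx_vms = [98.7, 96.6, 97.5, 98.0, 97.2]

reference = mean(native_jvms)  # both sets are compared to the native average
for label, scores in (("Native", native_jvms), ("ESX", esx_vms)):
    low, high = spread_vs_reference(scores, reference)
    print(f"{label}: {low:+.1f}% to {high:+.1f}% relative to the native average")
```

A narrow range means each instance gets close to the same share of the machine; a wide range means individual instances are unpredictable even when the total throughput is good.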

These results show that ESX has very little overhead for this CPU-intensive Java application, in both fully-committed and over-committed scenarios. In addition, the ESX scheduler ensures excellent fairness across the VMs. This is important for predictability of the performance of individual VMs and allows more precise resource management.
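The terms "fully-committed" and "over-committed" can be made concrete with a small calculation. The sketch below uses the VM counts, vCPU counts, core count, and memory sizes given in this article; the function names are invented for illustration, and the memory figure ignores any hypervisor overhead, which is a simplifying assumption.

```python
def cpu_commitment(vm_count: int, vcpus_per_vm: int, physical_cores: int) -> float:
    """Ratio of configured vCPUs to physical cores (1.0 = fully committed)."""
    return vm_count * vcpus_per_vm / physical_cores


def memory_commitment(vm_count: int, gb_per_vm: float, host_gb: float) -> float:
    """Ratio of configured VM memory to host memory, ignoring overhead."""
    return vm_count * gb_per_vm / host_gb


# 8 physical cores and 32 GB on the host; 2 vCPUs and 5 GB per VM.
for vms in (4, 5):
    cpu = cpu_commitment(vms, 2, 8)
    mem = memory_commitment(vms, 5, 32)
    print(f"{vms} VMs: CPU commitment {cpu:.2f}, memory commitment {mem:.2f}")
```

With 4 VMs the vCPU count equals the core count (ratio 1.00), and with 5 VMs the CPUs are over-committed (ratio 1.25), while configured memory stays below the host total in both cases.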

Benchmark configuration

Application: SPECjbb2005 version 1.07
- 4 or 5 JVMs for Native, single JVM per VM for ESX
- expected_peak_warehouse = 2
- Subtests: 1 – 8 warehouses, average throughput for 2 – 4 warehouses
Java: BEA JRockit R27.5 64 bit for Windows
Java options:
- -Xmx3700m -Xms3700m -Xns3000m -XXaggressive -Xgc:genpar
- -XXgcthreads=2 -XXthroughputCompaction -XXlazyUnlocking
- -XXtlaSize:min=4k,preferred=512k -XXcallProfiling
OS: Windows Server 2008 x64 Enterprise Edition
Large pages used with ‘Lock pages in memory’ in Local Security Settings
Hardware: HP DL380 G5, 2 sockets, Xeon X5365 (3 GHz Clovertown), 32 GB
BIOS: Turned off ‘Hardware Prefetcher’ and ‘Adjacent Cache Line Prefetch’
Hypervisor: ESX Server 3.5.0 U1 build 90092
Virtual hardware: 2 CPUs, 5 GB memory