Java workloads are becoming increasingly commonplace in the datacenter, and consequently Java benchmarks are among the most frequently run server performance tests. In virtualized environments, single-VM Java performance tests are common, but for several reasons they are not particularly relevant to production systems. Chief among these is that system administrators nearly always run multiple VMs to make better use of their multi-core systems; benchmarking should reflect this. While ESX performs very well on single-VM tests, we show here that ESX also achieves very close to native performance for configurations that are both realistic and well tuned.
There are several subtleties in comparing Java performance on native and virtualized platforms:
- Java applications are often split into multiple Java Virtual Machines (JVMs). In particular, nearly all the published SPECjbb2005 (the most widely-used Java benchmark) tests are run this way, usually with 2 CPU cores per JVM. Since this is done in order to achieve best performance, the appropriate virtualized analog is to run each JVM in a separate virtual machine (VM), with 2 virtual CPUs (vCPUs) per VM. This also allows all the advantages of virtualization to be gained (scheduling flexibility, resource allocation, high availability, etc.).
- A native machine is often booted with a reduced number of CPUs in order to compare with a small VM. However, such comparisons are usually misleading, either because the different cases exercise different subsets of the hardware or simply because fewer resources are used than would be normal in production. It is better to fully utilize all CPUs in all tests.
- Turning off cache line prefetching in the BIOS and enabling large pages in the OS typically help Java application performance greatly. These are recommended for both benchmarking and general production use. On the other hand, CPU affinity can improve benchmark performance significantly but is not used here (nor is it recommended for production) due to its inflexibility.
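For the large-pages part of that advice, on Windows this means granting the account that launches the JVM the 'Lock pages in memory' right (the same setting listed in the benchmark configuration below). A rough sketch of where that lives; whether the JVM then actually uses large pages depends on the particular JVM and its options, so treat this as an assumption to verify for your setup:

    secpol.msc -> Local Policies -> User Rights Assignment -> Lock pages in memory
    (add the account that runs the JVM, then log off and back on for the right to take effect)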
We performed a number of experiments to study the performance of SPECjbb2005 running natively and on VMware Infrastructure 3. We first ran SPECjbb2005 natively in multi-JVM mode on a 2-socket machine equipped with 3 GHz quad-core Clovertown processors (8 cores total). The OS was Windows Server 2008 x64, booted with all 32 GB of memory and all 8 CPUs of the machine. Four JVMs were used, each with a 3700 MB Java heap. Details of the software and hardware configurations are given below. The reported throughput was 236,789 bops, which compares well with published native scores of around 252,000 bops on similar hardware (the difference is mostly accounted for by the use of CPU affinity in the published results). See, for example, http://www.spec.org/jbb2005/results/res2007q4/. We focused on out-of-the-box performance, so little effort was spent on tuning; however, reasonable performance is necessary for the results to be credible.
For the virtualized case we ran conforming single-JVM tests in each of 4 identical VMs on ESX 3.5 U1. Each VM was booted with 2 vCPUs and 5 GB of memory, but otherwise the software and virtual hardware configurations were the same as in the native setup. As in the native multi-JVM case, subtests with 1 through 8 warehouses were run and the performance for warehouses 2 – 4 was averaged for each JVM/VM (as per the run rules). Unlike the multi-JVM case, the synchronization of the tests across the 4 VMs is never perfect; however, it was never off by more than 5 seconds out of the 12-minute measurement interval (4 minutes per subtest). In any case, the lack of synchronization cannot give any one VM a performance advantage, since all VMs continue to run (warehouses 5 – 8) after the measurement interval. In the first figure, we present the individual scores of the VMs and compare them to the individual scores of the native JVMs, plus their average (the reported throughput divided by the number of JVMs).
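To make the comparison concrete, that native per-JVM average works out to 236,789 bops / 4 JVMs ≈ 59,197 bops; each VM's individual score in the figure is measured against this number.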
Each of the VMs is just 2.2 – 2.5% slower than the average of the native JVMs. Furthermore, the ESX scheduler automatically ensures that each VM gets essentially the same share of the computational resources; fairness is nearly perfect. The native scheduler has a much harder time doing this with JVMs: the performance of individual JVMs ranges from 11% slower to 6% faster than the average. Using CPU affinity in the native case completely fixes the fairness issue and also increases performance by 5%.
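As a sketch of the kind of affinity experiment described above (the exact mechanism used for the affinity runs may differ, and we don't recommend this for production), each native JVM can be pinned to its own pair of cores on Windows with the start command's affinity mask. The hex masks 0x03, 0x0C, 0x30, and 0xC0 select core pairs 0–1, 2–3, 4–5, and 6–7, and <java command for JVM n> is a placeholder for the actual JVM launch line (sketched after the configuration section below):

    start /affinity 0x03 <java command for JVM 1>
    start /affinity 0x0C <java command for JVM 2>
    start /affinity 0x30 <java command for JVM 3>
    start /affinity 0xC0 <java command for JVM 4>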
So why not use CPU affinity all the time? While it is often helpful for benchmarks, it is usually impractical to use effectively in real production systems. Instead of 4 VMs/JVMs, which fit very nicely on our 8-core machine, what if a different number is required for other reasons (isolation, etc.)? Then affinity makes no sense at all. A little “reality” is introduced here by repeating the above tests with 5 JVMs for Native and 5 VMs for ESX. No other changes were needed, except that expected_peak_warehouse had to be manually set to 2 for the native case. The reported throughput for Native was 233,073 bops, only a 1.6% drop from the 4-JVM test.
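For readers trying to reproduce the warehouse settings, they live in the benchmark's SPECjbb.props file. A minimal sketch for the native runs, assuming the standard SPECjbb2005 property names (verify them against your own copy of the file):

    # warehouse range and expected peak; the score is averaged over warehouses 2-4
    input.starting_number_warehouses=1
    input.increment_number_warehouses=1
    input.ending_number_warehouses=8
    input.expected_peak_warehouse=2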
The VMs running on ESX are 1.3 – 3.4% slower than the average of the native JVMs. The fairness of ESX is excellent. The performance of the native JVMs ranges from 27% faster to 30% slower than the same average. This poor fairness means the performance of individual JVMs is unpredictable, even though the overall performance is good.
These results show that ESX has very little overhead for this CPU-intensive Java application, in both fully-committed and over-committed scenarios. In addition, the ESX scheduler ensures excellent fairness across the VMs. This is important for predictability of the performance of individual VMs and allows more precise resource management.
Benchmark configuration
- Application: SPECjbb2005 version 1.07
- 4 or 5 JVMs for Native, single JVM per VM for ESX
- expected_peak_warehouse = 2
- Subtests: 1 – 8 warehouses, average throughput for 2 – 4 warehouses
- Java: BEA JRockit R27.5 64 bit for Windows
- Java options (assembled into a full launch command after this list):
- -Xmx3700m -Xms3700m -Xns3000m -XXaggressive -Xgc:genpar
- -XXgcthreads=2 -XXthroughputCompaction -XXlazyUnlocking
- -XXtlaSize:min=4k,preferred=512k -XXcallProfiling
- OS: Windows Server 2008 x64 Enterprise Edition
- Large pages used with ‘Lock pages in memory’ in Local Security Settings
- Hardware: HP DL380 G5, 2 sockets, Xeon X5365 (3 GHz Clovertown), 32 GB
- BIOS: Turned off ‘Hardware Prefetcher’ and ‘Adjacent Cache Line Prefetch’
- Hypervisor: ESX Server 3.5.0 U1 build 90092
- Virtual hardware: 2 CPUs, 5 GB memory
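Putting the options above together, each JVM instance (one per VM on ESX, or one of the 4 – 5 native JVMs) would be launched with something along these lines. This is a sketch patterned on the benchmark's stock run scripts: the class path, the spec.jbb.JBBmain entry point, and the -propfile argument are assumptions to adjust for your installation, and multi-JVM runs pass additional per-instance arguments as described in the SPECjbb2005 documentation:

    java -Xmx3700m -Xms3700m -Xns3000m -XXaggressive -Xgc:genpar ^
         -XXgcthreads=2 -XXthroughputCompaction -XXlazyUnlocking ^
         -XXtlaSize:min=4k,preferred=512k -XXcallProfiling ^
         -cp jbb.jar;check.jar spec.jbb.JBBmain -propfile SPECjbb.props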
Where do you do this? “Turning off cache line prefetching in the BIOS”
During boot, hitting F9 gets you into the BIOS menu on the HP machines I’m familiar with. Go to Advanced Options, then Processor Options, and turn off “HW Prefetcher” and “Adjacent Sector Prefetch”.
Also, I left out the SPEC attribution:
SPEC® and the benchmark name SPECjbb® are registered trademarks of the Standard Performance Evaluation Corporation.
Any chance you kind folks are thinking about running this same test with 64-bit Linux for response comparison across virtualized platforms? I’d be very interested in the results if so.
Thanks,
Jim
Thank you very much for the follow-up!
How much improvement was due to the BIOS settings, and did you try JVM performance on hosts with mixed workloads?
Thanks!
Yehuda
Jim,
No immediate plans to publish 64-bit Linux results. I don’t know any reason why the results should be significantly different.
Yehuda,
I’ve seen up to an 18% improvement for this workload from turning off prefetching, presumably due to reduced L2 cache misses. For mixed workloads, take a look at the VMmark results (http://www.vmware.com/products/vmmark/). The problem with mixed workloads in the context of this blog is that there’s no good native analog.
I am not sure I understand the setup from the Benchmark configuration specs. Is this comparing a physical HP DL380 running Windows Server 2008 x64 Enterprise Edition against an ESX server running on the same hardware? Does this mean you’re running JVMs and no intermediate OS?
Please excuse my ignorance. If this is the case, I would also love to see a similar comparison against a tuned native setup. Otherwise, one could hypothesize that an untuned native setup is what drags the native results down, rather than the ESX/JVM combination being especially performant.
Unless I totally missed something, a tuned native run would serve as a control in this experiment, since sandbagging the native setup would improve the perceived results.
Brent,
Yes, the same machine runs Windows Server 2008 natively in one set of tests, and the same OS runs inside each VM on ESX in the other. The application sees the same JVM and OS in both cases. I’m not sure what you mean by “intermediate OS”.
Every effort was made to avoid sandbagging the native results. CPU affinity was not used for the reported results because it makes no sense in the 5-JVM/VM cases, and the 4-JVM/VM runs were also done without affinity for consistency. When affinity is used in the native 4-JVM case, we get about 99% of the published performance, which indicates the other parameters are well tuned (if not quite perfect).
Is turning off ACL Prefetch and Hardware prefetch recommended for vSphere as well?
Chris,
Yes. Prefetch has very little interaction with virtualization. It’s really an application-level optimization. Turning off prefetch is important for SPECjbb but other apps may perform better with it turned on.
Any chance you could give us more information about the prefetching settings? Did you do a test with the prefetch settings on?
bolsen,
There are two prefetch settings in the BIOS. One prefetches the next cache line (assumes memory is accessed sequentially). The other uses a more complex algorithm to predict which cache line will be needed next. Both algorithms fail for SPECjbb since it accesses memory mostly randomly. I haven’t tested the performance effect recently, but enabling both prefetch settings will decrease performance on the order of 10%.
Jeff,
Thanks for the information. I have a few more questions, but they should probably be asked offline (I don’t want to hijack this post). Can you send me an email?
Why do we need an OS?
Why can’t VMware team up with someone, or do it themselves, to write a Java VM that runs directly on the VMware infrastructure?
Wouldn’t this provide obscene flexibility to run Java on anything?
That’s a bit off-topic, but do a search for LiquidVM. Not sure of the status of that project since Oracle bought BEA. Yes, in general you can do away with the OS.