Why There's Still a Benchmarking Clause in Our EULA
We have a recurring discussion here at VMware about benchmarking that goes along these lines:
Executive A: It seems like most of the people writing about virtualization have figured out that performance benchmarks need some special treatment when used with hypervisors. It appears that our performance and benchmarking best practices guidelines are making an impact. They've been available for a while and we're not seeing as many articles with badly flawed tests as we used to. You know, the tests with bizarre results that come from errors like trying to measure network performance when the CPUs are saturated, or timing benchmark runs using the VM's system clock, or measuring a VM's disk I/O when everything is cached in host memory. Perhaps it's finally time to drop the clause in our EULA that requires VMware review of performance tests before publication.
Executive B: That would be great! We respond to every request for a benchmark review and we work with the submitters to improve their test processes, but VMware still gets criticized by competitors who claim we use that clause in our EULA to unfairly block publication of any ESX benchmarks that might not be favorable to VMware. Even vendors whose benchmarks have been approved by us complain that it's an unreasonable restriction. If we drop the clause, then maybe everyone will stop complaining and, since it seems people now understand how to benchmark a virtualization solution, we won't see as many botched tests and misleading results.
Executive A: OK, then it's agreed -- we'll drop the EULA benchmark clause in our next release.
And then something like this gets published, causing us to lose faith once again in the benchmarking wisdom of certain members of the virtualization community, and we're back to keeping the clause in our EULA.
Bad Benchmarking in Action
To summarize, the bad news was in a hypervisor benchmark published by Virtualization Review that showed ESX trailing the other guys in some tests and leading in others. It was a benchmark unlike any we'd seen before, and it left us scratching our heads because there were so few details and the results made no sense whatsoever. Of course, Microsoft didn't let the benchmark's flaws stop them from linking to the article and claiming it as proof that Hyper-V performs better than other hypervisors. As near as we can tell, the Virtualization Review test consisted of a bunch of VMs each running a PC burn-in test program, along with a database VM running a SQL Server script. To be fair to Virtualization Review, they had given us a heads-up some time ago that they would be running a test, and we gave them some initial cautions that weren't heeded, but we certainly never approved publication of the ESX test results. If we had had an opportunity to review the test plan and results, our performance experts would have had some long discussions with the author on a range of issues.
Take for instance the results of the third test in the series, as published in the article:
| Test 3 Component | Hyper-V | XenServer | VMware ESX |
|---|---|---|---|
| CPU Operations (millions) | 5000 | 3750 | 7080 |
| RAM Operations (millions) | 1080 | 1250 | 1250 |
| Disk Operations (millions) | 167 | 187 | 187 |
| SQL Server (m:ss) | 4:43 | 5:34 | 5:34 |
A cursory glance would suggest that one hypervisor demonstrated a performance win in this test. In fact, it is very difficult to draw any conclusions from these results. We at VMware noticed that the ESX numbers reported for CPU Operations were about 42% greater than for Hyper-V and about 89% greater than for XenServer. Is ESX really that good, and XenServer and Hyper-V really that bad? We'd like to take credit for a win, but not with this flawed test.
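The relative differences can be recomputed directly from the published table. The following is a minimal sketch that uses only the figures shown above; the names `TEST3`, `SQL_TIMES`, `percent_more`, and `to_seconds` are illustrative and do not come from the article or from any VMware tool.

```python
# Figures copied from the Test 3 table; nothing here is measured independently.
TEST3 = {
    "CPU Operations (millions)": {"Hyper-V": 5000, "XenServer": 3750, "VMware ESX": 7080},
    "RAM Operations (millions)": {"Hyper-V": 1080, "XenServer": 1250, "VMware ESX": 1250},
    "Disk Operations (millions)": {"Hyper-V": 167, "XenServer": 187, "VMware ESX": 187},
}
SQL_TIMES = {"Hyper-V": "4:43", "XenServer": "5:34", "VMware ESX": "5:34"}


def percent_more(value, baseline):
    """Return how much larger `value` is than `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def to_seconds(m_ss):
    """Convert an 'm:ss' string such as '4:43' to a number of seconds."""
    minutes, seconds = m_ss.split(":")
    return int(minutes) * 60 + int(seconds)


if __name__ == "__main__":
    # Operation counts: higher means more work completed by the burn-in VMs.
    for component, row in TEST3.items():
        esx = row["VMware ESX"]
        for other in ("Hyper-V", "XenServer"):
            delta = percent_more(esx, row[other])
            print(f"{component}: ESX vs {other}: {delta:+.1f}%")

    # SQL Server script: a longer elapsed time means a slower run.
    esx_seconds = to_seconds(SQL_TIMES["VMware ESX"])
    for other in ("Hyper-V", "XenServer"):
        delta = percent_more(esx_seconds, to_seconds(SQL_TIMES[other]))
        print(f"SQL Server elapsed time: ESX vs {other}: {delta:+.1f}%")
```

For the CPU row this works out to roughly 41.6% over Hyper-V and 88.8% over XenServer, while the ESX SQL Server run takes about 18% longer than the Hyper-V run. The arithmetic says nothing about why the numbers differ, which is the point of the analysis that follows.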
This configuration has a wide variety of problems. We found many of them during our inspection of the tests:
- The fact that ESX is completing so many more CPU, memory, and disk operations than Hyper-V obviously means that cycles were being used on those components as opposed to SQL Server. Which is the right place for the hypervisor to schedule resources? It’s not possible to tell from the scarce details in the results.
- All resource-intensive SQL Servers in virtual and native environments have large pages enabled. ESX supports this behavior but no other hypervisor does. This test didn’t use that key application and OS feature.
- The effects of data placement with respect to partition alignment were not planned for. VMware has documented the impact of this oversight to be very significant in some cases.
- The disk tests are based on Passmark’s load generation, which uses a test file in the guest operating system. But the placement of that file, and its alignment with respect to the disk system, was not controlled in this test.
- The SQL Server workload was custom built and has not been investigated, characterized, or understood by anyone in the industry. As a result, its sensitivity to memory, CPU, network, and storage changes is totally unknown, and the author has not documented it. There are plenty of industry standard benchmarks to use with hypervisors, and the days of ad hoc benchmark tests have passed. Virtual machines are fully capable of running the common benchmarks that users know and understand, like TPC, SPECweb, and SPECjbb. An even better test is VMmark, a well-rounded test of hypervisor performance that has been adopted by all major server vendors as the standard measurement of virtualization platforms, or the related SPECvirt benchmark under development by SPEC.
- With ESX’s highest recorded storage throughput already measured at over 100,000 IOPS on hundreds of disks, this experiment’s use of an undocumented, but presumably very small, number of spindles would obviously result in a storage system bottleneck. Yet storage performance results vary by tremendous amounts. Clearly there's an inconsistency in the configuration.
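The partition alignment concern raised in the list above can be illustrated with a little arithmetic. The sketch below is hypothetical: the 512-byte sector size, the 64 KiB array chunk size, and the starting sectors 63 and 128 are assumed example values, not details taken from the Virtualization Review setup. It counts how many sequential I/Os straddle a chunk boundary and therefore touch two chunks instead of one.

```python
SECTOR_BYTES = 512        # assumed logical sector size
CHUNK_BYTES = 64 * 1024   # assumed storage array chunk (stripe unit) size


def is_aligned(start_sector, boundary_bytes=CHUNK_BYTES, sector_bytes=SECTOR_BYTES):
    """True if the partition's first byte falls exactly on a chunk boundary."""
    return (start_sector * sector_bytes) % boundary_bytes == 0


def straddling_ios(start_sector, io_bytes, count,
                   chunk_bytes=CHUNK_BYTES, sector_bytes=SECTOR_BYTES):
    """Count how many of `count` back-to-back I/Os cross a chunk boundary."""
    base = start_sector * sector_bytes
    crossing = 0
    for i in range(count):
        first_byte = base + i * io_bytes
        last_byte = first_byte + io_bytes - 1
        # An I/O straddles a boundary when its first and last bytes
        # land in different chunks.
        if first_byte // chunk_bytes != last_byte // chunk_bytes:
            crossing += 1
    return crossing


if __name__ == "__main__":
    # Sector 63 is the classic legacy MBR starting offset; sector 128
    # is used here as an example of a 64 KiB-aligned start.
    for start in (63, 128):
        crossed = straddling_ios(start, io_bytes=64 * 1024, count=1000)
        print(f"start sector {start}: aligned={is_aligned(start)}, "
              f"I/Os crossing a chunk boundary: {crossed} of 1000")
```

Under these assumptions every 64 KiB I/O on the partition starting at sector 63 spans two chunks, while none do on the aligned partition. A test that leaves this uncontrolled can end up measuring the misalignment rather than the hypervisor.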
We're Not Against Benchmarking – We’re Only Against Bad Benchmarking
Benchmarking is a difficult process fraught with error and complexity at every turn. Those who analyze system performance need to understand what they’re doing, so that neither they nor their readers draw the wrong conclusions. For those who would like help from VMware, we invite you to obtain engineering assistance from firstname.lastname@example.org. And everyone can benefit from the recommendations in the Performance Best Practices and Benchmarking Guidelines paper. Certainly the writers at Virtualization Review can.
Postscript: Chris Wolf of Burton Group commented on virtualization benchmarks in his blog. He points out the need for vendor independent virtualization benchmarks as promised by programs like SPECvirt. I couldn't agree more. VMware got the ball rolling with VMmark, which is a public industry standard, and we're fully supporting development of SPECvirt.