

A Big Step Backwards for Virtualization Benchmarking

Why There's Still a Benchmarking Clause in Our EULA

We have a recurring discussion here at VMware about benchmarking that goes along these lines:

Executive A: It seems like most of the people writing about virtualization have figured out that performance benchmarks need some special treatment when used with hypervisors.  It appears that our performance and benchmarking best practices guidelines are making an impact.  They've been available for a while and we're not seeing as many articles with badly flawed tests as we used to.  You know, the tests with bizarre results that come from errors like trying to measure network performance when the CPUs are saturated, or timing benchmark runs using the VM's system clock, or measuring a VM's disk I/O when everything is cached in host memory.  Perhaps it's finally time to drop the clause in our EULA that requires VMware review of performance tests before publication.

Executive B: That would be great!  We respond to every request for a benchmark review and we work with the submitters to improve their test processes, but VMware still gets criticized by competitors who claim we use that clause in our EULA to unfairly block publication of any ESX benchmarks that might not be favorable to VMware.  Even vendors whose benchmarks have been approved by us complain that it's an unreasonable restriction.  If we drop the clause, then maybe everyone will stop complaining and, since it seems people now understand how to benchmark a virtualization solution, we won't see as many botched tests and misleading results.

Executive A: OK, then it's agreed — we'll drop the EULA benchmark clause in our next release.

And then something like this gets published, causing us to lose faith once again in the benchmarking wisdom of certain members of the virtualization community, and we're back to keeping the clause in our EULA.
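
To make the guest-clock pitfall from that exchange concrete, here is a minimal sketch of one way to time a run from outside the VM. Everything in it is hypothetical: the guest address, the workload script, and the use of GNU time inside the guest are placeholders, not details from any test discussed here. The point is simply that when the hypervisor deschedules a VM's virtual CPUs, the guest's own clock can drift or be corrected in jumps, so elapsed time should be measured by an observer whose clock the hypervisor cannot disturb.

import subprocess
import time

GUEST = "tester@vm-under-test.example.com"   # placeholder: any SSH-reachable guest
BENCH_CMD = "/opt/bench/run_workload.sh"     # placeholder: the workload script inside the guest

def timed_run():
    # Timestamps are taken on the client, outside the VM, with a monotonic clock.
    start = time.monotonic()
    # GNU time's "-f %e" prints the guest's own idea of elapsed seconds to stderr,
    # so the two measurements can be compared after the run.
    result = subprocess.run(
        ["ssh", GUEST, f"/usr/bin/time -f %e {BENCH_CMD}"],
        capture_output=True, text=True, check=True,
    )
    external_elapsed = time.monotonic() - start
    guest_elapsed = float(result.stderr.strip().splitlines()[-1])
    return external_elapsed, guest_elapsed

if __name__ == "__main__":
    ext, guest = timed_run()
    print(f"external clock: {ext:7.1f} s")
    print(f"guest clock:    {guest:7.1f} s  ({(guest - ext) / ext * 100:+.1f}% vs. external)")
    # A large discrepancy means the guest clock cannot be trusted to time the benchmark.

The SSH setup and teardown add a little overhead to the external number, which is one more reason to use runs long enough that such noise is negligible.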

Bad Benchmarking in Action

To summarize, the bad news was a hypervisor benchmark published by Virtualization Review that showed ESX trailing the other guys in some tests and leading in others.  It was a benchmark unlike any we'd seen before, and it left us scratching our heads because there were so few details and the results made no sense whatsoever. Of course, Microsoft didn't let the benchmark's flaws stop them from linking to the article and claiming it as proof that Hyper-V performs better than other hypervisors.  As near as we can tell, the Virtualization Review test consisted of a bunch of VMs each running a PC burn-in test program, along with a database VM running a SQL Server script.  To be fair to Virtualization Review, they had given us a heads-up some time ago that they would be running a test, and we gave them some initial cautions that weren't heeded, but we certainly never approved publication of the ESX test results.  Had we had an opportunity to review the test plan and results, our performance experts would have had some long discussions with the author on a range of issues.

Take for instance the results of the third test in the series, as published in the article:

Test 3 Component             Hyper-V   XenServer   VMware ESX
CPU Operations (millions)       5000        3750         7080
RAM Operations (millions)       1080        1250         1250
Disk Operations (millions)       167         187          187
SQL Server (m:ss)               4:43        5:34         5:34

A cursory glance would suggest that one hypervisor demonstrated a performance win in this test. In fact, it is very difficult to draw any conclusions from these results.  We at VMware noticed that the ESX numbers reported for CPU Operations were roughly 40% higher than Hyper-V's and 88% higher than XenServer's.  Is ESX really that good, and are XenServer and Hyper-V really that bad?  We'd like to take credit for a win, but not with this flawed test.

What’s happening here is that this configuration has a wide variety of problems – we found many of them during our inspection of the tests:

  • The fact that ESX is completing so many more CPU, memory, and disk operations than Hyper-V obviously means that cycles were being spent on those components rather than on SQL Server.  Which is the right place for the hypervisor to schedule resources?  It’s not possible to tell from the scarce details in the results.
  • Resource-intensive SQL Server deployments, whether virtual or native, are normally run with large pages enabled.  ESX supports this behavior, but no other hypervisor does.  This test didn’t use that key application and OS feature.
  • The effects of data placement with respect to partition alignment were not accounted for.  VMware has documented that the impact of this oversight can be very significant in some cases (a quick alignment check appears in the sketch after this list).
  • The disk tests are based on Passmark’s load generation, which uses a test file in the guest operating system.  But the placement of that file, and its alignment with respect to the disk system, were not controlled in this test.
  • The SQL Server workload was custom built and has not been investigated, characterized, or understood by anyone in the industry. As a result, its sensitivity to memory, CPU, network, and storage changes is totally unknown and not documented by the author.  There are plenty of industry-standard benchmarks to use with hypervisors, and the days of ad hoc benchmark tests have passed.  Virtual machines are fully capable of running the common benchmarks that users know and understand, like TPC, SPECweb, and SPECjbb.  Even better choices are VMmark, a well-rounded test of hypervisor performance that has been adopted by all major server vendors as the standard measure of virtualization platforms, and the related SPECvirt benchmark under development by SPEC.
  • With ESX’s highest recorded storage throughput already measured at over 100,000 IOPS across hundreds of disks, this experiment’s use of an undocumented, but presumably very small, number of spindles would obviously result in a storage system bottleneck. Yet the storage results reported here vary tremendously between hypervisors, which points to an inconsistency in the configuration (the spindle arithmetic is sketched below).
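
Two of the bullets above, the partition-alignment concern and the spindle-count arithmetic, lend themselves to quick illustration. The Python sketch below is ours alone and hypothetical: the 64 KiB boundary and the per-disk IOPS figure are common rules of thumb, not numbers taken from the article or from any VMware documentation.

def is_aligned(partition_offset_bytes: int, boundary_bytes: int = 64 * 1024) -> bool:
    """A partition that does not start on the array's stripe/chunk boundary forces
    some guest I/Os to straddle two stripes, roughly doubling back-end work.
    64 KiB is only a common example; the real boundary depends on the array."""
    return partition_offset_bytes % boundary_bytes == 0

def max_random_iops(spindles: int, iops_per_spindle: int = 175) -> int:
    """Rough ceiling for small random I/O: a 10K/15K RPM disk of that era sustains
    on the order of 150-200 IOPS, so the ceiling scales with spindle count."""
    return spindles * iops_per_spindle

if __name__ == "__main__":
    # Classic misalignment: older MBR tools started the first partition at sector 63
    # (63 * 512 = 32,256 bytes), which is not a multiple of 64 KiB.
    print(is_aligned(63 * 512))       # False -> misaligned
    print(is_aligned(2048 * 512))     # True  -> 1 MiB aligned, the modern default

    # With a handful of local disks the storage system saturates long before the
    # hypervisor does, so disk scores say little about the hypervisor itself.
    print(max_random_iops(spindles=4))      # roughly 700 IOPS ceiling
    print(max_random_iops(spindles=600))    # order of 100,000 IOPS takes hundreds of disks

Run against a real configuration, checks like these make it obvious when the storage system, rather than the hypervisor, is the component actually being measured.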

We're Not Against Benchmarking – We’re Only Against Bad Benchmarking

Benchmarking is a difficult process fraught with error and complexity at every turn. It’s important for those attempting to analyze the performance of systems to understand what they’re doing, to avoid drawing the wrong conclusions or allowing their readers to do so. For those who would like help from VMware, we invite you to request engineering assistance from benchmark@vmware.com. And everyone can benefit from the recommendations in the Performance Best Practices and Benchmarking Guidelines paper.  Certainly the writers at Virtualization Review can.

Postscript: Chris Wolf of Burton Group commented on virtualization benchmarks in his blog. He points out the need for vendor-independent virtualization benchmarks, as promised by programs like SPECvirt.  I couldn't agree more.  VMware got the ball rolling with VMmark, which is a public industry standard, and we're fully supporting development of SPECvirt.

16 thoughts on “A Big Step Backwards for Virtualization Benchmarking”

  1. Duncan

    I’m not a performance expert, but these figures didn’t make sense at all. I’m glad you pointed this out. I hope the author will contact you to get some hints/tips to improve his test.

  2. Architect

    Let’s look at the content of this response… “Well, they didn’t say start” is what I see. From an engineering perspective I would like to have more detail on the test, but reading an article and reading a book is probably where they had to draw the line. And as the writer stated, they did contact them, and I would be confused if, when they talked to them, they hadn’t asked about the testing method.

  3. Architect

    LOL. Give me a break. So you got a bad performance review and now you are trying to justify it. You sound like a little child in school who gets a bad grade on a test and tries to tell his parents that he did poorly because the dog ate his notes for the test. Quit your whining and grow up.

  4. .conne.

    are you kidding me?
    “oh we were going to drop the benchmarking restriction from the EULA but now that you have stomped our feet, we’ve decided not to.”
    how childish is that? does anyone actually buy this?

  5. szlevi

    “but we certainly never approved publication of the ESX test results.”
    What NONSENSE. A completely twisted, corporate, control-freak, manipulation-ridden state of mind…
    “We’re Not Against Benchmarking – We’re Only Against Bad Benchmarking”
    …you guys are truly PATHETIC, indeed.

  6. Eric Horschman

    I knew this post would get responses from those claiming we’d only object to benchmarks where VMware came out ahead. That’s why the test case I cited from this report as untrustworthy was the one that showed us having the best CPU, memory and disk scores. We don’t want credit if we win in a flawed test. Instead, we want benchmarkers to use standardized, fair and reproducible tests, even if we don’t come out ahead.

  7. Chuck Hollis

    Having read all the various posts, it’s clear to me that benchmarking hypervisors results in the same categorical problem as benchmarking databases, storage arrays, and high-end servers.
    We even use an acronym for this: YMMV — your mileage may vary.
    Four well-meaning teams can set out to benchmark a set of similar products, and most likely return with four divergent results.
    As a result, at EMC there’s only one benchmark we support: a customer’s real-world workloads. All others are simply asking for trouble. And we get all sorts of competitive heat for this unpopular stance.
    Worse, benchmarks tend to reinforce the incorrect thought that performance is the be-all and end-all of a product. Harmful to customers and the marketplace in general, IMHO.
    Even worse, any unfavorable benchmark associated with a market leader becomes the sole marketing campaign of weaker players in the market, which is what seems to be happening here.
    The lesser of all evils would be to publish how well VMware stands up against demanding real-world workloads, and continue to try and avoid benchmarks altogether.
    I feel your pain!
    – Chuck

  8. Tarry Singh

    Hang on, hang on. First, it’s good to go back to basics and understand why and how to benchmark.
    Also to understand what forms of virtualization we are benchmarking against.
    First things first:
    - I think it’s very good that VMware allows you all to post while you are being shamelessly “anonymous” and complaining about the comments.
    - I have benchmarked databases on various systems and various flavors of databases, and I understand very well how to even “cleverly” conceal results by hiding and exposing flags that will typically show results that might “seem” to tilt in the favor of a specific SAN vendor or database vendor. Why choose HP’s EVA over EMC, or NetApp over HP, or EMC over Sun? And then take this discussion to whether to pick Oracle over SQL Server or other flavors.
    Have I ever done that? NO. Simply because no matter what you show, the truth will come out, since all the stuff we cook and bake in this industry serves the needs of the consumer, the customers, which you and I, directly or indirectly, serve.
    - A vendor reserves the right to defend its benchmarks, especially in this wild and mad blog-frenzied, anonymous, traffic-hungry, slutty web culture (excuse my French).
    I’ve lost battles of VDI to TS purely for reasons that had nothing to do with benchmarks.
    Tarry

  9. Angelo

    Not that I agree with this test; I think a more in-depth analysis of how they ran it should be linked from the main article, and the numbers themselves show inconsistencies.
    But giving up on allowing perf tests because of this is really ridiculous.

  10. Paul

    So the reason Execs A and B retain the benchmark EULA clause is that WHILE IT’S IN, bad benchmarks appear?

  11. OMGboy

    VMware does not disapprove bad results! Only tests that are not optimised for their product! I’m sure there’s a difference in there somewhere.
    How can anyone say that ANY independent methodology is bad, as long as it’s applied equally to all? As a real system administrator with a real environment (read: outside of a test lab), I don’t have the luxury of “ideal” conditions. I want to read and know about how products work in all sorts of situations, not just “VMware approved” ones.
    VMware talks about unfair bottlenecks, but if that were true, wouldn’t the same bottlenecks apply to everyone and therefore produce the exact same results?
    While I agree that *THESE* results don’t really mean a lot, it’s the principle of it.
    Any truly independent study should *NOT* involve the approval of the parties being studied. Nor should any independent study be commissioned by the vendor. Because then it’s just not independent…
    OMGboy – Happy ESX customer (except for that part of the EULA)

  12. tarbour

    I hope VMware has learned something other than to be more arrogant and defensive. I appreciate the independent tests – VMware should concentrate on supporting additional hardware and improving performance, and spend a LOT LESS time crying about tests like these. Every time I have a little issue that could be addressed by VMware, rather than a real solution I get some lame answer as to why it works that way and why I should like it that way. This post just solidifies the arrogance of the company…

  13. MDF

    As much as I like XenServer, I noticed something flawed in that review right away.
    Correct me if I’m wrong, but ESX 3.5 doesn’t support the same CPU features as XenServer and HyperV. I think you need to use ESX 4.x or later to have a more apples-to-apples comparison.
    Is that right or am I mistaken?

  14. Aaron Toponce

    This is rather entertaining, actually. It seems to me that VMWare is more interested in sweeping some dirt under the rug than in being fully transparent. What are you trying to hide?
    Now that KVM and Xen are heating up the virtualization market space, why not let people publish benchmarks, bad or good? If VMWare is truly King, then what are you afraid of?
