Memory overcommit, aka the ability of VMware ESX to provision memory to virtual machines in excess of what is physically available on a host, has been a topic of discussion in virtualization blogs for quite some time (e.g., “More on Memory Overcommitment”) and apparently still is (e.g., “VMware vs. Microsoft: Why Memory Overcommitment is Useful in Production and Why Microsoft Denies it” and “Microsoft responds to VMware’s ability to overcommit memory”).
Given the benefits of memory overcommit and the fact that today only VMware ESX/ESXi has it as a standard feature, it is understandable that other vendors try to downplay it by claiming that it is irrelevant, dangerous and not used in production environments. Microsoft’s position on the topic is particularly interesting... or confusing, I should say. On the one hand, in an interview Bob Muglia, Microsoft VP, confirmed the usefulness of memory overcommit, announcing plans to add it to their hypervisor some time in the future (have you heard this line before?). On the other hand, they don’t miss an opportunity to speak against it. James O’Neill, also from Microsoft, even challenged us in his blog to provide a reference of a customer that is actually using it in production, promising in return to make a charitable donation of $270 to an organization of our choice.
Anyway, internally at VMware we certainly have no doubts about the importance and effectiveness of memory overcommit, but we felt that after all this discussion among vendors, and after all the confusion from Microsoft as to whether it is/isn’t important and is/isn’t on the Hyper-V roadmap, it might be more interesting for you to hear directly from our customers. Therefore, the bulk of this post will document a survey of memory overcommit usage among VMware customers. You’ll hear directly from VMware users regarding how they leverage ESX memory overcommit in their production datacenters, with no impact to performance, to increase VM density and further reduce VMware’s already low cost per application – the most relevant metric of virtualization TCO.
(Side Note: I bet MSFT will no longer question the value of overcommit once they are finally able to list it as an upcoming Hyper-V feature.)
Before jumping into the survey results, I think a few clarifications are necessary.
What is memory overcommit?
Here I won’t go into all the granular, technical details of memory overcommit, because there is already a ton of great literature available that explains what it is and how it works (e.g., “The Role of Memory in VMware ESX Server 3”).
However, there are a couple of points that I’d like to make regarding the functionality of and requirements for Memory Overcommit.
Memory Overcommit: Required Components
Memory overcommit is the combination of three key ingredients:
- Transparent memory page sharing
- Balloon driver
- Optimized algorithms in the hypervisor kernel
These three elements must all be present and work together seamlessly. One alone is not enough, unlike what some vendors would like people to believe (see “Ballooning is more than enough to do memory overcommit on Xen, Oracle says”). To date, only VMware ESX has all the necessary components, has had them since 2001, and has continued to improve them ever since.
Memory Overcommit: Security Impact
Transparent memory page sharing de-duplicates memory by letting VMs share identical pages. In doing so, it marks the shared pages “read-only” at the physical RAM level. If a VM tries to write to a shared page, ESX gets a callback and creates a private copy of the page for that VM, while the other VMs continue to use the original shared page. Because shared pages are read-only, one VM cannot affect any other VM, which is what makes the technology secure. However, if you need additional assurance of Memory Overcommit’s security, you should keep in mind that VMware ESX, with its Memory Page Sharing feature, is the only hypervisor in the market that has earned a Common Criteria Evaluation Assurance Level 4 (EAL4+) under the CSEC Common Criteria Evaluation and Certification Scheme (CCS). Therefore, only VMware ESX is approved for use in “sensitive, government computing environments that demand the strictest security.”
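The share-then-copy-on-write behavior described above can be sketched in a few lines of Python. This is a toy model for illustration only, not ESX code: the class `SharedPageStore`, its methods, and the VM names are all hypothetical, and a dictionary keyed by page content stands in for whatever lookup the hypervisor actually uses to find identical pages.

```python
class SharedPageStore:
    """Toy model of page sharing with copy-on-write (hypothetical, not ESX code)."""

    def __init__(self):
        self.frames = {}       # frame id -> page content (bytes)
        self.refcount = {}     # frame id -> number of guest pages mapped to it
        self.by_content = {}   # page content -> frame id available for sharing
        self.mappings = {}     # (vm name, guest page number) -> frame id
        self._next_id = 0

    def _alloc(self, content):
        fid = self._next_id
        self._next_id += 1
        self.frames[fid] = content
        self.refcount[fid] = 1
        return fid

    def map_page(self, vm, page, content):
        """Back a new guest page, sharing a frame if identical content exists."""
        fid = self.by_content.get(content)
        if fid is None:
            fid = self._alloc(content)
            self.by_content[content] = fid
        else:
            self.refcount[fid] += 1
        self.mappings[(vm, page)] = fid

    def write(self, vm, page, content):
        """Handle a guest write, copying first if the frame is shared."""
        fid = self.mappings[(vm, page)]
        if self.refcount[fid] > 1:
            # Shared frames are read-only: the writer gets a private copy
            # and every other VM keeps the original frame untouched.
            self.refcount[fid] -= 1
            fid = self._alloc(content)
            self.mappings[(vm, page)] = fid
        else:
            # Sole owner: update in place and drop the stale sharing entry.
            if self.by_content.get(self.frames[fid]) == fid:
                del self.by_content[self.frames[fid]]
            self.frames[fid] = content

    def read(self, vm, page):
        return self.frames[self.mappings[(vm, page)]]


store = SharedPageStore()
zero = bytes(4096)                       # an all-zero 4KB page
store.map_page("vm-a", 0, zero)
store.map_page("vm-b", 0, zero)
frames_before = len(store.frames)        # one frame backs both guest pages
store.write("vm-a", 0, b"\x01" * 4096)   # vm-a writes; vm-b is unaffected
frames_after = len(store.frames)         # vm-a now has its own private frame
```

The point of the sketch is the isolation property: after `vm-a` writes, `vm-b` still reads the original content, because the write went to a private copy rather than to the shared frame.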
Why is memory overcommit important?
Memory overcommit enables customers to achieve higher VM density per virtual host, increasing consolidation ratios and providing a more efficient scale up - scale out model. Ultimately this translates into substantial savings and a lower cost per application than with alternative solutions, as Eric Horschman shows in his blog post.
While the declining cost of memory could suggest that hypervisors can get away without memory overcommit, in reality throwing more memory at the problem is not a sustainable solution for a few reasons:
- The number of VMs deployed grows over time.
- Going forward, systems will be even more memory constrained than today, as the number of CPU cores per server will increase considerably faster than memory capacity. As a matter of fact, in 2011 a two-socket system is expected to be capable of 64 logical CPUs and 256GB of RAM, whereas today the same system is probably capable of 8 logical CPUs and 64GB of RAM. This means that the ability of a hypervisor to efficiently manage memory will be an even more critical factor in minimizing the number of servers required to run applications and in ensuring efficient scalability.
- Memory capacity requirements aren’t determined only by application workloads, but also by a number of valuable IT services, such as high availability, zero downtime system maintenance, power management and rapid system provisioning. Virtualization solutions that don’t allow memory overcommit corner customers into a lose-lose situation: either reduce system utilization or don’t provide the service. Thanks to memory overcommit, our customers tell us that they were able to reduce their dependence on available physical resources, avoid unnecessary purchases, and improve infrastructure utilization. (See below for a few examples of how VMware customers use memory overcommit.)
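To put numbers on the core-versus-memory projection above, here is a quick back-of-the-envelope calculation. It uses only the figures quoted in the text (8 logical CPUs and 64GB today, 64 logical CPUs and 256GB projected); the function and variable names are just illustrative.

```python
def gb_per_logical_cpu(ram_gb, logical_cpus):
    """RAM available per logical CPU on a host."""
    return ram_gb / logical_cpus


# Two-socket system figures quoted in the text.
today_gb_per_cpu = gb_per_logical_cpu(ram_gb=64, logical_cpus=8)        # 8.0
projected_gb_per_cpu = gb_per_logical_cpu(ram_gb=256, logical_cpus=64)  # 4.0

cpu_growth = 64 / 8     # logical CPUs grow 8x
ram_growth = 256 / 64   # RAM grows only 4x

print(f"Today:     {today_gb_per_cpu:.0f}GB per logical CPU")
print(f"Projected: {projected_gb_per_cpu:.0f}GB per logical CPU")
print(f"CPU count grows {cpu_growth:.0f}x, RAM grows {ram_growth:.0f}x")
```

In other words, if those projections hold, the memory available per logical CPU is cut in half even though total RAM quadruples.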
Enough with the clarifications - let’s move on to the customer survey.
We conducted an online survey of 110 VMware customers essentially asking them three questions:
- Do you use memory overcommit?
- Do you use memory overcommit in test/dev, production or both?
- What is your virtual-to-physical memory ratio per ESX host (i.e., overcommit ratio)?
Here are the results:
1) 57% answered that they are using memory overcommit
...so much for “nobody uses it”
2) Of the 57% who answered yes, 87% said they use it in both production and test/dev, 2% only in production, and 11% only in test/dev
...so much for “nobody uses it in production”
3) Finally, plotting the virtual-to-physical ratios on a chart, we can see what usage looks like. Virtual-to-physical memory ratios ranged from 1.14 to 3 (average 1.8, median 1.75). 75% of the respondents use memory overcommit ratios of 1.5 or higher, and 37% use a ratio of 2 or higher.
...so much for “memory overcommit ratios must be low”
What the chart can’t show is that, based on our findings, companies at the low end of the memory overcommit usage spectrum tend to be recent VMware customers, while those at the high end tend to be long-standing VMware customers. This looks very similar to what we have seen happen with other VMware technologies such as VMotion: once people try it and see how well it works, they want to extract its full potential.
I believe this data clearly demonstrates that VMware customers use memory overcommit in production systems and do so with high virtual-to-physical ratios.
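To make the virtual-to-physical ratio concrete, here is a small worked example. It borrows the figures that WTC Communications reports in their quote later in this post (3 hosts with 8GB of RAM each, a normal ratio of 1.25, one host emptied for maintenance). The function names are illustrative, and the calculation assumes the ratio is measured across the whole cluster and that the amount of memory provisioned to VMs does not change during maintenance.

```python
def overcommit_ratio(virtual_gb, physical_gb):
    """Virtual-to-physical memory ratio (above 1.0 means memory is overcommitted)."""
    return virtual_gb / physical_gb


hosts = 3
ram_per_host_gb = 8
normal_ratio = 1.25

# Total memory provisioned to VMs, derived from the normal ratio.
provisioned_gb = normal_ratio * hosts * ram_per_host_gb   # 30.0

# Assumption: one host is evacuated with VMotion, so the same VMs
# run on the physical memory of the remaining two hosts.
maintenance_ratio = overcommit_ratio(
    provisioned_gb, (hosts - 1) * ram_per_host_gb
)

print(f"Provisioned VM memory: {provisioned_gb}GB")
print(f"Ratio during maintenance: {maintenance_ratio}")
```

Under those assumptions the ratio during maintenance comes out at 1.875, which matches the 1.88 quoted by the customer.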
Finally, here is what a few customers who use memory overcommit in production have to say about it:
Kadlec Medical Center - Large 188-bed hospital in southern Washington State with over 270 medical staff members and over 10,000 annual patient admissions.
“Memory overcommit is one of the unique and powerful features of VMware ESX that we leverage every day in our production environments. Thanks to memory overcommit, we were able to increase the consolidation of our production environment by over 50%, maximizing utilization without giving up on the performance of our production systems. We appreciate that VMware makes it available to customers as a standard feature of ESX.” – Tim Harper, Sr. System Analyst, Kadlec Medical Center
WTC Communications - regional phone, cable, Internet provider in Kansas
“A small business like ours derived tremendous benefits from the ability of VMware ESX to overcommit memory. We cannot afford the big IT budget of a large enterprise, so we must get the most out of our production servers while guaranteeing SLAs with our customers. This is exactly what VMware ESX memory overcommit allowed us to achieve. We were able to consolidate 35 production virtual machines (both Linux and Windows) on just 3 Dell PowerEdge 2850 servers with 8GB RAM each. Typically we run our production servers at an average ratio of 1.25 virtual-to-physical memory; however, during maintenance operations the ratio increases to 1.88 as we VMotion VMs out of the host that undergoes maintenance, completely transparently to the users. Memory overcommit adds unparalleled flexibility to our infrastructure and saves us a lot of money, not just by allowing higher consolidation, but also by eliminating the need for spare capacity to perform routine maintenance operations. Memory overcommit is a fully automated feature of ESX and it is extremely simple to use. It is really a no-brainer.” – Jim Jones, Network Administrator, WTC Communications
U.S. Department of Energy - Savannah River
“Our virtualization effort began 4 years ago, and we have made great strides in server utilization since then. After upgrading to VI3, we took advantage of VMware memory overcommit. We now routinely overcommit memory at a 2:1 ratio in our production environments and have even reached 3:1 on occasion. We even run large applications such as Lotus Domino and SQL Server 2008 in VMs, but this has not been an issue - no performance impact. As a result, we fully trust VMware memory overcommit in our production environments. Our IT budget is tight, so in the past we have had to wait over 6 months to receive a new server. By using memory overcommit, we can now deploy a system in less than 30 minutes without waiting for a new server. This keeps our internal customers very happy.” – Joseph Collins, Senior Systems Engineer, U.S. Department of Energy – Savannah River