Monthly Archives: February 2007

Studying NUMA with VMmark

I often collect data from the hardware performance counters when running VMmark. Studying low-level data provides insight into both the physical system and the virtualization layer. (To be honest, the latent academic in me is secretly hoping to make some crucial observation that will change the direction of computer architecture for years to come and lead to a tenured faculty position at a top university. I know, I know. Stop laughing. Even geeks need dreams.) In a consolidation scenario, customers will often virtualize a large number of small servers within a few large servers. Many of these servers are built with a non-uniform memory access (NUMA) system architecture. In a NUMA system, the processors are divided into sets, also known as nodes. Each node has direct, local access to a portion of the overall system memory and must communicate via a network interconnect to access the remaining, remote memory at other NUMA nodes. The memory interconnect adds latency to remote memory accesses, making them slower than local ones. Applications that heavily utilize the faster local memory tend to perform better than those that don’t. VMware ESX Server is fully NUMA-aware and exploits local memory to improve performance. (See http://www.vmware.com/pdf/esx2_NUMA.pdf for more details.)
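As a rough illustration of why locality matters, the effective memory latency a workload sees is simply a weighted average of the local and remote latencies. The sketch below uses made-up latency figures; they are assumptions for illustration, not measurements from any particular Opteron system:

```python
# Hypothetical sketch: effective memory latency on a NUMA system as a
# function of how many accesses stay local. The latency numbers below
# are illustrative assumptions, not measured values.

LOCAL_LATENCY_NS = 80    # assumed latency of a local memory access
REMOTE_LATENCY_NS = 130  # assumed latency of a remote access over the interconnect

def effective_latency(local_fraction):
    """Weighted-average memory latency for a given local-access fraction."""
    remote_fraction = 1.0 - local_fraction
    return local_fraction * LOCAL_LATENCY_NS + remote_fraction * REMOTE_LATENCY_NS

for frac in (1.0, 0.75, 0.25):
    print(f"{frac:.0%} local -> {effective_latency(frac):.0f} ns average")
```

The more of its accesses a VM can satisfy from its home node, the closer its average latency sits to the local figure, which is exactly what a NUMA-aware scheduler tries to arrange.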


I recently performed some NUMA characterizations using VMmark on an older HP DL585 with four 2.2 GHz dual-core Opterons. In the DL585, each dual-core processor is in its own NUMA node. I wanted to measure how heavily we stress the NUMA interconnect links, known as HyperTransport (HT) on the Opteron. I ran tests with one VMmark tile (6 VMs), two VMmark tiles (12 VMs), three VMmark tiles (18 VMs), and four VMmark tiles (24 VMs). The tests consumed 27%, 58%, 90%, and 100% of the system CPU resources, respectively. Figure 1 shows the average utilization of the HT NUMA links for each test during steady state. The most important result is that the HT utilization remains below 20% in all cases. This implies that we have a large amount of headroom in the memory subsystem, which can be used as processor speeds increase. More importantly, the transition to quad-core systems should also be smooth, especially since newer versions of the HT links should provide even better performance. The other interesting feature in Figure 1 is the drop in HT utilization between the 18 VM case and the 24 VM case. The ESX scheduler’s first choice is to keep a VM running on the same processor(s) that have fast local access to that VM’s memory. When running with 18 VMs, the system CPU is 90% busy, so there are opportunities for the ESX scheduler to move a VM away from its memory: on a busy system, running with slightly slower memory can be preferable to waiting a little longer for faster memory. When running 24 VMs, the scheduler has no surplus CPU resources, and the VMs are much more likely to remain queued on the processor(s) close to their allocated memory, which leads to fewer remote memory accesses and lower HT utilization.
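To get a feel for what sub-20% link utilization implies, here is a back-of-the-envelope sketch converting a remote-request rate into HT link utilization. Both the link bandwidth and the per-request transfer size below are illustrative assumptions, not the DL585's actual figures:

```python
# Back-of-the-envelope sketch: HyperTransport link utilization implied
# by a given rate of remote memory requests. The bandwidth and request
# size are assumptions chosen for illustration only.

HT_LINK_BW_BYTES = 4.0e9   # assumed ~4 GB/s per direction on an HT link
CACHE_LINE_BYTES = 64      # assumed cache-line-sized transfers

def ht_utilization(remote_requests_per_sec):
    """Fraction of assumed link bandwidth consumed by remote requests."""
    bytes_per_sec = remote_requests_per_sec * CACHE_LINE_BYTES
    return bytes_per_sec / HT_LINK_BW_BYTES

# e.g. ten million remote cache-line fills per second:
print(f"{ht_utilization(10e6):.1%}")
```

Even a fairly high remote-request rate translates into modest link utilization under these assumptions, which is consistent with the headroom observed in Figure 1.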


Figure 2 shows the average number of both local and remote memory requests per second. Here we see that the number of remote requests follows the same shape as the HT utilization curve in Figure 1 while the number of local requests climbs more than expected going from 90% CPU utilization at 18 VMs to 100% utilization at 24 VMs. This data demonstrates that the ESX scheduler is doing an excellent job of improving performance by taking advantage of the underlying memory architecture.


I then repeated the experiment with the DL585 configured in memory-interleave (non-NUMA) mode in order to quantify the benefits of using NUMA on this system. Figure 3 shows the steady-state HT utilization in this configuration. (I also replotted the NUMA results to make comparison simpler.) Overall, the curve is as expected: it shows substantially higher utilization in all cases and continues to rise with 24 VMs. Figure 4 shows that remote accesses now account for the bulk of memory accesses, the opposite of the NUMA case. The tests also consumed slightly more CPU resources than the NUMA configuration at each load level due to the higher average memory latencies caused by the high proportion of remote accesses. The average CPU utilization was 30%, 62%, 95%, and 100% with 6 VMs, 12 VMs, 18 VMs, and 24 VMs, respectively.
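The high remote fraction in interleave mode is what simple arithmetic predicts: with memory striped round-robin across N nodes and a roughly uniform access pattern, only 1/N of a node's accesses land in its own local memory. A minimal sketch for a four-node system like the DL585:

```python
# With interleaved memory striped across all nodes, a processor's
# expected share of local accesses is only 1/N (assuming a uniform
# access pattern), so the remote fraction is (N - 1)/N.

def expected_remote_fraction(num_nodes):
    """Expected fraction of remote accesses under memory interleaving."""
    return (num_nodes - 1) / num_nodes

print(f"{expected_remote_fraction(4):.0%}")  # 75% remote on a 4-node system
```

That expected 75% remote share matches the qualitative picture in Figure 4, where remote accesses dominate in the interleaved configuration.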


The performance differences between NUMA and interleaved memory can also be quantified by examining the overall VMmark benchmark scores shown in Figure 5. In each case the NUMA configuration achieves higher throughput than the interleaved configuration, and the performance difference grows as the load increases. Just before the fourth tile is added, the NUMA configuration has roughly 10% idle cycles versus roughly 5% in the interleaved case. As a result, the NUMA configuration sees a measurably larger throughput boost from the fourth tile.


All in all, I am quite pleased with the results. They tell us that we need not worry about overstressing NUMA systems even as vendors make quad-core processors ubiquitous. In fact, I would say that virtual environments are a great match for commodity NUMA-based multi-core systems due to the encapsulation of memory requests within a virtual machine, which creates a largely local access pattern and limits stress on the memory subsystem. Of equal importance, these results show that the ESX scheduler exploits these types of systems well, which is good to see given how much work I know our kernel team has put into it. This type of exercise is just another area where a robust, stable, and representative virtualization benchmark like VMmark can prove invaluable.

Dell Uses VMmark to Compare Servers

One of the many benefits of having a reliable and representative benchmark like VMmark is that it allows vendors to demonstrate and compare their platforms when running virtualized environments. This type of information can help customers evaluate various platforms and choose one that fits their needs. Dell recently published a study titled "Virtualization Performance of Dell PowerEdge Servers using the VMmark Benchmark." You can find it at: http://www.dell.com/downloads/global/solutions/poweredge_vmmark_final.pdf. I don’t want to steal their thunder, so check it out for yourself. I think you will agree that more of this type of information is needed. We’ll continue working to make sure it happens.

A Performance Comparison of Hypervisors

At VMworld last November I had the opportunity to talk to many ESX users and to discover for myself which performance issues were most on their minds. As it turned out, this endeavor was not very successful; everybody was generally happy with ESX performance. On the other hand, the performance and best-practice talks were among the most popular, indicating that users were very interested in learning new ways of getting the most out of ESX. VMworld was simply the wrong audience for reaching people who had concerns about performance: I was preaching to the choir instead of to the non-virtualized souls out there. At the same time, aggressive marketing by other virtualization companies was creating confusion about ESX performance. So we decided that we needed to make a better effort at clearing up misconceptions and providing real performance data, especially to enterprises just starting to consider their virtualization options.

A Performance Comparison of Hypervisors is the first fruit of this effort. In this paper we consider a variety of simple benchmarks running in a Windows guest on both ESX 3.0.1 and the open-source version of Xen 3.0.3. We chose Windows guests for this first paper since Windows is the most widely used OS on x86 systems, and we used open-source Xen 3.0.3 since it was the only Xen variant that supported Windows guests at the time we ran the tests. Everything was run on an IBM X3500 with two dual-core Intel Woodcrest processors. Xen used the hardware-assist capabilities of this processor (Intel VT) to run an unmodified guest, while ESX used VMware’s very mature binary translation technology. The results might not be what you expect from reading marketing material! Even for CPU and memory benchmarks dominated by direct execution, Xen shows significantly more overhead than ESX. The difference is bigger for a compilation workload, and huge for networking. The latter is due mostly to the lack of open-source paravirtualized (PV) device drivers for Windows; PV drivers are available in some commercial products based on Xen and should give much better performance. Xen was not able to run SPECjbb2005 at all, since SMP Windows guests were not supported at the time the tests were done. This support was added very recently in Xen 3.0.4; however, the commercial products are still based on Xen 3.0.3. ESX has had PV network drivers (vmxnet) and has been able to run SMP Windows guests for years.

We are currently exploring the many dimensions of the performance matrix: 64-bit, Linux guests, AMD processors, more complex benchmarks, and so on. Results will be posted to VMTN as they are obtained. Readers are encouraged to perform their own tests and measure the performance for themselves.

Please give us your feedback on this paper and the usefulness to you of this kind of work in general. And if ESX fans find this paper informative, so much the better!