Studying NUMA with VMmark

I often collect data from the hardware performance counters when running VMmark. Studying low-level data provides insight into both the physical system and the virtualization layer. (To be honest, the latent academic in me is secretly hoping to make some crucial observation that will change the direction of computer architecture for years to come and lead to a tenured faculty position at a top university. I know, I know. Stop laughing. Even geeks need dreams.) In a consolidation scenario, customers will often virtualize a large number of small servers within a few large servers. Many of these servers are built with a non-uniform memory access (NUMA) system architecture. In a NUMA system, the processors are divided into sets, also known as nodes. Each node has direct, local access to a portion of the overall system memory and must communicate via a network interconnect to access the remaining, remote memory at other NUMA nodes. The memory interconnect adds latency to remote memory accesses, making them slower than local ones. Applications that heavily utilize the faster local memory tend to perform better than those that don’t. VMware ESX Server is fully NUMA-aware and exploits local memory to improve performance. (See http://www.vmware.com/pdf/esx2_NUMA.pdf for more details.)
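To make the local-vs-remote tradeoff concrete, here is a toy Python model of average memory latency as a function of how many accesses hit the local node. The latency numbers are invented placeholders for illustration, not measurements from any particular system.

```python
# Toy model: average memory latency on a NUMA system.
# The latencies below are illustrative placeholders, not measured values.
LOCAL_NS = 80    # hypothetical local-node access latency (ns)
REMOTE_NS = 140  # hypothetical remote-node access latency (ns)

def avg_latency_ns(local_fraction):
    """Blend local and remote latency by the fraction of accesses served locally."""
    assert 0.0 <= local_fraction <= 1.0
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * REMOTE_NS

# A mostly-local access pattern (what a NUMA-aware scheduler aims for)
# versus a mostly-remote one:
print(avg_latency_ns(0.75))  # 95.0 ns on average
print(avg_latency_ns(0.25))  # 125.0 ns on average
```

The exact numbers are beside the point; what matters is that average latency scales linearly with the remote fraction, which is why a scheduler that keeps VMs near their memory pays off.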


I recently performed some NUMA characterizations using VMmark on an older HP DL585 with four 2.2 GHz dual-core Opterons. In the DL585, each dual-core processor forms its own NUMA node. I wanted to measure how heavily we stress the NUMA interconnect links, known as HyperTransport (HT) on the Opteron. I ran tests with one VMmark tile (6 VMs), two tiles (12 VMs), three tiles (18 VMs), and four tiles (24 VMs). The tests consumed 27%, 58%, 90%, and 100% of the system CPU resources, respectively. Figure 1 shows the average utilization of the HT NUMA links for each test during steady state. The most important result is that HT utilization remains below 20% in all cases. This implies that we have a large amount of headroom in the memory subsystem, which can be used as processor speeds increase. More importantly, the transition to quad-core systems should also be smooth, especially since newer versions of the HT links should provide even better performance. The other interesting feature in Figure 1 is the drop in HT utilization between the 18 VM case and the 24 VM case. The ESX scheduler's first choice is to keep a VM running on the processor(s) with fast local access to that VM's memory. With 18 VMs, the system CPU is only 90% busy, so the ESX scheduler sometimes has the opportunity to move a VM away from its memory: on a busy system, running with slightly slower remote memory can be preferable to waiting longer for a processor near the faster local memory. With 24 VMs, the scheduler has no surplus CPU resources, so the VMs are much more likely to remain queued on the processor(s) close to their allocated memory, which leads to fewer remote memory accesses and lower HT utilization.
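The scheduler's tradeoff described above can be sketched as a toy decision rule. To be clear, this is not the actual ESX scheduling algorithm, just an illustration of the locality-versus-queueing reasoning, with invented cost units.

```python
# Toy decision rule illustrating the locality-vs-queueing tradeoff a
# NUMA-aware scheduler faces. NOT the actual ESX algorithm; the cost
# units are invented for illustration.

def place_vm(home_node_busy, expected_queue_delay, remote_penalty):
    """Decide where a toy scheduler would run a VM.

    home_node_busy:       True if the cores near the VM's memory are occupied.
    expected_queue_delay: cost of waiting for a core near the VM's memory.
    remote_penalty:       cost of running with slower remote memory.
    """
    if not home_node_busy:
        return "home node"          # fast local memory, no waiting
    if expected_queue_delay > remote_penalty:
        return "remote node"        # waiting costs more than slower memory
    return "home node (queued)"     # waiting is cheaper; stay near memory

# At 90% system load there is still idle capacity elsewhere, so migrating
# away from the VM's memory can win; at 100% load every node is busy, so
# VMs tend to stay queued near their memory.
print(place_vm(True, expected_queue_delay=5, remote_penalty=2))  # remote node
print(place_vm(True, expected_queue_delay=1, remote_penalty=2))  # home node (queued)
```

This matches the observed dip in HT utilization at full load: once no node has spare cycles, migration stops paying off and accesses stay local.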


Figure 2 shows the average number of local and remote memory requests per second. The number of remote requests follows the same shape as the HT utilization curve in Figure 1, while the number of local requests climbs more steeply between 18 VMs (90% CPU utilization) and 24 VMs (100% utilization) than the modest CPU increase alone would suggest. This data demonstrates that the ESX scheduler does an excellent job of improving performance by taking advantage of the underlying memory architecture.


I then repeated the experiment with the DL585 configured in memory-interleave (non-NUMA) mode in order to quantify the benefits of using NUMA on this system. Figure 3 shows the steady-state HT utilization in this configuration. (I also replotted the NUMA results to make comparison simpler.) Overall, the curve is as expected. It shows substantially higher utilization in all cases and continues to rise with 24 VMs. Figure 4 shows that remote accesses now account for the bulk of memory accesses, which is the opposite of the NUMA case. The tests also consumed slightly more CPU resources than the NUMA configuration at each load level due to the higher average memory latencies caused by the high proportion of remote accesses. The average CPU utilization was 30%, 62%, 95%, and 100% with 6 VMs, 12 VMs, 18 VMs, and 24 VMs, respectively.
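The dominance of remote accesses in interleaved mode follows directly from how interleaving works: pages are striped round-robin across all nodes, so each processor finds only about 1/N of its pages locally. A short sketch, again using placeholder latencies rather than DL585 measurements:

```python
# Under node interleaving, pages are striped round-robin across all nodes,
# so a processor sees only ~1/N of its pages locally. Latencies below are
# illustrative placeholders, not DL585 measurements.
NODES = 4        # the DL585 here has four NUMA nodes
LOCAL_NS = 80    # hypothetical local latency (ns)
REMOTE_NS = 140  # hypothetical remote latency (ns)

local_fraction = 1.0 / NODES  # ~25% local, ~75% remote under interleaving
avg_interleaved = local_fraction * LOCAL_NS + (1 - local_fraction) * REMOTE_NS

print(local_fraction)   # 0.25
print(avg_interleaved)  # 125.0 ns on average
```

A ~75% remote access rate is consistent with both the higher HT utilization in Figure 3 and the slightly higher CPU consumption at each load level: the processors spend more cycles stalled on slower remote memory.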


The performance differences between NUMA and interleaved memory can also be quantified by examining the overall VMmark benchmark scores shown in Figure 5. In each case the NUMA configuration achieves higher throughput than the interleaved configuration, and the performance gap grows as the load increases. When the fourth tile is added, the NUMA configuration still has roughly 10% of its CPU cycles idle versus roughly 5% in the interleaved case, so it sees a measurably larger throughput boost from that final tile.


All in all, I am quite pleased with the results. They tell us that we need not worry about overstressing NUMA systems even as vendors make quad-core processors ubiquitous. In fact, I would say that virtual environments are a great match for commodity NUMA-based multi-core systems due to the encapsulation of memory requests within a virtual machine, which creates a largely local access pattern and limits stress on the memory subsystem. Of equal importance, these results show that the ESX scheduler exploits these types of systems well, which is good to see given how much work I know our kernel team has put into it. This type of exercise is just another area where a robust, stable, and representative virtualization benchmark like VMmark can prove invaluable.