This is the third part in an ongoing series on exploring performance issues of virtualizing HPC applications. In the first part, we described the setup and considered pure memory bandwidth using Stream. The second part considered the effect of network latency in a scientific application (NAMD) that ran across several virtual machines. Here we look at two of the tests in the HPC Challenge (HPCC) suite: StarRandomAccess and HPL. While certainly not spanning all possible memory access patterns found in HPC apps, these two tests are very different from each other and should help to give bounds on virtualization overhead related to these patterns.
Virtualization adds indirection to memory page table mappings: in addition to the logical-to-physical mappings maintained by the OS (either native or in a VM), the hypervisor must maintain guest physical-to-machine mappings. A straightforward implementation of both mappings in software would result in enormous overhead. Prior to the introduction of hardware MMU features in Intel (EPT) and AMD (RVI) processors, the performance problem was solved through the use of “shadow” page tables. These collapsed the two mappings to one so that the processor TLB cache could be used efficiently; however, updating shadow page tables is expensive. With EPT and RVI, both mappings are cached in the TLB, eliminating the need for shadow page tables. The trade-off is that a TLB miss can be expensive: the cost is not just double the cost of a miss in a conventional TLB; it is the square of the number of steps in the TLB page walk. This cost can be reduced by using large memory pages (2MB in x86_64) which typically need four steps in the TLB, rather than small pages (4KB) which need five. This overview is highly simplified; see the performance RVI and EPT whitepapers for much more detail about MMU virtualization, as well as results from several benchmarks representing enterprise applications. Here we extend the EPT paper to HPC apps running on a current version of vSphere.
Although there are certainly exceptions, two memory characteristics are common to HPC applications: a general lack of page table manipulation, and heavy use of memory itself. Memory is allocated once (along with the associated page tables) and used for a long time. This use can either be dominated by sequential accesses (running through an array), or by random accesses. The latter will put more stress on the TLB. Common enterprise apps are often the opposite: much heavier page table activity but lighter memory usage. Thus HPC apps do not benefit much from the elimination of shadow page tables (this alone made many enterprise apps run close to native performance as shown in the above papers), but may be sensitive to the costs of TLB misses.
These points are illustrated by two tests from the HPCC suite. StarRandomAccess is a relatively simple microbenchmark that continuously accesses random memory addresses. HPL is a standard floating-point linear algebra benchmark that accesses memory more sequentially. For these tests, version 1.4.1 of HPCC was used on RHEL 5.5 x86_64. Hyper-threading was turned off in the BIOS and all work was limited to a single socket (automatically in the virtual cases and forced with numactl for native). In this way, the effects of differences between native and virtual in how HT and NUMA are treated were eliminated. For virtual, a 4-vCPU VM with 22GB was used on a lab version (build 294208) of ESX 4.1. The relevant HPCC parameters are N=40000, NB=100, P=Q=2, and np=4. These values ensure all the CPU resources and nearly all the available memory of one socket was consumed, thereby minimizing memory cache effects. The hardware is the same as in the first part of this series. In particular, Xeon X5570 processors with EPT are used.
Throughput results for StarRandomAccess are shown in Table 1. The metric GUP/s is billions of updates per second, a measure of memory bandwidth. Small/large pages refers to memory allocation in the OS and application. For virtual, ESX always backs guest memory with large pages, if possible (as it is here). The default case (EPT enabled, small pages in the guest) achieves only about 85% of native throughput. For an application with essentially no I/O or privileged instructions that require special handling by the hypervisor, this is surprisingly poor at first glance. However, this is a direct result of the hardware architecture needed to avoid shadow page tables. Disabling EPT results in near-native performance because, now, the TLB costs are essentially the same as for native and the software MMU costs are minimal. TLB costs are still substantial as seen by the effect of using large pages in native and the guest OS: more than doubling of the performance. The virtualization overhead is reduced to manageable levels, although there is still a 2% benefit from disabling EPT.
Table 1. StarRandomAccess throughput, GUP/s (ratio to native)
|EPT on||EPT off|
|Small pages||0.01842||0.01561 (0.848)||0.01811 (0.983)|
|Large pages||0.03956||0.03805 (0.962)||0.03900 (0.986)|
Table 2 shows throughput results for HPL. The metric Gflops/s is billions of floating point operations per second. Memory is largely accessed sequentially, greatly reducing the stress on the TLB and the effect of large pages. Large pages improve virtual performance by 4%, but improve native performance by less than 2%. Disabling EPT improves virtual performance by only 0.5%. It is not clear why virtual is slightly faster than native in the large pages case; this will be investigated further.
Table 2. HPL throughput, Gflop/s (ratio to native)
|EPT on||EPT off|
|Small pages||37.04||36.04 (0.973)||36.22 (0.978)|
|Large pages||37.74||38.24 (1.013)||38.42 (1.018)|
While hardware MMU virtualization with Intel EPT and AMD RVI has been a huge benefit for many applications, these test results support the expectation that the benefit for HPC apps is smaller, and can even increase overhead in some cases. However, the example shown here where the latter is true is a microbenchmark that focuses on the worst case for this technology. Most HPC apps will not have so many random memory accesses, so the effect of EPT is likely to be small.