Monthly Archives: November 2010

HPC Application Performance on ESX 4.1: Memory Virtualization

This is the third part in an ongoing series exploring performance issues of
virtualizing HPC applications. In the first part, we described the setup and
considered pure memory bandwidth using Stream. The second part considered the
effect of network latency in a scientific application (NAMD) that ran across
several virtual machines. Here we look at two of the tests in the HPC Challenge
(HPCC) suite: StarRandomAccess and HPL. While certainly not spanning all
possible memory access patterns found in HPC apps, these two tests are very
different from each other and should help to give bounds on virtualization
overhead related to these patterns.

Virtualization adds indirection to memory page table mappings: in addition to the logical-to-physical
mappings maintained by the OS (either native or in a VM), the hypervisor must
maintain guest physical-to-machine mappings. A straightforward implementation
of both mappings in software would result in enormous overhead. Prior to the
introduction of hardware MMU features in Intel (EPT) and AMD (RVI) processors, the
performance problem was solved through the use of “shadow” page tables. These
collapsed the two mappings to one so that the processor TLB cache could be used efficiently;
however, updating shadow page tables is expensive. With EPT and RVI, both
mappings are cached in the TLB, eliminating the need for shadow page tables. The
trade-off is that a TLB miss can be expensive: the cost is not just double the
cost of a miss in a conventional TLB; it is the square of the number of
steps in the TLB page walk. This cost can be reduced by using large memory
pages (2MB in x86_64) which typically need four steps in the page walk, rather than
small pages (4KB) which need five. This overview is highly simplified; see the
RVI and EPT performance
whitepapers for much more detail about MMU virtualization, as well as results
from several benchmarks representing enterprise applications. Here we extend the
EPT paper to HPC apps running on a current version of vSphere.
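
As a rough illustration of the simplified cost model above, the sketch below squares the number of page-walk steps for the two page sizes. The step counts (five for small pages, four for large pages) are the ones quoted in the text. The function name is hypothetical, and the model ignores page-walk caches and other hardware effects.

```python
def nested_walk_cost(steps: int) -> int:
    """Worst-case TLB miss cost under the simplified model in the text:
    with two levels of mapping, cost grows as the square of the walk steps."""
    return steps * steps

# Step counts quoted in the text for x86_64 (not measured here).
SMALL_PAGE_STEPS = 5  # 4KB pages
LARGE_PAGE_STEPS = 4  # 2MB pages

for label, steps in (("small (4KB)", SMALL_PAGE_STEPS),
                     ("large (2MB)", LARGE_PAGE_STEPS)):
    print(f"{label}: native walk {steps} steps, "
          f"nested walk {nested_walk_cost(steps)} steps")
```

Under this model, moving from small to large pages shortens the nested walk from 25 steps to 16, in addition to each TLB entry covering more memory.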

While there are certainly exceptions, two memory characteristics are common to HPC
applications: a general lack of page table manipulation, and heavy use of
memory itself. Memory is allocated once (along with the associated page tables)
and used for a long time. This use can either be dominated by sequential
accesses (running through an array), or by random accesses. The latter will put
more stress on the TLB. Common enterprise apps are often the opposite: much heavier page
table activity but lighter memory usage. Thus HPC apps do not benefit much from
the elimination of shadow page tables (this alone made many enterprise apps run
close to native performance as shown in the above papers), but may be sensitive
to the costs of TLB misses.
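
To make the contrast between sequential and random access concrete, the sketch below counts how many distinct pages a short run of accesses touches, which is a proxy for how many TLB entries the run needs. The working-set size, access count, and element size are hypothetical values chosen for illustration; this is not how HPCC itself generates addresses.

```python
import random

def distinct_pages(addresses, page_size):
    """Count how many different pages a sequence of byte addresses touches."""
    return len({addr // page_size for addr in addresses})

SMALL_PAGE = 4 * 1024          # 4KB
LARGE_PAGE = 2 * 1024 * 1024   # 2MB
TABLE_BYTES = 1 << 30          # hypothetical 1 GiB working set
N_ACCESSES = 1024              # hypothetical number of accesses
WORD = 8                       # 8-byte elements

# Sequential: walk an array one element at a time.
sequential = [i * WORD for i in range(N_ACCESSES)]

# Random: pick word-aligned addresses anywhere in the working set.
rng = random.Random(0)  # fixed seed so the run is repeatable
scattered = [rng.randrange(TABLE_BYTES // WORD) * WORD
             for _ in range(N_ACCESSES)]

for name, addrs in (("sequential", sequential), ("random", scattered)):
    print(name,
          "| 4KB pages touched:", distinct_pages(addrs, SMALL_PAGE),
          "| 2MB pages touched:", distinct_pages(addrs, LARGE_PAGE))
```

The sequential run stays within a couple of pages, while nearly every random access lands on a different small page. Large pages shrink the number of pages the random run can touch, which is why they help the random case most.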

These points
are illustrated by two tests from the HPCC suite. StarRandomAccess is a
relatively simple microbenchmark that continuously accesses random memory
addresses. HPL is a standard floating-point linear algebra benchmark that
accesses memory more sequentially. For these tests, version 1.4.1 of HPCC was
used on RHEL 5.5 x86_64. Hyper-threading was disabled in the BIOS and all work
was limited to a single socket (automatically in the virtual cases and forced
with numactl for native). In this way, the effects of differences between native
and virtual in how HT and NUMA are treated were eliminated.
For virtual, a 4-vCPU VM with 22GB was used on a lab version (build 294208) of
ESX 4.1.  The relevant HPCC parameters are N=40000, NB=100,
P=Q=2, and np=4.
These values ensure that all the CPU resources and nearly all the available memory
of one socket were consumed, thereby minimizing memory cache effects. The
hardware is the same as in the first part of this series. In particular, Xeon
X5570 processors with EPT are used.
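
As a quick sanity check on the parameters above, the sketch below estimates the footprint of the HPL matrix, assuming the usual N x N double-precision layout (8 bytes per element). It counts only that matrix, not other HPCC buffers or OS memory, and the function name is hypothetical.

```python
def hpl_matrix_bytes(n: int, bytes_per_element: int = 8) -> int:
    """Size of an N x N matrix of double-precision values."""
    return n * n * bytes_per_element

# Parameter values quoted in the text.
N, NB, P, Q, NP = 40000, 100, 2, 2, 4

# In this setup the P x Q process grid uses every process,
# and N happens to be a whole number of NB-sized blocks.
grid_uses_all_processes = (P * Q == NP)
blocks_per_dimension = N // NB

matrix_gb = hpl_matrix_bytes(N) / 1e9
print(f"Process grid {P}x{Q} uses all {NP} processes: {grid_uses_all_processes}")
print(f"Blocks per matrix dimension: {blocks_per_dimension}")
print(f"HPL matrix: {matrix_gb:.1f} GB total, {matrix_gb / NP:.1f} GB per process")
```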

The results for StarRandomAccess are shown in Table 1. The metric GUP/s is billions
of updates per second, a measure of memory bandwidth. Small/large pages refers
to memory allocation in the OS and application. For virtual, ESX always backs
guest memory with large pages, if possible (as it is here). The default case
(EPT enabled, small pages in the guest) achieves only about 85% of native
throughput.  For an application with essentially no I/O or privileged
instructions that require special handling by the hypervisor, this is
surprisingly poor at first glance. However, this is a direct result of the
hardware architecture needed to avoid shadow page tables. Disabling EPT results
in near-native performance because, now, the TLB costs are essentially the same as for
native and the software MMU costs are minimal. TLB costs are still substantial
as seen by the effect of using large pages in native and the guest OS: more
than doubling of the performance. The virtualization overhead is reduced to
manageable levels, although there is still a 2% benefit from disabling EPT.

Table 1.  StarRandomAccess throughput, GUP/s (ratio to native)

               Native     Virtual, EPT on    Virtual, EPT off
Small pages    0.01842    0.01561 (0.848)    0.01811 (0.983)
Large pages    0.03956    0.03805 (0.962)    0.03900 (0.986)
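
The ratios in Table 1 can be recomputed from the throughput values with a short script. This is a minimal sketch; the dictionary layout and function names are hypothetical, and because the inputs are the rounded values printed in the table, a recomputed ratio may differ from the published one in the last digit.

```python
# Throughput in GUP/s, copied from Table 1.
table1 = {
    "small": {"native": 0.01842, "ept_on": 0.01561, "ept_off": 0.01811},
    "large": {"native": 0.03956, "ept_on": 0.03805, "ept_off": 0.03900},
}

def ratio_to_native(row, config):
    """Virtual throughput as a fraction of native for one page size."""
    return row[config] / row["native"]

def large_page_speedup(table, config):
    """How much faster large pages are than small pages for one configuration."""
    return table["large"][config] / table["small"][config]

for pages, row in table1.items():
    for config in ("ept_on", "ept_off"):
        print(f"{pages} pages, {config}: "
              f"{ratio_to_native(row, config):.3f} of native")

for config in ("native", "ept_on", "ept_off"):
    print(f"{config}: large pages are "
          f"{large_page_speedup(table1, config):.2f}x small pages")
```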

Table 2 shows throughput results for HPL. The metric Gflop/s is billions of floating
point operations per second. Memory is largely accessed sequentially, greatly
reducing the stress on the TLB and the effect of large pages. Large pages
improve virtual performance by about 6%, but improve native performance by less than 2%.
Disabling EPT improves virtual performance by only 0.5%. It is not clear why
virtual is slightly faster than native in the large pages case; this will be
investigated further.

Table 2.  HPL throughput, Gflop/s (ratio to native)

               Native    Virtual, EPT on    Virtual, EPT off
Small pages    37.04     36.04 (0.973)      36.22 (0.978)
Large pages    37.74     38.24 (1.013)      38.42 (1.018)
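
The percentage gains for HPL can be derived from the Table 2 values in the same way. The sketch below is illustrative only; the dictionary layout and the helper name are hypothetical, and the inputs are the rounded values printed in the table.

```python
# Throughput in Gflop/s, copied from Table 2.
table2 = {
    "small": {"native": 37.04, "ept_on": 36.04, "ept_off": 36.22},
    "large": {"native": 37.74, "ept_on": 38.24, "ept_off": 38.42},
}

def percent_gain(new, old):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0

# Gain from switching small pages to large pages, per configuration.
for config in ("native", "ept_on", "ept_off"):
    gain = percent_gain(table2["large"][config], table2["small"][config])
    print(f"{config}: large pages gain {gain:.1f}%")

# Gain from disabling EPT, per page size.
for pages in ("small", "large"):
    gain = percent_gain(table2[pages]["ept_off"], table2[pages]["ept_on"])
    print(f"{pages} pages: disabling EPT gains {gain:.1f}%")
```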

While hardware MMU virtualization with Intel EPT and AMD RVI has been a huge benefit for many applications, these test results
support the expectation that the benefit for HPC apps is smaller, and that it can even increase overhead in some cases. However, the example shown
here where the latter is true is a microbenchmark that focuses on the worst case for this technology. Most HPC apps will not have so
many random memory accesses, so the effect of EPT is likely to be small.


Virtualizing SQL Server-based vCenter database – Performance Study

vSphere is an industry-leading virtualization platform that enables customers to build private clouds for running enterprise applications such as SQL Server databases. Customers can expect near-native performance from their virtualized SQL databases when running in a vSphere environment. VMware vCenter Server, the management component of vSphere, uses a database to store and organize information related to vSphere-based virtual environments. This database can be implemented using SQL Server. Based on previous VMware performance studies involving SQL databases, it is reasonable to expect the performance of a virtualized SQL Server-based vCenter database to be similar to that of a native one.

A study was conducted in the VMware performance engineering lab to validate the assumption. The results of the study show that:

  • The most resource-intensive operations of a virtualized SQL Server-based vCenter database perform at a level comparable to that in a native environment.
  • A SQL Server-based vCenter database managing a vSphere virtual environment of any scale can be virtualized on vSphere.
  • SQL databases, in general, perform at a near-native level when virtualized on vSphere 4.1.

Complete details of the experiments and their results can be found in this technical document.

For comments or questions on this article, please join me at voiceforvirtual.com.