
Monthly Archives: September 2010

Performance of Enterprise Java Applications on VMware vSphere 4.1 and SpringSource tc Server

VMware recently released a whitepaper presenting the results of a performance investigation using a representative enterprise-level Java application on VMware vSphere 4.1. The results of the tests discussed in that paper show that enterprise-level Java applications can provide excellent performance when deployed on VMware vSphere 4.1.  The main topics covered by the paper are a comparison of virtualized and native performance, and an examination of scale-up versus scale-out tradeoffs.

The paper first covers a set of tests that were performed to determine whether an enterprise-level Java application virtualized on VMware vSphere 4.1 can provide performance equivalent to a native deployment configured with the same memory and compute resources. The tests used response time as the primary metric for comparing the performance of native and virtualized deployments. The results show that at CPU utilization levels commonly found in real deployments, the native and virtual response times are close enough to provide an essentially identical user experience. Even at peak load, with CPU utilization near the saturation point, the peak throughput of the virtualized application was within 90% of that of the native deployment.

The paper then discusses the results of an investigation of the performance impact of scaling up the configuration of a single VM (adding more vCPUs) versus scaling out to deploy the application on multiple smaller VMs. At loads below 80% CPU utilization, the response times of scale-up and scale-out configurations using the same number of total vCPUs were effectively equivalent. At higher loads, the peak-throughput results for the different configurations were also similar, with a slight advantage for the scale-out configurations.

The application used in these tests was Olio, a multi-tier enterprise application which implements a complete social-networking website.  Olio was deployed on SpringSource tc Server, running both natively and virtualized on vSphere 4.1.  

For more information, please read the full paper at http://www.vmware.com/resources/techresources/10158.  In addition, the author will be publishing additional results on his blog at http://communities.vmware.com/blogs/haroldr.

HPC Application Performance on ESX 4.1: NAMD

This is the second part in an ongoing series exploring the performance of virtualized HPC applications. In the first part (/performance/2010/09/hpc-application-performance-on-esx-41-stream.html) we described the setup and considered memory bandwidth. Here we look at network latency in the context of a single application. Evaluating the effect of network latency in general is far more difficult, since HPC apps range from those needing microsecond latency to embarrassingly parallel apps that work well on slow networks. NAMD is a molecular dynamics code that is definitely not embarrassingly parallel but is known to run fine over 1 GbE TCP/IP, at least for small clusters. As such it represents the network requirements of an important class of HPC apps.


NAMD is a molecular dynamics application used to investigate the properties of large molecules. It supports both shared-memory parallelism and multiple-machine parallelism using TCP/IP. The native results use up to 16 processes on a single machine ("local" mode). Future work will use multiple machines, but some idea of the performance issues involved can be obtained by running multiple VMs in various configurations on the same physical host. The benchmark consists of running 500 steps of the Satellite Tobacco Mosaic Virus ("STMV"), which consists of slightly over 1 million atoms, large enough to enable good scaling on fairly large clusters. Shown below are elapsed time measurements for various configurations. Each is an average of 3 runs, and the repeatability is good. The virtual NIC is e1000 for all the virtualized cases.
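
For readers unfamiliar with how NAMD is typically launched, the runs described here would look roughly like the following. This is only a sketch based on NAMD's standard charmrun launcher; the input file name and process counts are assumptions, not the exact commands used in these tests.

# "Local" mode: all processes on one machine (or inside one VM)
charmrun +p16 ++local namd2 stmv.namd

# TCP/IP mode: processes spread across the machines (or VMs) listed in a nodelist file
charmrun +p16 ++nodelist nodelist namd2 stmv.namd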

An apples-to-apples comparison between native and virtual is obtained by disabling HT and using a single 8-vCPU VM. The VM is configured with 12GB of memory, and default ESX parameters are used. With no networking, the virtual overhead is just 1%, as shown in Table 1.
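
As a rough illustration, this sizing corresponds to a few standard entries in the VM's .vmx configuration file; the key names are ordinary vSphere options, but the exact file contents below are an assumption rather than something quoted from these tests (the e1000 line reflects the vNIC type noted above).

numvcpus = "8"
memsize = "12288"
ethernet0.virtualDev = "e1000"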


Table 1. NAMD elapsed time, STMV molecule, HT disabled


                        Total processes
                        4          8
Native                  1748       915
1 VM                    1768       926

The effect of splitting the application across multiple machines and using different network configurations can be tested in a virtual environment. For these tests HT is enabled to get the full performance of the machine. The single-VM case is configured as above. The 2-VM cases are configured with 12GB, 8 vCPUs, and preferHT=1 (so each VM can be scheduled on a NUMA node). The 4-VM cases have 6GB, 4 vCPUs, and preferHT=0. When multiple VMs communicate using the same vSwitch, ESX handles all the traffic in memory. For the multiple-vSwitch cases, each vSwitch is associated with a physical NIC that is connected to a physical switch. Since all networking traffic must go through this switch, this configuration is equivalent to using multiple hosts in terms of inter-VM communication latencies. An overview of vSwitches and networking in ESX is available at http://www.vmware.com/pdf/vsphere4/r41/vsp_41_esx_server_config.pdf.
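
For concreteness, a sketch of how these pieces might be set up: preferHT is typically applied as a per-VM advanced option (numa.vcpu.preferHT), and extra vSwitches with physical uplinks can be created from the ESX service console. The switch, port group, and NIC names below are illustrative assumptions, not the exact ones used in these tests.

# Per-VM advanced option in the .vmx file for the 2-VM cases
numa.vcpu.preferHT = "TRUE"

# Create an additional vSwitch, attach a physical uplink, and add a port group
esxcfg-vswitch -a vSwitch1
esxcfg-vswitch -L vmnic1 vSwitch1
esxcfg-vswitch -A "NAMD-Net1" vSwitch1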


Table 2. NAMD elapsed time, STMV molecule, HT enabled


                        Total processes
                        4          8          16
Native                  1761       1020       796
1 VM                    1766       923        -
2 VMs, 1 vSwitch        1779       928        787
2 VMs, 2 vSwitches      1800       965        806
4 VMs, 1 vSwitch        1774       940        810
4 VMs, 4 vSwitches      1885       1113       903

The single-VM case shows that HT has little effect on ESX performance when the extra logical processors are not used. However, HT does slow down the native 8-process case significantly. This appears to be due to Linux not scheduling one process per core when it has the opportunity, which the ESX scheduler does by default. Scalability from 4 to 8 processes for the single-vSwitch cases is close to 1.9X, and from 8 to 16 processes (using the same number of cores, but taking advantage of HT) it is 1.17X. This is excellent scaling. Networking over the switch reduces the performance somewhat, especially for four vSwitches. Scaling for native is hurt because the application does not manage NUMA resources itself, and Linux is limited in how well it can do this. This allows one of the 16-process virtualized cases to be slightly faster than native, despite the virtualization and multiple-machine overheads. The 16-process cases have the best absolute performance, and therefore correspond to how NAMD would actually be configured in practice. Here, the performance of all the virtualized cases is very close to native, except for the 4-vSwitch case, where the extra overhead of networking has a significant effect. This is expected, and that case should not be compared directly to native since it models four separate hosts. We plan to investigate multiple-host scaling soon to enable a direct comparison. A useful simulation needs up to 10 million steps, which would only be practical on a large cluster and only if all the software components scale very well.


These tests
show that a commonly-used molecular dynamics application can be virtualized on a
single host with little or no overhead. This particular app is representative of
HPC workloads with moderate networking requirements. Simulating four separate
hosts by forcing networking to go outside the box causes a slowdown of about
12%, but it is likely the corresponding native test will see some slowdown as
well. We plan to expand the testing to multiple hosts and to continue to search
for workloads that test the boundaries of what is possible in a virtualized
environment.

HPC Application Performance on ESX 4.1: Stream






Recently VMware has seen increased interest in migrating High Performance Computing (HPC) applications to virtualized environments. This is due to the many advantages virtualization brings to HPC, including consolidation, support for heterogeneous OSes, ease of application development, security, job migration, and cloud computing (all described at http://communities.vmware.com/community/cto/high-performance). Currently some subset of HPC applications virtualize well from a performance perspective. Our long-term goal is to extend this to all HPC apps, realizing that large-scale apps with the lowest latency and highest bandwidth requirements will be the most challenging. Users who run HPC apps are traditionally very sensitive to performance overhead, so it is important to quantify the performance cost of virtualization and properly weigh it against the advantages. Compared to commercial apps (databases, web servers, and so on), which are VMware’s bread-and-butter, HPC apps place their own set of requirements on the platform (OS/hypervisor/hardware) in order to execute well. Two common ones are low-latency networking (since a single app is often spread across a cluster of machines) and high memory bandwidth. This article is the first in a series that will explore these and other aspects of HPC performance. Our goal will always be to determine what works, what doesn’t, and how to get more of the former. The benchmark reported on here is Stream (http://www.cs.virginia.edu/stream/ref.html), a standard tool designed to measure memory bandwidth. It is a “worst case” micro-benchmark; real applications will not achieve higher memory bandwidth.

Configuration

All tests were performed on an HP DL380 with two Intel X5570 processors, 48 GB memory (12
× 4 GB DIMMs), and four 1-GbE NICs (Intel Pro/1000 PT Quad Port Server Adapter)
connected to a switch. Guest and native OS is RHEL 5.5 x86_64. Hyper-threading
is enabled in the BIOS, so 16 logical processors are available. Processors and
memory are split between two NUMA nodes. A pre-GA lab version of ESX 4.1 was
used, build 254859.

Test Results

The OpenMP version of Stream is used. It is built using a
compiler switch as follows:


gcc -O2 -fopenmp stream.c -o stream
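
The array size and iteration count discussed below are hard-wired in the source rather than passed on the command line; in the classic stream.c these are the N and NTIMES macros, so the values used for these tests presumably correspond to edits along these lines (an assumption about the source, not a quote from it):

#define N      100000000   /* 10^8 elements per array, as described below */
#define NTIMES 40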


The number of simultaneous threads is controlled by an environment
variable:


export OMP_NUM_THREADS=8


The array size (N) and number of iterations (NTIMES) are hard-wired in the code as N=10^8 (for a single machine) and NTIMES=40. The large array size ensures that the processor cache provides little or no benefit. Stream reports maximum memory bandwidth performance in MB/sec for four tests: copy, scale, add, and triad (see the above link for descriptions of these). M stands for 1 million, not 2^20. Here are the native results, as a function of the number of threads:


Table 1. Native memory bandwidth, MB/s


                  Threads
                  1         2         4         8         16
Copy              6388      12163     20473     26957     26312
Scale             5231      10068     17208     25932     26530
Add               7070      13274     21481     29081     29622
Triad             6617      12505     21058     29328     29889

Note that the scaling starts to fall off after two threads and the memory links are
essentially saturated at 8 threads. This is one reason why HPC apps often do not
see much benefit from enabling Hyper-Threading. To achieve the maximum aggregate
memory bandwidth in a virtualized environment, two virtual machines (VMs) with 8
vCPUs each were used. This is appropriate only for
modeling apps that can be split across multiple machines. One instance of stream
with N=5×10^7 was run in each VM simultaneously so the total amount of
memory accessed was the same as in the native test. The advanced configuration
option preferHT=1 is used (see below). Bandwidths
reported by the VMs are summed to get the total. The results are shown in Table
2: just slightly greater bandwidth than for the corresponding native case.
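
Mechanically, obtaining the aggregate numbers is just a matter of launching Stream in both VMs at the same time and adding up the rates each one reports. A minimal sketch, assuming hypothetical VM hostnames vm1 and vm2 and the stream binary in the home directory (the Triad line is parsed here; the other tests work the same way):

# Run 8 threads in each VM (16 total) and collect the output
for vm in vm1 vm2; do
    ssh $vm 'OMP_NUM_THREADS=8 ./stream' > $vm.out &
done
wait

# Sum the Triad rates (MB/s) reported by the two VMs
grep '^Triad' vm1.out vm2.out | awk '{sum += $2} END {print "Aggregate Triad MB/s:", sum}'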


Table 2. Virtualized total memory bandwidth, MB/s, 2 VMs, preferHT=1


                  Total threads
                  2         4         8         16
Copy              12535     22526     27606     27104
Scale             10294     18824     26781     26537
Add               13578     24182     30676     30537
Triad             13070     23476     30449     30010

It is apparent that the Linux “first-touch” scheduling algorithm, together with the simplicity of the Stream algorithm, is enough to ensure that nearly all memory accesses in the native tests are “local” (that is, the processor each thread runs on and the memory it accesses both belong to the same NUMA node). In ESX
4.1 NUMA information is not passed to the guest OS and (by default) 8-vCPU VMs
are scheduled across NUMA nodes in order to take advantage of more physical
cores. This means that about half of memory accesses will be “remote” and that
in the default configuration one or two VMs must produce significantly less
bandwidth than the native tests. Setting preferHT=1
tells the ESX scheduler to count logical processors (hardware threads) instead
of cores when determining if a given VM can fit on a NUMA node. In this case
that forces both memory and CPU of an 8-vCPU VM to be scheduled on a single NUMA
node. This guarantees all memory accesses are local and the aggregate bandwidth
of two VMs can equal or exceed native bandwidth. Note that a single VM cannot
match this bandwidth. It will get either half of it (because it’s using the
resources of only one NUMA node), or about 70% (because half the memory accesses
are remote). In both native and virtual environments, the maximum bandwidth of
purely remote memory accesses is about half that of purely local. On machines
with more NUMA nodes, remote memory bandwidth may be less and the importance of
memory locality even greater.
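
The local-versus-remote gap is easy to reproduce on the native side by pinning Stream to one NUMA node and placing its memory either locally or on the other node with numactl. This is just an illustrative sketch (node numbering assumes the two-node host described in the configuration section):

# 4 threads on node 0, memory allocated on node 0 (all local accesses)
OMP_NUM_THREADS=4 numactl --cpunodebind=0 --membind=0 ./stream

# Same threads, memory forced onto node 1 (all remote accesses)
OMP_NUM_THREADS=4 numactl --cpunodebind=0 --membind=1 ./stream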

Summary

In both
native and virtualized environments, equivalent maximum memory bandwidth can be
achieved as long as the application is written or configured to use only local
memory. For native this means relying on the Linux “first-touch” scheduling
algorithm (for simple apps) or implementing explicit mechanisms in the code
(usually difficult if the code wasn’t designed for NUMA). For virtual a
different mindset is needed: the application needs to be able to run across
multiple machines, with each VM sized to fit on a NUMA node. On machines with
hyper-threading enabled, preferHT=1 needs to be set
for the larger VMs. If these requirements can be met, then a valuable feature of
virtualization is that the app needs to have no NUMA awareness at all; NUMA
scheduling is taken care of by the hypervisor (for all apps, not just for those
where Linux is able to align threads and memory on the same NUMA node). For
those apps where these requirements can’t be met (ones that need a large single
instance OS), current development focus is on relaxing these requirements so
they are more like native, while retaining the above advantage for small
VMs.