
Virtualized Hadoop Performance with vSphere 5.1

In an earlier paper it was shown, on a small seven-host cluster, that Hadoop can be virtualized with little overhead, and that better-than-native performance can be achieved with the right configuration. However, the reasons for the observed performance behavior were not well understood. Recently, this work was refreshed with a larger cluster of 32 high-performance hosts running VMware vSphere® 5.1. The performance of native and several virtual configurations was compared for three applications. The apples-to-apples case of a single virtual machine per host shows performance close to that of native. Improvements in elapsed time of up to 13% for the most important application (TeraSort) can be achieved by partitioning each host into two or four virtual machines, resulting in performance competitive with or even better than native, as shown in the figure below (the number of VMs is per host, and a lower ratio is better). Details of the results are in a new whitepaper: “Virtualized Hadoop Performance with VMware vSphere 5.1”. The paper also discusses the use of several performance tools and models to gain a better understanding of both the sources of virtualization overhead and the reasons why configuring multiple smaller virtual machines per host can enhance performance. Based on this, recommendations for optimal hardware and software configuration are also given.

Protecting Hadoop with VMware vSphere 5 Fault Tolerance

Hadoop provides a platform for building distributed systems for massive data storage and analysis. It has internal mechanisms to replicate user data and to tolerate many kinds of hardware and software failures. However, like many other distributed systems, Hadoop has a small number of Single Points of Failure (SPoFs). These include the NameNode (which manages the Hadoop Distributed Filesystem namespace and keeps track of storage blocks), and the JobTracker (which schedules jobs and the map and reduce tasks that make up each job). VMware vSphere Fault Tolerance (FT) can be used to protect virtual machines that run these vulnerable components of a Hadoop cluster. Recently a cluster of 24 hosts was used to run three different Hadoop applications to show that such protection has only a small impact on application performance. Various Hadoop configurations were employed to artificially create greater load on the NameNode and JobTracker daemons. With conservative extrapolation, these tests show that uniprocessor virtual machines with FT enabled are sufficient to run the master daemons for clusters of more than 200 hosts.

A new white paper, “Protecting Hadoop with VMware vSphere 5 Fault Tolerance,” is now available in which these tests are described in detail. CPU and network utilization of the protected VMs are given to enable comparisons with other distributed applications. In addition, several best practices are suggested to maximize the size of the Hadoop cluster that can be protected with FT.

Virtualized Hadoop Performance on vSphere 5

In recent years the amount of data stored worldwide has exploded, leading to the birth of the term 'Big Data'. While data at this scale brings with it complexity in storage and handling, these large datasets are known to have business information buried in them that is critical to continued growth and success. The last few years have seen the birth of several new tools to manage and analyze such large datasets in a timely way (where traditional tools have had limitations). A natural question to ask is how these tools perform on vSphere. As the start of an ongoing effort to quantify the performance of big data tools on vSphere, we've chosen to test one of the more popular tools – Hadoop.

Hadoop has emerged as a popular platform for the distributed processing of data. It scales to thousands of nodes while maintaining resiliency to disk, node, or even rack failure. It can use any storage, but is most often used with local disks. A whitepaper giving an overview of Hadoop and the details of tests on commodity hardware with local storage is available here. One of the findings in the paper is that running 2 or 4 smaller VMs per physical machine usually resulted in better performance, often exceeding native performance.

As we continue our performance testing, stay tuned for results on a larger cluster with bigger data, with other Big Data tools, and on shared storage.

HPC Application Performance on ESX 4.1: Memory Virtualization

This is the third part in an ongoing series on exploring performance issues of virtualizing HPC applications. In the first part, we described the setup and considered pure memory bandwidth using Stream. The second part considered the effect of network latency in a scientific application (NAMD) that ran across several virtual machines. Here we look at two of the tests in the HPC Challenge (HPCC) suite: StarRandomAccess and HPL. While certainly not spanning all possible memory access patterns found in HPC apps, these two tests are very different from each other and should help to give bounds on virtualization overhead related to these patterns.

Virtualization adds indirection to memory page table mappings: in addition to the logical-to-physical mappings maintained by the OS (either native or in a VM), the hypervisor must maintain guest physical-to-machine mappings. A straightforward implementation of both mappings in software would result in enormous overhead. Prior to the introduction of hardware MMU features in Intel (EPT) and AMD (RVI) processors, the performance problem was solved through the use of “shadow” page tables. These collapsed the two mappings to one so that the processor TLB cache could be used efficiently; however, updating shadow page tables is expensive. With EPT and RVI, both mappings are cached in the TLB, eliminating the need for shadow page tables. The trade-off is that a TLB miss can be expensive: the cost is not just double the cost of a miss in a conventional TLB; it grows as the square of the number of steps in the page walk. This cost can be reduced by using large memory pages (2MB in x86_64), which typically need four steps in the page walk, rather than small pages (4KB), which need five. This overview is highly simplified; see the RVI and EPT performance whitepapers for much more detail about MMU virtualization, as well as results from several benchmarks representing enterprise applications. Here we extend the EPT paper to HPC apps running on a current version of vSphere.

Although there are certainly exceptions, two memory characteristics are common to HPC applications: a general lack of page table manipulation, and heavy use of memory itself. Memory is allocated once (along with the associated page tables) and used for a long time. This use can either be dominated by sequential accesses (running through an array) or by random accesses. The latter will put more stress on the TLB. Common enterprise apps are often the opposite: much heavier page table activity but lighter memory usage. Thus HPC apps do not benefit much from the elimination of shadow page tables (this alone made many enterprise apps run close to native performance, as shown in the above papers), but may be sensitive to the costs of TLB misses.

These points are illustrated by two tests from the HPCC suite. StarRandomAccess is a relatively simple microbenchmark that continuously accesses random memory addresses. HPL is a standard floating-point linear algebra benchmark that accesses memory more sequentially. For these tests, version 1.4.1 of HPCC was used on RHEL 5.5 x86_64. Hyper-threading was disabled in the BIOS and all work was limited to a single socket (automatically in the virtual cases, and forced with numactl for native). In this way, the effects of differences between native and virtual in how HT and NUMA are treated were eliminated. For virtual, a 4-vCPU VM with 22GB was used on a lab version (build 294208) of ESX 4.1. The relevant HPCC parameters are N=40000, NB=100, P=Q=2, and np=4. These values ensure that all the CPU resources and nearly all the available memory of one socket were consumed, thereby minimizing memory cache effects. The hardware is the same as in the first part of this series. In particular, Xeon X5570 processors with EPT are used.
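
For reference, a run along these lines could be launched as shown below. This is a minimal sketch, assuming an MPI launcher named mpirun and the standard HPL-style hpccinf.txt input format; the exact launcher and file layout may differ by installation.

# Pin the native run to socket 0 (CPUs and memory), mirroring the 4-vCPU VM
numactl --cpunodebind=0 --membind=0 mpirun -np 4 ./hpcc

# Key hpccinf.txt entries used for these tests (HPL-style input):
#   N  = 40000      problem size, consuming most of one socket's memory
#   NB = 100        block size
#   P  = 2, Q = 2   process grid for np=4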

Throughput results for StarRandomAccess are shown in Table 1. The metric GUP/s is billions of updates per second, a measure of memory bandwidth. Small/large pages refers to memory allocation in the OS and application. For virtual, ESX always backs guest memory with large pages if possible (as it is here). The default case (EPT enabled, small pages in the guest) achieves only about 85% of native throughput. For an application with essentially no I/O or privileged instructions that require special handling by the hypervisor, this is surprisingly poor at first glance. However, it is a direct result of the hardware architecture needed to avoid shadow page tables. Disabling EPT results in near-native performance because the TLB costs are then essentially the same as for native and the software MMU costs are minimal. TLB costs are still substantial, as seen by the effect of using large pages in native and the guest OS: more than a doubling of performance. The virtualization overhead is reduced to manageable levels, although there is still a 2% benefit from disabling EPT.

Table 1. StarRandomAccess throughput, GUP/s (ratio to native)

              Native     Virtual, EPT on    Virtual, EPT off
Small pages   0.01842    0.01561 (0.848)    0.01811 (0.983)
Large pages   0.03956    0.03805 (0.962)    0.03900 (0.986)
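
As an illustration of the large-page configuration, on a RHEL 5 guest huge pages can be reserved and used by an unmodified application roughly as follows. This is a sketch; the page count and the use of libhugetlbfs are assumptions, not necessarily the exact mechanism used in these tests.

# Reserve 2MB huge pages and expose them through hugetlbfs
echo 10240 > /proc/sys/vm/nr_hugepages      # ~20GB of 2MB pages
mkdir -p /mnt/huge
mount -t hugetlbfs none /mnt/huge

# Back the application's heap with huge pages without recompiling
HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so ./hpcc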

Table 2 shows throughput results for HPL. The metric Gflop/s is billions of floating point operations per second. Memory is largely accessed sequentially, greatly reducing the stress on the TLB and the effect of large pages. Large pages improve virtual performance by 4%, but improve native performance by less than 2%. Disabling EPT improves virtual performance by only 0.5%. It is not clear why virtual is slightly faster than native in the large pages case; this will be investigated further.

Table 2. HPL throughput, Gflop/s (ratio to native)

              Native    Virtual, EPT on    Virtual, EPT off
Small pages   37.04     36.04 (0.973)      36.22 (0.978)
Large pages   37.74     38.24 (1.013)      38.42 (1.018)

While hardware MMU virtualization with Intel EPT and AMD RVI has been a huge benefit for many applications, these test results support the expectation that the benefit for HPC apps is smaller, and that the technology can even increase overhead in some cases. However, the example shown here where the latter is true is a microbenchmark that focuses on the worst case for this technology. Most HPC apps will not have so many random memory accesses, so the effect of EPT is likely to be small.

HPC Application Performance on ESX 4.1: NAMD

This is the second part in an ongoing series on exploring performance issues of virtualizing HPC applications. In the first part we described the setup and considered memory bandwidth. Here we look at network latency in the context of a single application. Evaluating the effect of network latency in general is far more difficult, since HPC apps range from ones needing microsecond latency to embarrassingly-parallel apps that work well on slow networks. NAMD is a molecular dynamics code that is definitely not embarrassingly parallel but is known to run fine over 1 GbE TCP/IP, at least for small clusters. As such it represents the network requirements of an important class of HPC apps.


NAMD is a molecular dynamics application used to investigate the properties of large molecules. NAMD supports both shared-memory parallelism and multiple-machine parallelism using TCP/IP. The native results use up to 16 processes on a single machine (“local” mode). Future work will use multiple machines, but some idea of the performance issues involved can be obtained by running multiple VMs in various configurations on the same physical host. The benchmark consists of running 500 steps of the Satellite Tobacco Mosaic Virus (STMV). STMV consists of slightly over 1 million atoms, which is large enough to enable good scaling on fairly large clusters. Shown below are elapsed time measurements for various configurations. Each is an average of 3 runs and the repeatability is good. The virtual NIC is e1000 for all the virtualized cases.
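
For reference, a shared-memory (“local” mode) run of this benchmark is typically launched as follows. This is a sketch; the configuration and log file names are assumptions.

# 8-process local run of the STMV benchmark (500 steps set in the config file)
charmrun +p8 ./namd2 stmv.namd > stmv_8p.log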

An apples-to-apples comparison between native and virtual is obtained by disabling HT and using a single 8-vCPU VM. The VM is configured with 12GB and default ESX parameters are used. With no networking, the virtual overhead is just 1%, as shown in Table 1.


Table 1. NAMD elapsed time, STMV molecule, HT disabled

             Total processes
             4        8
Native       1748     915
1 VM         1768     926
The effect of splitting the application across multiple machines and using different network configurations can be tested in a virtual environment. For these tests HT is enabled to get the full performance of the machine. The single-VM case is configured as above. The 2-VM cases are configured with 12GB, 8 vCPUs, and preferHT=1 (so each VM can be scheduled on a NUMA node). The 4-VM cases have 6GB, 4 vCPUs, and preferHT=0. When multiple VMs communicate using the same vSwitch, ESX handles all the traffic in memory. For the multiple-vSwitch cases, each vSwitch is associated with a physical NIC which is connected to a physical switch. Since all networking traffic must go through this switch, this configuration is the same as using multiple hosts in terms of inter-VM communication latencies. An overview of vSwitches and networking in ESX is available at http://www.vmware.com/pdf/vsphere4/r41/vsp_41_esx_server_config.pdf.
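
For illustration, an additional vSwitch with its own physical uplink can be created from the ESX service console roughly as follows (a sketch; the vSwitch, vmnic, and port group names are assumptions):

esxcfg-vswitch -a vSwitch1                    # create a second vSwitch
esxcfg-vswitch -L vmnic1 vSwitch1             # uplink it to a physical NIC
esxcfg-vswitch -A "VM Network 2" vSwitch1     # add a port group for the VMs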


Table 2. NAMD elapsed time, STMV molecule, HT enabled

                      Total processes
                      4        8        16
Native                1761     1020     796
1 VM                  1766     923      -
2 VMs, 1 vSwitch      1779     928      787
2 VMs, 2 vSwitches    1800     965      806
4 VMs, 1 vSwitch      1774     940      810
4 VMs, 4 vSwitches    1885     1113     903

The single-VM case shows that HT has little effect on ESX performance when the extra logical processors are not used. However, HT does slow down the native 8-process case significantly. This appears to be due to Linux not scheduling one process per core when it has the opportunity, which the ESX scheduler does by default. Scalability from 4 to 8 processes for the single-vSwitch cases is close to 1.9X, and from 8 to 16 processes (using the same number of cores, but taking advantage of HT) it is 1.17X. This is excellent scaling. Networking over the switch reduces the performance somewhat, especially for four vSwitches. Scaling for native is hurt because the application does not manage NUMA resources itself, and Linux is limited in how well it can do this. This allows one of the 16-process virtualized cases to be slightly faster than native, despite the virtualization and multiple-machine overheads. The 16-process cases have the best absolute performance, and therefore correspond to how NAMD would actually be configured in practice. Here, the performance of all the virtualized cases is very close to native, except for the 4-vSwitch case, where the extra overhead of networking has a significant effect. This is expected, and this case should not be compared to the native case since it models four separate hosts. We plan to investigate multiple-host scaling soon to enable a direct comparison. A useful simulation needs up to 10 million steps, which would only be practical on a large cluster and only if all the software components scale very well.


These tests show that a commonly-used molecular dynamics application can be virtualized on a single host with little or no overhead. This particular app is representative of HPC workloads with moderate networking requirements. Simulating four separate hosts by forcing networking to go outside the box causes a slowdown of about 12%, but it is likely the corresponding native test will see some slowdown as well. We plan to expand the testing to multiple hosts and to continue to search for workloads that test the boundaries of what is possible in a virtualized environment.

HPC Application Performance on ESX 4.1: Stream

Recently VMware has seen increased interest in migrating High Performance Computing (HPC) applications to virtualized environments. This is due to the many advantages virtualization brings to HPC, including consolidation, support for heterogeneous OSes, ease of application development, security, job migration, and cloud computing (all described at http://communities.vmware.com/community/cto/high-performance). Currently some subset of HPC applications virtualize well from a performance perspective. Our long-term goal is to extend this to all HPC apps, realizing that large-scale apps with the lowest latency and highest bandwidth requirements will be the most challenging. Users who run HPC apps are traditionally very sensitive to performance overhead, so it is important to quantify the performance cost of virtualization and properly weigh it against the advantages. Compared to commercial apps (databases, web servers, and so on), which are VMware’s bread and butter, HPC apps place their own set of requirements on the platform (OS/hypervisor/hardware) in order to execute well. Two common ones are low-latency networking (since a single app is often spread across a cluster of machines) and high memory bandwidth. This article is the first in a series that will explore these and other aspects of HPC performance. Our goal will always be to determine what works, what doesn’t, and how to get more of the former. The benchmark reported on here is Stream (http://www.cs.virginia.edu/stream/ref.html), a standard tool designed to measure memory bandwidth. It is a “worst case” micro-benchmark; real applications will not achieve higher memory bandwidth.

Configuration

All tests were performed on an HP DL380 with two Intel X5570 processors, 48 GB memory (12 × 4 GB DIMMs), and four 1-GbE NICs (Intel Pro/1000 PT Quad Port Server Adapter) connected to a switch. The guest and native OS is RHEL 5.5 x86_64. Hyper-threading is enabled in the BIOS, so 16 logical processors are available. Processors and memory are split between two NUMA nodes. A pre-GA lab version of ESX 4.1 was used, build 254859.

Test Results

The OpenMP version of Stream is used. It is built with OpenMP enabled via a compiler switch, as follows:

gcc -O2 -fopenmp stream.c -o stream


The number of simultaneous threads is controlled by an environment variable:

export OMP_NUM_THREADS=8
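
For example, the thread-count sweep reported below can be scripted as follows (a simple sketch):

# Run Stream at each thread count used in Table 1
for t in 1 2 4 8 16; do
    OMP_NUM_THREADS=$t ./stream
done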


The array size (N) and number of iterations (NTIMES) are hard-wired in the code as N=10^8 (for a single machine) and NTIMES=40. The large array size ensures that the processor cache provides little or no benefit. Stream reports maximum memory bandwidth performance in MB/s for four tests: copy, scale, add, and triad (see the above link for descriptions of these). M stands for 1 million, not 2^20. Here are the native results, as a function of the number of threads:


Table 1. Native memory bandwidth, MB/s

          Threads
          1       2       4       8       16
Copy      6388    12163   20473   26957   26312
Scalar    5231    10068   17208   25932   26530
Add       7070    13274   21481   29081   29622
Triad     6617    12505   21058   29328   29889

Note that the scaling starts to fall off after two threads, and the memory links are essentially saturated at 8 threads. This is one reason why HPC apps often do not see much benefit from enabling Hyper-Threading. To achieve the maximum aggregate memory bandwidth in a virtualized environment, two virtual machines (VMs) with 8 vCPUs each were used. This is appropriate only for modeling apps that can be split across multiple machines. One instance of Stream with N=5×10^7 was run in each VM simultaneously, so the total amount of memory accessed was the same as in the native test. The advanced configuration option preferHT=1 is used (see below). Bandwidths reported by the VMs are summed to get the total. The results are shown in Table 2: just slightly greater bandwidth than for the corresponding native case.


Table 2. Virtualized total memory bandwidth, MB/s, 2 VMs, preferHT=1

          Total threads
          2       4       8       16
Copy      12535   22526   27606   27104
Scalar    10294   18824   26781   26537
Add       13578   24182   30676   30537
Triad     13070   23476   30449   30010

It is apparent that the Linux “first-touch” scheduling algorithm, together with the simplicity of the Stream algorithm, is enough to ensure that nearly all memory accesses in the native tests are “local” (that is, the processor each thread runs on and the memory it accesses both belong to the same NUMA node). In ESX 4.1, NUMA information is not passed to the guest OS, and (by default) 8-vCPU VMs are scheduled across NUMA nodes in order to take advantage of more physical cores. This means that about half of memory accesses will be “remote” and that in the default configuration one or two VMs must produce significantly less bandwidth than the native tests. Setting preferHT=1 tells the ESX scheduler to count logical processors (hardware threads) instead of cores when determining if a given VM can fit on a NUMA node. In this case that forces both memory and CPU of an 8-vCPU VM to be scheduled on a single NUMA node. This guarantees all memory accesses are local, and the aggregate bandwidth of two VMs can equal or exceed native bandwidth. Note that a single VM cannot match this bandwidth. It will get either half of it (because it’s using the resources of only one NUMA node), or about 70% (because half the memory accesses are remote). In both native and virtual environments, the maximum bandwidth of purely remote memory accesses is about half that of purely local. On machines with more NUMA nodes, remote memory bandwidth may be less and the importance of memory locality even greater.
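
As an illustration, preferHT is a per-VM advanced configuration option; in the VM's .vmx file the relevant entries would look roughly like this. This is a sketch: the option is commonly spelled numa.vcpu.preferHT, and the vCPU count simply mirrors the 8-vCPU VMs used here.

numvcpus = "8"                    # 8 vCPUs, equal to the hardware threads of one socket
numa.vcpu.preferHT = "TRUE"       # count logical processors when fitting the VM to a NUMA node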

Summary

In both native and virtualized environments, equivalent maximum memory bandwidth can be achieved as long as the application is written or configured to use only local memory. For native, this means relying on the Linux “first-touch” scheduling algorithm (for simple apps) or implementing explicit mechanisms in the code (usually difficult if the code wasn’t designed for NUMA). For virtual, a different mindset is needed: the application needs to be able to run across multiple machines, with each VM sized to fit on a NUMA node. On machines with hyper-threading enabled, preferHT=1 needs to be set for the larger VMs. If these requirements can be met, then a valuable feature of virtualization is that the app needs no NUMA awareness at all; NUMA scheduling is taken care of by the hypervisor (for all apps, not just for those where Linux is able to align threads and memory on the same NUMA node). For those apps where these requirements can’t be met (ones that need a large single-instance OS), current development focus is on relaxing these requirements so they are more like native, while retaining the above advantage for small VMs.


Scale-out of XenApp on ESX 3.5

In an earlier posting (Virtualizing XenApp on XenServer 5.0 and ESX 3.5) we looked at the performance of virtualizing a Citrix XenApp workload in a 2-vCPU VM in comparison to the native OS booted with two cores. This provided valuable data about the single-VM performance of XenApp running on ESX 3.5. In our next set of experiments we used the same workload, and the same hardware, but scaled out to 8 VMs. This is compared to the native OS booted with all 16 cores. We found that ESX has near-linear scaling as the number of VMs is increased, and that aggregate performance with 8 VMs is much better than native.

We expected the earlier single-VM approach to produce representative results because of the excellent scale-out performance of ESX 3.5. This is especially true on NUMA architectures where VMs running on different nodes are nearly independent in terms of CPU and memory resources. However, the same cannot be said for the scale-up performance (SMP scaling) of a single native machine, or a single VM. As for many other applications, virtualizing many relatively small XenApp servers on a single machine can overcome the inherent SMP performance limitations of XenApp on the same machine.

In the current experiments, each VM is the same as before, except the allocated memory is set to 6700 MB (the amount needed to run 30 users). Windows 2003 x64 was used in both the VMs and natively. See the above posting for more workload and configuration details. Shown below is the average aggregate latency as a function of the total number of users. Every data point shown is a separate run with about 4 hours of steady state execution. Each user performs six iterations where a complete set of 22 workload operations is performed during each iteration. The latency of these operations is summed to get the aggregate latency. The average is over the middle four iterations, all the users, and all the VMs.

[Figure: average aggregate latency vs. total number of users, Native and ESX]

In both the Native and ESX cases all 16 cores are being used (although with much less than 100% utilization). At very low load Native has somewhat better total latency, but beyond 80 users the latency quickly degrades. Starting at 140 users some of the sessions start to fail. 120 users is really the upper limit for running this workload on Native. With 8 VMs on ESX, 20 users per VM (160 total) was not a problem at all, so we pushed the load up to 240 total users. At this point the latency is getting high, but there were no failures and all of the desktop applications were still usable. The load has to be increased to more than 200 users on ESX before the latency exceeds that from 120 users on Native. That is, for a Quality-of-Service standard of 39 seconds aggregate latency, ESX supports 67% more users than Native. Like many commonly-deployed applications, XenApp has limited SMP scalability. Roughly speaking, its scalability is better than common web apps but not as good as well-tuned databases. When running this workload, XenApp scales well to 4 CPUs, but 8 CPUs is marginal and 16 CPUs is clearly too many. Dividing the load among smaller VMs avoids SMP scaling issues and allows the full capabilities of the hardware to be utilized.

Some would say that even 200 XenApp users are not very many for such a powerful machine. In any benchmark of this kind many decisions have to be made with regard to the choice of applications, operations within each application, and amount of “user think time” between operations. As we pointed out earlier, we strove to make realistic choices when designing the VDI workload. However, one may choose to model users performing less resource-intensive operations and thus be able to support more of them.

The scale-out performance of ESX is quantified in the second chart, which shows the total latency as a function of the number of VMs, with all VMs running either a low (10 users), medium (20 users), or high (30 users) load. Flat lines would indicate perfect scalability, and the lines are indeed nearly flat for each of the load cases up to 4 VMs. The latency increases noticeably only for 8 VMs, and then only for higher loads. This indicates that the increased application latency is mostly due to the increased memory latency caused by running 2 VMs per NUMA node (as opposed to at most a single VM per node for four or fewer VMs).
[Figure: total latency vs. number of VMs at low, medium, and high load]

While our first blog showed how low the overhead is for running XenApp in a single 2-vCPU VM on ESX compared to a native OS booted with two CPUs, the current results, which fully utilize the 16-core machine, are even more compelling. They show the excellent scale-out performance of ESX on a modern NUMA machine, and that the aggregate performance of several VMs can far exceed the capabilities of a single native OS.

Virtualizing XenApp on XenServer 5.0 and ESX 3.5

There has always been interest in running Citrix XenApp (formerly Citrix Presentation Server) workloads on the VMware Virtual Infrastructure platform. With the advent of multi-core systems, purchasing decisions are driven towards systems with 4-16 cores. However, using this hardware effectively is difficult due to limited scaling of the XenApp application environment. In addition to the usual benefits of virtualization, these scaling issues make running XenApp environments on ESX even more compelling.

We recently ran some performance tests to understand what can be expected in terms of performance for a virtualized XenApp workload. The results show that ESX runs common desktop applications on XenApp with reasonable overhead compared to a native installation, and with significantly better performance than XenServer. We hope this data will help provide guidance when XenApp environments are transitioned from physical hardware to a virtualized environment.

Together with partners, we have been developing a desktop workload for over a year. The workload has been tested extensively on virtual desktop infrastructure (VDI) environments with one user per virtual machine (VM). VDI results have been presented and published in numerous locations (e.g., http://www.vmware.com/resources/techresources/1085, VMworld 2008 presentation VD2505 with Dell-EqualLogic). Great attention was paid to selecting the most relevant applications as well as to specifying the right types and amount of work each should do. Many other Terminal Services-style benchmarks fail to be representative of actual desktop users. Porting the workload from a VDI environment to the XenApp environment was straightforward.

XenApp was run in a single 14 GB 2-vCPU virtual machine (VM) booted with Windows Server 2003 x64. The hypervisors used were ESX 3.5 U3 and XenServer 5.0. The VMs for both had the appropriate tools/drivers installed. The XenServer VM had the Citrix XenApp optimization enabled. For comparison, the tests were run natively with the OS restricted to the same hardware resources. The hardware is an HP DL585 with 4 quad-core 2210 MHz “Barcelona” processors and 64 GB memory. Rapid Virtualization Indexing (RVI) was enabled.

The test consists of 22 operations, always executed in the following order:

IE_OPEN_2 – Open Internet Explorer
IE_ALBUM – Browse photos in IE
EXCEL_OPEN_2 – Open Excel file
EXCEL_FORMULA – Evaluate formula in Excel
EXCEL_SAVE_2 – Save Excel file
FIREFOX_OPEN – Open Firefox
FIREFOX_CLOSE – Close Firefox
ACROBAT_OPEN_1 – Open PDF file
ACROBAT_BROWSE_1 – Browse PDF file
PPT_OPEN – Open PowerPoint file
PPT_SLIDESHOW – Slideshow in PowerPoint
PPT_EDIT – Edit PowerPoint file
PPT_APPEND – Append to PowerPoint file
PPT_SAVE – Save PowerPoint file
WORD_OPEN_1 – Open Word file
WORD_MODIFY_1 – Modify Word file
WORD_SAVE_1 – Save Word file
IE_OPEN_1 – Open Internet Explorer
IE_APACHE – Browse Apache doc in IE
EXCEL_OPEN_1 – Open Excel file
EXCEL_SORT – Sort column in Excel
EXCEL_SAVE_1 – Save Excel file

A “sleep” of random length is inserted between each operation to simulate user think time. One execution of the whole set of operations is called an “iteration” and takes about 57 minutes. Several of these operations consist of many sub-operations. For instance, the PPT_SLIDESHOW operation consists of 10 sub-operations where each displays a slide in a PowerPoint document followed by a pause. Only the latency to display the slide is timed, and not the time spent sleeping. The latencies of the sub-operations are summed to give the operation latency, and all the operation latencies within one iteration of one user are summed to yield the “total latency”.

AutoIt3, an open-source scripting language, is used on the server side to automate the operations. CSTK Client Launcher (a utility that allows the tester to create and launch multiple ICA client sessions) is used on a client machine to start the users (sessions). Each user is started in a staggered fashion so that the last user is starting when the first user is close to finishing its first iteration. This strategy avoids synchronizing the execution of any operation across users. Each user runs six iterations. The “average total latency” is determined by averaging all the total latencies across the middle four iterations (i.e., the ones where all users are running at steady state), and across all users.

Note that it is important to time many different kinds of desktop applications: timing just a few operations (or even just one, as has been done in other publications) can give a very distorted view of overall performance. With a similar philosophy, we gather CPU data over nearly four hours of steady state to ensure the utilization statistics are solid. The first figure shows the average total latency as a function of the number of users for XenServer, ESX, and Native.

[Figure: average total latency vs. number of users for XenServer, ESX, and Native]

The two horizontal lines labeled “QoS” denote the Native latency for 35 and 38 users. Either of these may be considered a reasonable maximum Quality of Service for latency. They correspond to somewhat less or somewhat more than half, respectively, of the available CPU resources (see the CPU figure below), which is a commonly used target for XenApp. At higher utilizations, not only does the latency increase rapidly, but operations may start to fail. We required that all operations succeed (just like a real user expects!) for a test to be deemed successful. The points where the QoS lines cross the ESX and XenServer curves give the number of users that can be supported with the same total latency. Normalizing by the number of Native users (35 or 38) gives the fraction of Native users each virtualization product can support at the given total latency:

[Figure: fraction of Native users supported by ESX and XenServer at the two QoS latencies]

ESX consistently supports about 86% of the native number of users, while XenServer supports about 77%. Shown below is the average CPU utilization during the second to fifth iteration of the last user, given as a percentage of a single core. Perfmon was used for Native, esxtop for ESX, and xentop for XenServer. ESX uses less CPU than XenServer no matter how the comparison is made: for a given number of users, or for a given total latency:

[Figure: average CPU utilization vs. number of users for Native, ESX, and XenServer]

XenApp and other products that virtualize applications are prime candidates to be run in a VM. These results show that ESX can do so efficiently compared to using a physical machine. This was shown with a benchmark that represents a real desktop workload, uses a metric that includes the latencies of all operations, and requires that all operations complete successfully. Furthermore, ESX supports about 13% more users than XenServer at a given latency, while using less CPU.

ESX Runs Java Virtual Machines with Near-Native Performance

Java workloads are becoming increasingly commonplace in the datacenter, and consequently Java benchmarks are among the most frequently run server performance tests. In virtualized environments, single-VM Java performance tests are common, but there are a number of reasons why they are neither interesting nor particularly relevant to production systems. Primary among these is that system administrators nearly always run multiple VMs to make better use of their multi-core systems. Benchmarking should reflect this. While ESX performs very well on single-VM tests, we show here that ESX also achieves very close to native performance for configurations that are both realistic and well-tuned.

There are several subtleties in comparing Java performance on native and virtualized platforms:

  • Java applications are often split into multiple Java Virtual Machines (JVMs). In particular, nearly all the published SPECjbb2005 (the most widely-used Java benchmark) tests are run this way, usually with 2 CPU cores per JVM. Since this is done in order to achieve best performance, the appropriate virtualized analog is to run each JVM in a separate virtual machine (VM), with 2 virtual CPUs (vCPUs) per VM. This also allows all the advantages of virtualization to be gained (scheduling flexibility, resource allocation, high availability, etc.).
  • A native machine is often booted with a reduced number of CPUs in order to compare with a small VM. However, such experiments are usually irrelevant because different subsets of the hardware are used for the various cases, or simply because fewer resources are used than is normal. It is better to fully utilize all CPUs for all tests.
  • Disabling cache line prefetching in the BIOS and enabling large pages in the OS typically help Java application performance greatly. These are recommended for both benchmarking and general production use. On the other hand, CPU affinity can improve benchmark performance significantly but is not used here (nor is it recommended for production) due to its inflexibility.

We performed a number of experiments to study the performance of SPECjbb2005 running natively and on VMware Infrastructure 3. We first ran SPECjbb2005 natively in multi-JVM mode on a 2-socket machine equipped with 3 GHz Clovertown quad-core processors (8 cores total). The OS is Windows Server 2008 x64, booted with all 32 GB and 8 CPUs of the machine. Four JVMs were used, each with a 3700 MB Java heap. Details of the software and hardware configurations are given below. The reported throughput is 236,789 bops, which compares well with published native scores on similar hardware of around 252,000 bops (the difference is mostly accounted for by the use of CPU affinity for the published results). See, for example, http://www.spec.org/jbb2005/results/res2007q4/. We focused on out-of-the-box performance, so little effort was spent on tuning; however, reasonable performance is necessary to make the results credible.

For the virtualized case, we ran conforming single-JVM tests in each of 4 identical VMs on ESX 3.5 U1. Each VM was booted with 2 vCPUs and 5 GB memory, but otherwise the software and virtual hardware configurations were the same as the native setup. As with the native multi-JVM case, subtests with 1 through 8 warehouses were run, and performance for warehouses 2 – 4 was averaged for each JVM/VM (as per the run rules). Unlike the multi-JVM case, the synchronization of the tests across the 4 VMs is never perfect; however, it was never worse than 5 seconds out of the 12-minute measurement interval (4 minutes per subtest). Moreover, a lack of synchronization cannot give a performance advantage to any one VM, since all VMs continue to run (warehouses 5 – 8) after the measurement interval. In the first figure, we present the individual scores of the VMs and compare them to the individual scores of the native JVMs plus their average (reported throughput divided by the number of JVMs).

[Figure: SPECjbb2005 throughput for each of 4 VMs compared to the 4 native JVMs and their average]

Each of the VMs is just 2.2 – 2.5% slower than the average of the native JVMs. Furthermore, the ESX scheduler is able to automatically ensure that each VM gets essentially the same share of the computational resources; fairness is nearly perfect. The native scheduler has a hard time doing this with JVMs: the performance of individual JVMs ranges from 11% slower to 6% faster than the average. Using CPU affinity in the native case completely fixes the fairness issue as well as increases the performance by 5%.

So why not use CPU affinity all the time? While it is often helpful for benchmarks, it is usually impractical to use effectively in real production systems. Instead of 4 VMs/JVMs, which fit very nicely on our 8-core machine, what if a different number is required for other reasons (isolation, etc.)? Then affinity makes no sense at all. A little “reality” is introduced here by repeating the above tests with 5 JVMs for Native and 5 VMs for ESX. No other changes were needed except that expected_peak_warehouse had to be manually set to 2 for the native case. The reported throughput for Native was 233,073 bops, which is only a 1.6% drop from the 4-JVM test.

[Figure: SPECjbb2005 throughput for each of 5 VMs compared to the 5 native JVMs and their average]

The VMs running on ESX are 1.3 – 3.4% slower than the average of the native JVMs. The fairness of ESX is excellent. The performance of the native JVMs ranges from 27% faster to 30% slower than the same average. This poor fairness means the performance of individual JVMs is unpredictable, even though the overall performance is good.

These results show that ESX has very little overhead for this CPU-intensive Java application, in both fully-committed and over-committed scenarios. In addition, the ESX scheduler ensures excellent fairness across the VMs. This is important for predictability of the performance of individual VMs and allows more precise resource management.

Benchmark configuration

Application:             SPECjbb2005 version 1.07

4 or 5 JVMs for Native, single JVM per VM for ESX

expected_peak_warehouse = 2

Subtests: 1 – 8 warehouses, average throughput for 2 – 4 warehouses

Java:                       BEA JRockit R27.5 64 bit for Windows

Java options:          -Xmx3700m -Xms3700m -Xns3000m -XXaggressive -Xgc:genpar
                       -XXgcthreads=2 -XXthroughputCompaction -XXlazyUnlocking
                       -XXtlaSize:min=4k,preferred=512k -XXcallProfiling

OS:                           Windows Server 2008 x64 Enterprise Edition

Large pages enabled with ‘Lock pages in memory’ in Local Security Settings

Hardware:              HP DL380 G5, 2 sockets, Xeon X5365 (3 GHz Clovertown), 32 GB

BIOS:                       Disabled ‘Hardware Prefetcher’ and ‘Adjacent Cache Line Prefetch’

Hypervisor:            ESX Server 3.5.0 U1 build 90092

Virtual hardware:  2 CPUs, 5 GB memory
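
Putting the configuration together, each native JVM (or the single JVM in each VM) would be launched roughly as follows. This is a sketch, assuming the standard SPECjbb2005 driver class and property file from the benchmark kit:

# Launch one SPECjbb2005 instance with the JRockit options listed above
java -Xmx3700m -Xms3700m -Xns3000m -XXaggressive -Xgc:genpar \
     -XXgcthreads=2 -XXthroughputCompaction -XXlazyUnlocking \
     -XXtlaSize:min=4k,preferred=512k -XXcallProfiling \
     spec.jbb.JBBmain -propfile SPECjbb.props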

Networking Performance and Scaling in Multiple VMs

Last month we published a Tech Note summarizing networking throughput results using ESX Server 3.0.1 and XenEnterprise 3.2.0. Multiple NICs were used in order to achieve the maximum throughput possible in a single uniprocessor VM. While these results are very useful for evaluating the virtualization overhead of networking, a more common configuration is to spread the networking load across multiple VMs. We present results for multi-VM networking in a new paper just published. Only a single 1 Gbps NIC is used per VM, but with up to four VMs running simultaneously. This simulates a consolidation scenario of several machines, each with substantial, but not extreme, networking I/O. Unlike the multi-NIC paper, there is no exact native analog, but we ran the same total load in an SMP native Windows machine for comparison. The results are similar to the earlier ones: ESX stays close to native performance, achieving up to 3400 Mbps for the 4-VM case. XenEnterprise peaks at 3 VMs and falls off to 62-69% of the ESX throughput with 4 VMs. According to the XenEnterprise documentation, only three physical NICs are supported in the host, even though the UI let us configure and run four physical NICs without error or warning. This is not surprising given the performance. We then tried a couple of experiments (like making dom0 use more than 1 CPU) to fix the bottleneck, but only succeeded in further reducing the throughput. The virtualization layer in ESX is always SMP, and together with a battle-tested scheduler and support for 32 e1000 NICs, it scales to many heavily-loaded VMs. Let us know if you’re able to reach the limits of ESX networking!