

HPC Application Performance on ESX 4.1: NAMD

This is the second part in an ongoing series exploring the performance of virtualized HPC applications. In the first part (/performance/2010/09/hpc-application-performance-on-esx-41-stream.html) we described the setup and considered memory bandwidth. Here we look at network latency in the context of a single application. Evaluating the effect of network latency in general is far more difficult, since HPC apps range from those needing microsecond latency to embarrassingly parallel apps that work well on slow networks. NAMD is a molecular dynamics code that is definitely not embarrassingly parallel, but it is known to run fine over 1 GbE TCP/IP, at least for small clusters. As such it represents the network requirements of an important class of HPC apps.


NAMD is a molecular dynamics application used to investigate the properties of large
molecules. NAMD supports both shared-memory parallelism and multiple-machine
parallelism using TCP/IP. The native results use up to 16 processes on a single
machine (“local” mode). Future work will use multiple machines, but some idea of
the performance issues involved can be obtained by running multiple VMs in
various configurations on the same physical host. The benchmark consists of
running 500 steps of the Satellite Tobacco Mosaic Virus (STMV) benchmark. STMV consists of slightly over 1 million atoms, which is large enough to enable good scaling on fairly large clusters. Shown below are elapsed time measurements for various configurations. Each is the average of three runs, and the repeatability is good. The virtual NIC is e1000 for all the virtualized cases.
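
As a concrete illustration of the measurement procedure, the sketch below drives a multicore NAMD build over several process counts and averages three runs per configuration. The binary and input paths are assumptions, and the parsing relies on the WallClock summary line that NAMD prints at the end of a run; this is not the exact script used for these tests.

#!/usr/bin/env python3
# Hypothetical benchmark driver: run NAMD on the STMV input at several process
# counts, repeat each run three times, and report the average elapsed time.
# Binary and input paths are assumptions; adjust for the actual installation.
import re
import subprocess

NAMD_BIN = "./namd2"     # multicore NAMD build (assumed path)
CONFIG = "stmv.namd"     # standard STMV benchmark input, set to 500 steps
REPEATS = 3

def run_once(nprocs):
    """Run NAMD with +p<nprocs> and return the reported wall-clock seconds."""
    out = subprocess.run([NAMD_BIN, "+p%d" % nprocs, CONFIG],
                         capture_output=True, text=True, check=True).stdout
    # NAMD ends its log with a summary line of the form "WallClock: <secs> ..."
    return float(re.search(r"WallClock:\s+([\d.]+)", out).group(1))

for nprocs in (4, 8, 16):
    times = [run_once(nprocs) for _ in range(REPEATS)]
    print("%2d processes: %.0f s (average of %d runs)"
          % (nprocs, sum(times) / REPEATS, REPEATS))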

An
apples-to-apples comparison between native and virtual is obtained by disabling
HT and using a single 8-vCPU VM. The VM is configured
with 12GB of memory, and default ESX parameters are used. With no networking, the virtual overhead is just 1%, as shown in Table 1.


Table 1. NAMD elapsed time, STMV molecule, HT disabled


                        Total processes
                        4          8
Native                  1748       915
1 VM                    1768       926
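
The roughly 1% overhead noted above follows directly from these elapsed times; a quick check:

# Virtualization overhead computed from Table 1 (HT disabled).
table1 = {4: (1748, 1768), 8: (915, 926)}   # processes: (native, 1 VM) elapsed times

for nprocs, (native, virt) in sorted(table1.items()):
    print("%d processes: %.1f%% overhead" % (nprocs, 100.0 * (virt - native) / native))
# 4 processes: 1.1% overhead
# 8 processes: 1.2% overhead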

The effect of splitting the application across multiple machines and using different
network configurations can be tested in a virtual environment. For these tests
HT is enabled to get the full performance of the machine. The single VM case is
configured as above. The 2-VM cases are configured with 12GB, 8 vCPUs, and preferHT=1 (so each VM
can be scheduled on a NUMA node). The 4-VM cases have 6GB, 4 vCPUs, and preferHT=0. When
multiple VMs communicate using the same vSwitch, ESX
handles all the traffic in memory. For the multiple vSwitch cases, each vSwitch is
associated with a physical NIC which is connected to a physical switch. Since all networking traffic must go
through this switch, this configuration has the same inter-VM communication latencies as using multiple hosts. An overview of vSwitches and networking in ESX is available here: http://www.vmware.com/pdf/vsphere4/r41/vsp_41_esx_server_config.pdf
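
To keep the test matrix straight, the VM shapes can be summarized as follows. This is just a restatement of the configurations above; the note about where preferHT lives (numa.vcpu.preferHT in the .vmx file) reflects the usual ESX setting rather than anything stated in this post.

# VM shapes used for the HT-enabled tests (Table 2); a restatement of the text above.
# preferHT is normally set per VM as numa.vcpu.preferHT in the .vmx file (assumed name).
vm_shapes = {
    "1 VM":  {"vms": 1, "vcpus": 8, "mem_gb": 12, "preferHT": False},  # default ESX parameters
    "2 VMs": {"vms": 2, "vcpus": 8, "mem_gb": 12, "preferHT": True},   # one VM per NUMA node
    "4 VMs": {"vms": 4, "vcpus": 4, "mem_gb": 6,  "preferHT": False},
}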


Table 2. NAMD elapsed time, STMV molecule, HT enabled


                        Total processes
                        4          8          16
Native                  1761       1020       796
1 VM                    1766       923        -
2 VMs, 1 vSwitch        1779       928        787
2 VMs, 2 vSwitches      1800       965        806
4 VMs, 1 vSwitch        1774       940        810
4 VMs, 4 vSwitches      1885       1113       903

The single VM case shows that HT has little effect on ESX performance when the extra
logical processors are not used. However, HT does slow down the native 8 process
case significantly. This appears to be due to Linux not scheduling one process
per core when it has the opportunity, which the ESX scheduler does by default.
Scalability from 4 to 8 processes for the single-vSwitch cases is close to 1.9X, and from 8 to 16 processes
(using the same number of cores, but taking advantage of HT) it is 1.17X. This
is excellent scaling. Networking over the switch reduces the performance
somewhat, especially for four vSwitches. Scaling for
native is hurt because the application does not manage NUMA resources itself,
and Linux is limited by how well it can do this. This allows one of the
16-process virtualized cases to be slightly faster than native, despite the
virtualization and multiple-machine overheads. The 16-process cases have the
best absolute performance, and therefore correspond to how NAMD would actually
be configured in practice. Here, the performance of all the virtualized cases is
very close to native, except for the 4-vSwitch case where the extra overhead of
networking has a significant effect. This is expected and should not be compared
to the native case since the virtual case models four hosts. We plan to
investigate multiple-host scaling soon to enable a direct comparison. A useful
simulation needs up to 10 million steps, which would only be practical on a
large cluster and only if all the software components scale very well.
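
For reference, the scaling ratios quoted above can be reproduced directly from the Table 2 elapsed times for the two single-vSwitch configurations:

# Scaling ratios computed from the Table 2 elapsed times.
single_vswitch = {
    "2 VMs, 1 vSwitch": (1779, 928, 787),   # elapsed time at 4, 8, 16 processes
    "4 VMs, 1 vSwitch": (1774, 940, 810),
}

for name, (t4, t8, t16) in single_vswitch.items():
    print("%s:  4->8 = %.2fx,  8->16 = %.2fx" % (name, t4 / t8, t8 / t16))
# 2 VMs, 1 vSwitch:  4->8 = 1.92x,  8->16 = 1.18x
# 4 VMs, 1 vSwitch:  4->8 = 1.89x,  8->16 = 1.16x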


These tests
show that a commonly-used molecular dynamics application can be virtualized on a
single host with little or no overhead. This particular app is representative of
HPC workloads with moderate networking requirements. Simulating four separate
hosts by forcing networking to go outside the box causes a slowdown of about
12%, but it is likely that the corresponding native test would see some slowdown as
well. We plan to expand the testing to multiple hosts and to continue to search
for workloads that test the boundaries of what is possible in a virtualized
environment.
