This is the second part of an ongoing series exploring the performance issues of
virtualizing HPC applications. In the first part
(/performance/2010/09/hpc-application-performance-on-esx-41-stream.html)
we described the setup and considered memory bandwidth. Here we look at network
latency in the context of a single application. Evaluating the effect of network
latency in general is far more difficult, since HPC apps range from ones needing
microsecond latency to embarrassingly parallel apps that work well on slow
networks. NAMD is a molecular dynamics code that is definitely not embarrassingly
parallel but is known to run fine over 1 GbE TCP/IP, at least for small clusters.
As such, it represents the network requirements of an important class of HPC
apps.
NAMD is a molecular dynamics application used to investigate the properties of large
molecules. NAMD supports both shared-memory parallelism and multiple-machine
parallelism using TCP/IP. The native results use up to 16 processes on a single
machine (“local” mode). Future work will use multiple machines, but some idea of
the performance issues involved can be obtained by running multiple VMs in
various configurations on the same physical host. The benchmark consists of
running 500 steps of the Satellite Tobacco Mosaic Virus (“STMV”) simulation.
STMV consists of slightly over 1 million atoms, which is large enough to enable
good scaling on fairly large clusters. Shown below are elapsed time measurements
for various configurations. Each is an average of 3 runs, and the repeatability
is good. The virtual NIC is e1000 for all the virtualized cases.
An apples-to-apples comparison between native and virtual is obtained by
disabling Hyper-Threading (HT) and using a single 8-vCPU VM. The VM is configured
with 12 GB of memory, and default ESX parameters are used. With no networking,
the virtualization overhead is just 1%, as shown in Table 1.
Table 1. NAMD elapsed time, STMV molecule, HT disabled
Total processes        4       8
Native              1748     915
1 VM                1768     926
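The 1% figure can be checked directly against Table 1. The following is a minimal sketch, not part of the original measurements: the dictionary names and the `overhead_pct` helper are illustrative, and the only real data are the four elapsed times copied from the table.

```python
# Elapsed times copied from Table 1 (HT disabled), keyed by process count.
native = {4: 1748, 8: 915}
one_vm = {4: 1768, 8: 926}


def overhead_pct(virtual, baseline):
    """Relative slowdown of `virtual` versus `baseline`, in percent."""
    return 100.0 * (virtual - baseline) / baseline


# Virtualization overhead for each process count.
overheads = {n: overhead_pct(one_vm[n], native[n]) for n in native}

for n, pct in overheads.items():
    print(f"{n} processes: {pct:.1f}% overhead")
```

Both process counts come out at a little over 1%, consistent with the figure quoted above.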
The effect of splitting the application across multiple machines and using different
network configurations can be tested in a virtual environment. For these tests
HT is enabled to get the full performance of the machine. The single VM case is
configured as above. The 2-VM cases are configured with 12 GB, 8 vCPUs, and
preferHT=1 (so each VM can be scheduled on a NUMA node). The 4-VM cases have
6 GB, 4 vCPUs, and preferHT=0. When multiple VMs communicate using the same
vSwitch, ESX handles all the traffic in memory. For the multiple-vSwitch cases,
each vSwitch is associated with a physical NIC, which is connected to a physical
switch. Since all inter-VM traffic must go through this switch, this
configuration has the same inter-VM communication latencies as using multiple
hosts. An overview of vSwitches and networking in ESX is available at
http://www.vmware.com/pdf/vsphere4/r41/vsp_41_esx_server_config.pdf.
Table 2. NAMD elapsed time, STMV molecule, HT enabled
Total processes            4       8      16
Native                  1761    1020     796
1 VM                    1766     923       -
2 VMs, 1 vSwitch        1779     928     787
2 VMs, 2 vSwitches      1800     965     806
4 VMs, 1 vSwitch        1774     940     810
4 VMs, 4 vSwitches      1885    1113     903
The single VM case shows that HT has little effect on ESX performance when the extra
logical processors are not used. However, HT does slow down the native 8-process
case significantly. This appears to be due to Linux not scheduling one process
per core when it has the opportunity, which the ESX scheduler does by default.
Scalability from 4 to 8 processes for the single-vSwitch cases is close to 1.9X,
and from 8 to 16 processes (using the same number of cores, but taking advantage
of HT) it is 1.17X. This is excellent scaling. Networking over the physical
switch reduces the performance somewhat, especially for four vSwitches. Scaling for
native is hurt because the application does not manage NUMA resources itself,
and Linux is limited by how well it can do this. This allows one of the
16-process virtualized cases to be slightly faster than native, despite the
virtualization and multiple-machine overheads. The 16-process cases have the
best absolute performance, and therefore correspond to how NAMD would actually
be configured in practice. Here, the performance of all the virtualized cases is
very close to native, except for the 4-vSwitch case where the extra overhead of
networking has a significant effect. This is expected and should not be compared
to the native case since the virtual case models four hosts. We plan to
investigate multiple-host scaling soon to enable a direct comparison. A useful
simulation needs up to 10 million steps, which would only be practical on a
large cluster and only if all the software components scale very well.
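The scaling factors quoted above follow from the elapsed times in Table 2. The sketch below recomputes them; the `times` dictionary and the `speedup` helper are illustrative names, and only the numbers come from the table.

```python
# Elapsed times copied from Table 2 (HT enabled), keyed by total process count.
times = {
    "Native":           {4: 1761, 8: 1020, 16: 796},
    "2 VMs, 1 vSwitch": {4: 1779, 8: 928, 16: 787},
    "4 VMs, 1 vSwitch": {4: 1774, 8: 940, 16: 810},
}


def speedup(row, fewer, more):
    """Ratio of elapsed times when going from `fewer` to `more` processes."""
    return row[fewer] / row[more]


for name, row in times.items():
    s_4_8 = speedup(row, 4, 8)
    s_8_16 = speedup(row, 8, 16)
    print(f"{name:18s} 4->8: {s_4_8:.2f}X   8->16: {s_8_16:.2f}X")
```

The native row is included for comparison. Its weaker 4-to-8 ratio reflects the slow native 8-process result discussed above.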
These tests show that a commonly used molecular dynamics application can be
virtualized on a single host with little or no overhead. This particular app is
representative of HPC workloads with moderate networking requirements.
Simulating four separate hosts by forcing networking to go outside the box
causes a slowdown of about 12%, but it is likely that the corresponding native
test would see some slowdown as well. We plan to expand the testing to multiple
hosts and to continue to search for workloads that test the boundaries of what
is possible in a virtualized environment.
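The post does not state which baseline the 12% slowdown is measured against. As a rough check, the sketch below compares the 16-process, 4-vSwitch result from Table 2 with each of the other 16-process results. The names `t16` and `slowdown_pct` are illustrative, and the choice of baselines is an assumption.

```python
# 16-process elapsed times copied from Table 2 (HT enabled).
t16 = {
    "Native": 796,
    "2 VMs, 1 vSwitch": 787,
    "2 VMs, 2 vSwitches": 806,
    "4 VMs, 1 vSwitch": 810,
    "4 VMs, 4 vSwitches": 903,
}


def slowdown_pct(case, baseline):
    """Percent by which `case` is slower than `baseline` at 16 processes."""
    return 100.0 * (t16[case] - t16[baseline]) / t16[baseline]


target = "4 VMs, 4 vSwitches"
for baseline in t16:
    if baseline != target:
        print(f"vs {baseline:18s}: {slowdown_pct(target, baseline):.1f}% slower")
```

Measured against the same four VMs sharing one vSwitch, the slowdown is a little under 12%. Measured against native it is somewhat larger.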
