
HPC Application Performance on ESX 4.1: NAMD

This is the second part in an ongoing series exploring the performance of virtualized HPC applications. In the first part we described the setup and considered memory bandwidth. Here we look at network latency in the context of a single application. Evaluating the effect of network latency in general is far more difficult, since HPC applications range from those needing microsecond latency to embarrassingly parallel codes that work well on slow networks. NAMD is a molecular dynamics code that is definitely not embarrassingly parallel, but it is known to run well over 1 GbE TCP/IP, at least for small clusters. As such, it represents the network requirements of an important class of HPC applications.

NAMD is a molecular dynamics application used to investigate the properties of large molecules. It supports both shared-memory parallelism and multiple-machine parallelism over TCP/IP. The native results use up to 16 processes on a single machine (“local” mode). Future work will use multiple machines, but some idea of the performance issues involved can be obtained by running multiple VMs in various configurations on the same physical host.

The benchmark consists of running 500 steps of the Satellite Tobacco Mosaic Virus (STMV) simulation. STMV contains slightly over one million atoms, which is large enough to enable good scaling on fairly large clusters. Shown below are elapsed time measurements for the various configurations. Each result is the average of three runs, and repeatability is good. The virtual NIC is e1000 in all of the virtualized cases.
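
For readers who want to reproduce this kind of measurement, the sketch below (Python, driving the usual charmrun launcher in local mode) shows the basic loop: run the STMV input, repeat three times, and average the elapsed time. The input file name (stmv.namd), the process count, and the assumption that charmrun and namd2 are on the PATH are illustrative choices, not details taken from this setup.

```python
#!/usr/bin/env python3
"""Minimal sketch of the measurement loop: run NAMD's STMV benchmark in
local (single-machine) mode several times and report the mean elapsed time.
File names and paths are assumptions, not taken from the original setup."""

import subprocess
import time

RUNS = 3            # each reported number is the average of three runs
PROCESSES = 8       # charmrun's "+p" sets the number of worker processes
INPUT = "stmv.namd" # assumed name of the 500-step STMV benchmark input

def run_once(nproc: int) -> float:
    """Run NAMD in local mode and return the elapsed wall-clock seconds."""
    cmd = ["charmrun", f"+p{nproc}", "++local", "namd2", INPUT]
    start = time.time()
    subprocess.run(cmd, check=True, capture_output=True, text=True)
    return time.time() - start

if __name__ == "__main__":
    times = [run_once(PROCESSES) for _ in range(RUNS)]
    mean = sum(times) / len(times)
    print(f"{PROCESSES} processes: mean elapsed {mean:.0f} s "
          f"(runs: {', '.join(f'{t:.0f}' for t in times)})")
```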

An apples-to-apples comparison between native and virtual is obtained by disabling HT and using a single 8-vCPU VM. The VM is configured with 12 GB of memory, and default ESX parameters are used otherwise. With no networking involved, the virtualization overhead is just 1%, as shown in Table 1.

Table 1. NAMD elapsed time in seconds, STMV molecule, HT disabled

Total Processes   Native   Virtual
4                 1748     1768
8                 915      926
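
The overhead quoted above can be recomputed directly from the Table 1 numbers:

```python
# Recompute the virtualization overhead from the Table 1 elapsed times (seconds).
table1 = {4: (1748, 1768), 8: (915, 926)}   # processes: (native, virtual)
for procs, (native, virtual) in table1.items():
    overhead_pct = (virtual / native - 1.0) * 100.0
    print(f"{procs} processes: virtual is {overhead_pct:.1f}% slower than native")
# Prints 1.1% at 4 processes and 1.2% at 8 processes: about 1% in both cases.
```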

The effect of splitting the application across multiple machines and using different network configurations can be tested in a virtual environment. For these tests HT is enabled to get the full performance of the machine. The single-VM case is configured as above. The 2-VM cases are configured with 12 GB, 8 vCPUs, and preferHT=1 (so that each VM can be scheduled on a single NUMA node). The 4-VM cases use 6 GB, 4 vCPUs, and preferHT=0. When multiple VMs communicate over the same vSwitch, ESX handles all of the traffic in memory. For the multiple-vSwitch cases, each vSwitch is associated with a physical NIC that is connected to a physical switch. Since all networking traffic must go through this switch, these configurations are equivalent to using multiple hosts in terms of inter-VM communication latency. An overview of vSwitches and networking in ESX is available here.
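
For reference, the per-VM settings described above can be summarized as .vmx-style options. This is only a sketch: the vCPU counts, memory sizes, and preferHT values come from the text, while the option names (numvcpus, memsize, numa.vcpu.preferHT) are the usual ESX ones and everything else is left at defaults.

```python
# Per-VM settings for the Table 2 configurations, expressed as .vmx-style options.
# prefer_ht=None means the option is left at its default (the single-VM case).
CONFIGS = {
    "1 VM":  {"count": 1, "vcpus": 8, "mem_mb": 12288, "prefer_ht": None},
    "2 VMs": {"count": 2, "vcpus": 8, "mem_mb": 12288, "prefer_ht": True},
    "4 VMs": {"count": 4, "vcpus": 4, "mem_mb": 6144,  "prefer_ht": False},
}

def vmx_lines(cfg):
    """Return .vmx-style option lines for one VM of a configuration."""
    lines = [f'numvcpus = "{cfg["vcpus"]}"', f'memsize = "{cfg["mem_mb"]}"']
    if cfg["prefer_ht"] is not None:
        value = "TRUE" if cfg["prefer_ht"] else "FALSE"
        lines.append(f'numa.vcpu.preferHT = "{value}"')
    return lines

for name, cfg in CONFIGS.items():
    print(f'{name} ({cfg["count"]} x {cfg["vcpus"]} vCPUs, {cfg["mem_mb"] // 1024} GB each):')
    for line in vmx_lines(cfg):
        print("  " + line)
```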

Table 2. NAMD elapsed time in seconds, STMV molecule, HT enabled

Configuration        4 processes   8 processes   16 processes
Native               1761          1020          796
1 VM                 1766          923           N/A
2 VMs, 1 vSwitch     1779          928           787
2 VMs, 2 vSwitches   1800          965           806
4 VMs, 1 vSwitch     1774          940           810
4 VMs, 4 vSwitches   1885          1113          903

The single-VM case shows that HT has little effect on ESX performance when the extra logical processors are not used. However, HT does slow down the native 8-process case significantly. This appears to be due to Linux not scheduling one process per core when it has the opportunity to do so, which the ESX scheduler does by default.

Scalability from 4 to 8 processes for the single-vSwitch cases is close to 1.9X, and from 8 to 16 processes (using the same number of cores, but taking advantage of HT) it is about 1.17X. This is excellent scaling. Networking over the physical switch reduces performance somewhat, especially with four vSwitches. Native scaling suffers because the application does not manage NUMA resources itself and Linux is limited in how well it can do this on the application's behalf, whereas ESX manages NUMA placement for each VM. This allows one of the 16-process virtualized cases to be slightly faster than native, despite the virtualization and multiple-machine overheads.

The 16-process cases have the best absolute performance and therefore correspond to how NAMD would actually be configured in practice. Here the performance of all the virtualized cases is very close to native, except for the 4-vSwitch case, where the extra networking overhead has a significant effect. This is expected, and that case should not be compared directly to native since it models four separate hosts. We plan to investigate multiple-host scaling soon to enable a direct comparison. A useful simulation needs up to 10 million steps, which would only be practical on a large cluster, and only if all the software components scale very well.
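
The scaling factors quoted above follow directly from the Table 2 numbers, for example:

```python
# Reproduce the scaling factors quoted above from the Table 2 times (seconds).
table2 = {
    "2 VMs, 1 vSwitch": {4: 1779, 8: 928, 16: 787},
    "4 VMs, 1 vSwitch": {4: 1774, 8: 940, 16: 810},
}
for name, t in table2.items():
    print(f"{name}: 4->8 speedup {t[4] / t[8]:.2f}x, 8->16 speedup {t[8] / t[16]:.2f}x")
# 2 VMs, 1 vSwitch: 4->8 speedup 1.92x, 8->16 speedup 1.18x
# 4 VMs, 1 vSwitch: 4->8 speedup 1.89x, 8->16 speedup 1.16x
```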

These tests show that a commonly used molecular dynamics application can be virtualized on a single host with little or no overhead. This particular application is representative of HPC workloads with moderate networking requirements. Simulating four separate hosts by forcing networking to go outside the box causes a slowdown of about 12%, but the corresponding native multi-host test is likely to see some slowdown as well. We plan to expand the testing to multiple hosts and to continue to search for workloads that test the boundaries of what is possible in a virtualized environment.
