Monthly Archives: September 2010

Performance of Enterprise Java Applications on VMware vSphere 4.1 and SpringSource tc Server

VMware recently released a whitepaper presenting the results of a performance investigation using a representative enterprise-level Java application on VMware vSphere 4.1. The results of the tests discussed in that paper show that enterprise-level Java applications can provide excellent performance when deployed on VMware vSphere 4.1.  The main topics covered by the paper are a comparison of virtualized and native performance, and an examination of scale-up versus scale-out tradeoffs.

The paper first covers a set of tests performed to determine whether an enterprise-level Java application virtualized on VMware vSphere 4.1 can provide performance equivalent to a native deployment configured with the same memory and compute resources. The tests used response time as the primary metric for comparing the performance of the native and virtualized deployments. The results show that at CPU utilization levels commonly found in real deployments, the native and virtual response times are close enough to provide an essentially identical user experience. Even at peak load, with CPU utilization near the saturation point, the peak throughput of the virtualized application was at least 90% of that of the native deployment.

The paper then discusses the results of an investigation into the performance impact of scaling up the configuration of a single VM (adding more vCPUs) versus scaling out to deploy the application on multiple smaller VMs. At loads below 80% CPU utilization, the response times of scale-up and scale-out configurations using the same total number of vCPUs were effectively equivalent. At higher loads, the peak-throughput results for the different configurations were also similar, with a slight advantage for the scale-out configurations.

The application used in these tests was Olio, a multi-tier enterprise application that implements a complete social-networking website. Olio was deployed on SpringSource tc Server, running both natively and virtualized on vSphere 4.1.

For more information, please read the full paper at http://www.vmware.com/resources/techresources/10158.  In addition, the author will be publishing additional results on his blog at http://communities.vmware.com/blogs/haroldr.

HPC Application Performance on ESX 4.1: NAMD

This is the second part in an ongoing series exploring the performance of virtualized HPC applications. In the first part we described the setup and considered memory bandwidth. Here we look at network latency in the context of a single application. Evaluating the effect of network latency in general is far more difficult, since HPC apps range from those needing microsecond latency to embarrassingly parallel apps that work well on slow networks. NAMD is a molecular dynamics code that is definitely not embarrassingly parallel, but it is known to run fine over 1 GbE TCP/IP, at least for small clusters. As such, it represents the network requirements of an important class of HPC apps.

NAMD is a molecular dynamics application used to investigate the properties of large molecules. It supports both shared-memory parallelism and multiple-machine parallelism over TCP/IP. The native results use up to 16 processes on a single machine ("local" mode). Future work will use multiple machines, but some idea of the performance issues involved can be obtained by running multiple VMs in various configurations on the same physical host. The benchmark consists of running 500 steps of the Satellite Tobacco Mosaic Virus (STMV) simulation. STMV consists of slightly over 1 million atoms, which is large enough to enable good scaling on fairly large clusters. Shown below are elapsed-time measurements, in seconds, for various configurations. Each is an average of 3 runs, and the repeatability is good. The virtual NIC is e1000 in all the virtualized cases.
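For reference, NAMD jobs of this kind are normally launched with the charmrun wrapper that ships with NAMD; the process count and file names below are illustrative of the runs described here, not the exact commands used in these tests:

charmrun +p8 namd2 stmv.namd   # 8 NAMD processes on the local machine ("local" mode)

For runs spanning machines (or VMs), charmrun takes a node list (++nodelist <file>) and spreads the processes across them over TCP/IP.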

An apples-to-apples comparison between native and virtual is obtained by disabling HT and using a single 8-vCPU VM. The VM is configured with 12GB of memory, and default ESX parameters are used. With no networking, the virtualization overhead is just 1%, as shown in Table 1.

Table 1. NAMD elapsed time in seconds, STMV molecule, HT disabled

Total Processes    Native    Virtual
4                  1748      1768
8                  915       926

The effect of splitting the application across multiple machines and using different network configurations can be tested in a virtual environment. For these tests HT is enabled to get the full performance of the machine. The single VM case is configured as above. The 2-VM cases are configured with 12GB, 8 vCPUs, and preferHT=1 (so each VM can be scheduled on a NUMA node). The 4-VM cases have 6GB, 4 vCPUs, and preferHT=0. When multiple VMs communicate using the same vSwitch, ESX handles all the traffic in memory. For the multiple vSwitch cases, each vSwitch is associated with a physical NIC which is connected to a physical switch. Since all networking traffic must go through this switch, this configuration will be the same as using multiple hosts in terms of inter-VM communication latencies. An overview of vSwitches and networking in ESX is available here.
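For concreteness, here is a minimal sketch of the per-VM resource settings as they would appear in each VM's .vmx file for the 2-VM cases; memsize and numvcpus are the standard keys, while the exact spelling of the preferHT option depends on the ESX generation (a host-wide Numa.PreferHT advanced setting in this generation, a per-VM numa.vcpu.preferHT option in later releases), so treat that part as an assumption:

memsize = "12288"    # 12 GB of guest memory
numvcpus = "8"       # 8 vCPUs; with preferHT=1 the scheduler counts logical
                     # processors, so the whole VM fits on one NUMA node

The 4-VM cases would instead use memsize = "6144" and numvcpus = "4", with preferHT left at its default of 0.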

Table 2. NAMD elapsed time in seconds, STMV molecule, HT enabled

Total Processes       4       8       16
Native                1761    1020    796
1 VM                  1766    923     N/A
2 VMs, 1 vSwitch      1779    928     787
2 VMs, 2 vSwitches    1800    965     806
4 VMs, 1 vSwitch      1774    940     810
4 VMs, 4 vSwitches    1885    1113    903

The single VM case shows that HT has little effect on ESX performance when the extra logical processors are not used. However, HT does slow down the native 8-process case significantly. This appears to be due to Linux not scheduling one process per core when it has the opportunity, which the ESX scheduler does by default.

Scalability from 4 to 8 processes for the single-vSwitch cases is close to 1.9X, and from 8 to 16 processes (using the same number of cores, but taking advantage of HT) it is about 1.17X. This is excellent scaling. Networking over the physical switch reduces performance somewhat, especially for four vSwitches. Native scaling suffers because the application does not manage NUMA resources itself, and Linux is limited in how well it can do this on the application's behalf. This allows one of the 16-process virtualized cases to be slightly faster than native, despite the virtualization and multiple-machine overheads.

The 16-process cases have the best absolute performance, and therefore correspond to how NAMD would actually be configured in practice. Here, the performance of all the virtualized cases is very close to native, except for the 4-vSwitch case, where the extra overhead of networking has a significant effect. This is expected, and this case should not be compared to native since it models four separate hosts. We plan to investigate multiple-host scaling soon to enable a direct comparison. A useful simulation needs up to 10 million steps, which would only be practical on a large cluster, and only if all the software components scale very well.
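As a quick check, these scaling factors are simply ratios of the Table 2 elapsed times for the single-vSwitch cases:

2 VMs, 1 vSwitch:  1779 s / 928 s ≈ 1.92X (4 to 8 processes),  928 s / 787 s ≈ 1.18X (8 to 16)
4 VMs, 1 vSwitch:  1774 s / 940 s ≈ 1.89X (4 to 8 processes),  940 s / 810 s ≈ 1.16X (8 to 16)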

These tests show that a commonly-used molecular dynamics application can be virtualized on a single host with little or no overhead. This particular app is representative of HPC workloads with moderate networking requirements. Simulating four separate hosts by forcing networking to go outside the box causes a slowdown of about 12%, but it is likely the corresponding native test will see some slowdown as well. We plan to expand the testing to multiple hosts and to continue to search for workloads that test the boundaries of what is possible in a virtualized environment.

Next: Memory

HPC Application Performance on ESX 4.1: Stream

Recently VMware has seen increased interest in migrating High Performance Computing (HPC) applications to virtualized environments. This is due to the many advantages virtualization brings to HPC, including consolidation, support for heterogeneous OSes, ease of application development, security, job migration, and cloud computing (all described here). Currently some subset of HPC applications virtualize well from a performance perspective. Our long-term goal is to extend this to all HPC apps, realizing that large-scale apps with the lowest latency and highest bandwidth requirements will be the most challenging. Users who run HPC apps are traditionally very sensitive to performance overhead, so it is important to quantify the performance cost of virtualization and properly weigh it against the advantages.

Compared to commercial apps (databases, web servers, and so on), which are VMware's bread and butter, HPC apps place their own set of requirements on the platform (OS/hypervisor/hardware) in order to execute well. Two common ones are low-latency networking (since a single app is often spread across a cluster of machines) and high memory bandwidth.

This article is the first in a series that will explore these and other aspects of HPC performance. Our goal will always be to determine what works, what doesn't, and how to get more of the former. The benchmark reported on here is Stream, a standard tool designed to measure memory bandwidth. It is a "worst case" micro-benchmark in the sense that it stresses memory as hard as possible; real applications will not achieve higher memory bandwidth.


All tests were performed on an HP DL380 with two Intel X5570 processors, 48 GB memory (12 × 4 GB DIMMs), and four 1-GbE NICs (Intel Pro/1000 PT Quad Port Server Adapter) connected to a switch. Guest and native OS is RHEL 5.5 x86_64. Hyper-threading is enabled in the BIOS, so 16 logical processors are available. Processors and memory are split between two NUMA nodes. A pre-GA lab version of ESX 4.1 was used, build 254859.

Test Results

The OpenMP version of Stream is used. It is built using a compiler switch as follows:

gcc -O2 -fopenmp stream.c -o stream

The number of simultaneous threads is controlled by the standard OpenMP environment variable (shown here for an 8-thread run):

export OMP_NUM_THREADS=8

The array size (N) and number of iterations (NTIMES) are hard-wired in the code as N=10^8 (for a single machine) and NTIMES=40. The large array size ensures that the processor cache provides little or no benefit. Stream reports maximum memory bandwidth in MB/sec for four tests: copy, scale, add, and triad (see the above link for descriptions of these). Here M stands for 1 million, not 2^20. The native results, as a function of the number of threads, are:

Table 1. Native memory bandwidth, MB/s

Threads    1       2       4       8       16
Copy       6388    12163   20473   26957   26312
Scale      5231    10068   17208   25932   26530
Add        7070    13274   21481   29081   29622
Triad      6617    12505   21058   29328   29889

Note that the scaling starts to fall off after two threads, and the memory links are essentially saturated at 8 threads. This is one reason why HPC apps often do not see much benefit from enabling Hyper-Threading. To achieve the maximum aggregate memory bandwidth in a virtualized environment, two virtual machines (VMs) with 8 vCPUs each were used. This is appropriate only for modeling apps that can be split across multiple machines. One instance of stream with N=5×10^7 was run in each VM simultaneously, so the total amount of memory accessed was the same as in the native test. The advanced configuration option preferHT=1 is used (see below). The bandwidths reported by the VMs are summed to get the total (a sketch of how such a two-VM run can be driven appears after Table 2). The results are shown in Table 2: just slightly greater bandwidth than for the corresponding native case.

Table 2. Virtualized total memory bandwidth, MB/s, 2 VMs, preferHT=1

Total Threads    2       4       8       16
Copy             12535   22536   27606   27104
Scale            10294   18824   26781   26537
Add              13578   24182   30676   30537
Triad            13070   23476   30449   30010
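
A minimal sketch of how such a two-VM run can be driven from a separate head node, assuming the VMs are reachable over ssh as vm1 and vm2 (hypothetical names) with the stream binary built in each as shown earlier:

# start one 8-thread Stream instance in each VM at the same time
for vm in vm1 vm2; do
    ssh $vm 'OMP_NUM_THREADS=8 ./stream' > $vm.out &
done
wait   # per-test bandwidths in vm1.out and vm2.out are then summed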

It is apparent that the Linux "first-touch" scheduling algorithm, together with the simplicity of the Stream algorithm, is enough to ensure that nearly all memory accesses in the native tests are "local" (that is, the processor each thread runs on and the memory it accesses belong to the same NUMA node). In ESX 4.1, NUMA information is not passed to the guest OS, and (by default) 8-vCPU VMs are scheduled across NUMA nodes in order to take advantage of more physical cores. This means that about half of memory accesses will be "remote" and that, in the default configuration, one or two VMs must produce significantly less bandwidth than the native tests.

Setting preferHT=1 tells the ESX scheduler to count logical processors (hardware threads) instead of cores when determining whether a given VM can fit on a NUMA node. In this case that forces both the memory and the CPUs of an 8-vCPU VM to be scheduled on a single NUMA node. This guarantees that all memory accesses are local, so the aggregate bandwidth of two VMs can equal or exceed native bandwidth. Note that a single VM cannot match this bandwidth: it will get either half of it (because it is using the resources of only one NUMA node) or about 70% (because half of its memory accesses are remote). In both native and virtual environments, the maximum bandwidth of purely remote memory accesses is about half that of purely local accesses. On machines with more NUMA nodes, remote memory bandwidth may be lower still and the importance of memory locality even greater.
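The local-versus-remote bandwidth gap described above can be demonstrated directly on the native host with the standard Linux numactl tool (the node numbers refer to this machine's two NUMA nodes):

# all threads and memory on node 0: purely local accesses
OMP_NUM_THREADS=8 numactl --cpunodebind=0 --membind=0 ./stream

# threads on node 0 but memory on node 1: purely remote accesses,
# roughly half the local bandwidth
OMP_NUM_THREADS=8 numactl --cpunodebind=0 --membind=1 ./stream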


In both native and virtualized environments, equivalent maximum memory bandwidth can be achieved as long as the application is written or configured to use only local memory. For native deployments this means relying on the Linux "first-touch" scheduling algorithm (for simple apps) or implementing explicit mechanisms in the code (usually difficult if the code wasn't designed for NUMA). For virtualized deployments a different mindset is needed: the application needs to be able to run across multiple machines, with each VM sized to fit on a NUMA node. On machines with hyper-threading enabled, preferHT=1 needs to be set for the larger VMs.

If these requirements can be met, a valuable feature of virtualization is that the app needs no NUMA awareness at all; NUMA scheduling is taken care of by the hypervisor (for all apps, not just those where Linux is able to align threads and memory on the same NUMA node). For apps where these requirements can't be met (ones that need a single large OS instance), current development focus is on relaxing these requirements so that such VMs behave more like native, while retaining the above advantage for small VMs.

Next: NAMD