
Performance Scaling of an Entry-Level Cluster

Performance benchmarking is often conducted on top-of-the-line hardware, including hosts that typically have a large number of cores, maximum memory, and the fastest disks available. Hardware of this caliber is not always accessible to small or medium-sized businesses with modest IT budgets. As part of our ongoing investigation of different ways to benchmark the cloud using the newly released VMmark 2.0, we set out to determine whether a cluster of less powerful hosts could be a viable alternative for these businesses. We used VMmark 2.0 to see how a four-host cluster with a modest hardware configuration would scale under increasing load.

Workload throughput is often limited by disk performance, so the tests were repeated with two different storage arrays to show the effect that upgrading the storage would offer in terms of performance improvement. We tested two disk arrays that varied in both speed and number of disks, an EMC CX500 and an EMC CX3-20, while holding all other characteristics of the testbed constant.

To review, VMmark 2.0 is a next-generation, multi-host virtualization benchmark that models application performance and the effects of common infrastructure operations such as vMotion, Storage vMotion, and a virtual machine deployment. Each tile contains Microsoft Exchange 2007, DVD Store 2.1, and Olio application workloads which run in a throttled fashion. The Storage vMotion and VM deployment infrastructure operations require the user to specify a LUN as the storage destination. The VMmark 2.0 score is computed as a weighted average of application workload throughput and infrastructure operation throughput. For more details about VMmark 2.0, see the VMmark 2.0 website or Joshua Schnee’s description of the benchmark.
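The weighted-average scoring described above can be sketched in a few lines. Note that the weights, normalization baselines, and function name below are assumptions for illustration only; the actual VMmark 2.0 formula is defined by the benchmark kit itself.

```python
# Illustrative sketch of a weighted-average score combining normalized
# application and infrastructure throughput, per the description above.
# The 0.8/0.2 weights and the baseline values are hypothetical.

def weighted_score(app_throughput, infra_throughput,
                   app_baseline, infra_baseline,
                   app_weight=0.8, infra_weight=0.2):
    """Normalize each throughput against its baseline, then combine."""
    app_norm = app_throughput / app_baseline
    infra_norm = infra_throughput / infra_baseline
    return app_weight * app_norm + infra_weight * infra_norm
```

Because the application workloads run throttled, per-tile application throughput is roughly constant, so the score grows mainly with the number of tiles that pass QoS.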

All tests were conducted on a cluster of four Dell PowerEdge R310 hosts running VMware ESX 4.1 and managed by VMware vCenter Server 4.1. These servers are typical of today's entry-level hardware; each contained a single quad-core 2.80 GHz Intel Xeon X3460 processor (with hyperthreading enabled) and 32 GB of RAM. Each server also used two 1Gbit NICs for VM traffic and a third 1Gbit NIC for vMotion activity.

To determine the relative impact of different storage solutions on benchmark performance, runs were conducted on two existing storage arrays, an EMC CX500 and an EMC CX3-20. For details on the array configurations, refer to Table 1 below. VMs were stored on identically configured ‘application’ LUNs, while a designated ‘maintenance’ LUN was used for the Storage vMotion and VM deployment operations.

Table 1. Disk Array Configuration

To measure the cluster's performance scaling under increasing load, we started with a single tile and added tiles until a run failed to meet the benchmark's Quality of Service (QoS) requirements. As load on the cluster increases, application throughput, CPU utilization, and VMmark 2.0 scores are expected to rise; the VMmark score increases as a function of throughput. By scaling out the number of tiles, we hoped to determine the maximum load our four-host cluster of entry-level servers could support. We expected two things. First, VMmark 2.0 scores would not scale linearly from one to three tiles because, in this configuration, the infrastructure operations load remained constant; infrastructure load increases primarily as a function of cluster size. Although it shows only a two-host cluster, this figure from Joshua Schnee's recent blog article demonstrates the relationship between application throughput, infrastructure operations throughput, and the number of tiles. Second, we expected better performance on the CX3-20 than on the CX500 because the CX3-20 has more disks per LUN as well as faster individual drives. Figure 1 below details the scale-out performance on the CX500 and the CX3-20 disk arrays using VMmark 2.0.
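The scale-out procedure described above, adding tiles until QoS fails, amounts to a simple driver loop. The `run_benchmark` hook below is hypothetical; in practice the VMmark 2.0 harness performs the runs and QoS checks itself.

```python
# Sketch of the scale-out methodology: increase the tile count until a
# run fails QoS, and report the last passing configuration.
# run_benchmark is a hypothetical callable: run_benchmark(n_tiles) ->
# (score, qos_passed).

def find_max_tiles(run_benchmark, max_tiles=10):
    """Return (tiles, score) for the largest tile count that passed QoS,
    or None if even one tile failed."""
    best = None
    for tiles in range(1, max_tiles + 1):
        score, qos_passed = run_benchmark(tiles)
        if not qos_passed:
            break  # QoS violated; the previous count was the maximum
        best = (tiles, score)
    return best
```

In our experiment this loop would terminate at four tiles, leaving three tiles as the maximum supported load on both arrays.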

Figure 1. VMmark 2.0 Scale Out On a Four-Host Cluster


Both configurations saw improved throughput from one to three tiles, but at four tiles each failed to meet at least one QoS requirement. These results show that a user wanting to maintain an average cluster CPU utilization of 50% on a four-host cluster could count on it to support a two-tile load. Note that in this experiment, score increases across tiles are largely due to increased workload throughput rather than an increased number of infrastructure operations.

As expected, runs using the CX3-20 showed consistently higher normalized scores than those on the CX500. Runs on the CX3-20 outperformed the CX500 by 15%, 14%, and 12% on the one, two, and three-tile runs, respectively. The increased performance of the CX3-20 over the CX500 was accompanied by approximately 10% higher CPU utilization, which indicated that the faster CX3-20 disks allowed the CPU to stay busier, increasing total throughput.

The results show that our cluster of entry-level servers with the modest CX500 array supported approximately 220 DVD Store 2.1 operations per second, 16 send-mail actions, and 235 Olio updates per second. The more robust CX3-20 array supported 270 DVD Store 2.1 operations per second with the same send-mail and Olio throughput, 20% lower latencies on average, and correspondingly slightly higher CPU utilization.

Note that this type of experiment is possible for the first time with VMmark 2.0; VMmark 1.x was limited to benchmarking a single host but the entry-level servers under test in this study would not have been able to support even a single VMmark 2.0 tile on an individual server. By spreading the load of one tile across a cluster of servers, however, it becomes possible to quantify the load that the cluster as a whole is capable of supporting.  Benchmarking our cluster with VMmark 2.0 has shown that even modest clusters running vSphere can deliver an enormous amount of computing power to run complex multi-tier workloads.

Future Directions
In this study, we scaled out VMmark 2.0 on a four-host entry-level cluster to measure performance scaling and the maximum supported number of tiles. This placed a much higher load on the cluster than would be typical for a small or medium business, giving such businesses confidence that the cluster can comfortably handle their application workloads. An alternate experiment would be to run fewer tiles while measuring the performance of other enterprise-level features, such as VMware High Availability. This ability to benchmark the cloud in many different ways is one benefit of a well-designed multi-host benchmark. Keep watching this blog for more interesting studies in benchmarking the cloud with VMmark 2.0.


About Rebecca Grider

Rebecca is a performance engineer in VMware's Benchmarking team. She specializes in performance characterization tools and system performance analysis. Rebecca tests and develops the VMmark virtualization infrastructure benchmark and other internal toolkits and conducts technical review of VMmark benchmark submissions on the VMmark Benchmark review panel. Rebecca has also developed some of the most popular performance-related technical content for the VMware Hands-on Labs and led efforts to standardize and maintain high quality of Hands-on Lab content. Prior to VMware, Rebecca obtained a M.S. in Computer Science from the University of Texas at Austin.
