Measuring Cluster Scaling with VMmark

As I mentioned on April 30th, we have released VMmark 1.1 to our partners and plan a general release in the near future. We have been extremely pleased with the virtualization community’s response to VMmark. It clearly addresses an important need: reliably measuring the performance of virtualization platforms in a representative and fair way.

However, while we were planning VMmark 1.1, we were struck by the startlingly fast evolution of virtualization technologies. Single-system performance as measured by VMmark is quickly becoming only one part of the performance equation in virtualized environments. Datacenters no longer contain a set of disconnected virtualized server silos but rather a fully dynamic set of cooperating systems, enabled by the no-downtime movement of virtual servers among the underlying physical hosts using VMotion. Anything less fails to realize the full value of virtualization. With this new reality in mind, we have been experimenting with VMmark 1.0 in parallel with the development of VMmark 1.1 in order to understand the issues in creating a next-generation, cluster-aware virtualization benchmark. The results have been very encouraging. Over the next few days, I will be sharing some of our early results with you.

When we set about designing a prototype of a cluster-based benchmark, we defined two requirements based on the way customers are using dynamic datacenters. The first requirement was that there be no downtime during server migrations, including no interruptions to service such as dropped connections. This is easy to ensure since the VMmark benchmark harness detects these types of failures and marks the benchmark run as non-compliant. The second requirement was that no initial distribution of the virtual servers across hosts be allowed, which forces the virtual infrastructure to rebalance the load. This ensures that the virtualization infrastructure is able to actively manage the load across physical hosts. We satisfied the no-initial-placement requirement by requiring that every benchmark test begin with all VMs running on a single physical host. These goals were aided by the inherent flexibility of the VMmark harness, which communicates with the various server VMs without regard to the underlying physical resources. It was almost trivial to execute the benchmark while the server VMs were executing on multiple physical hosts and VMotioning between them, even under extremely heavy loads.

Experimental Setup

Next, we went into our lab and pulled together a small set of machines with which to test our ideas. We installed VMware Virtual Infrastructure 3 version 3.5 (VI3.5) and configured the hosts as a cluster. Our test equipment is listed below:

Servers

Dell 2950, 2 x Intel Xeon X5365 @ 3.0GHz, 32GB
HP DL380G5, 2 x Intel Xeon X5460 @ 3.16GHz, 32GB
IBM 3650, 2 x Intel Xeon X5365 @ 3.0GHz, 48GB
Sun x4150, 2 x Intel Xeon X5355 @ 2.66GHz, 64GB

All servers contained two Intel e1000 dual-port NICs, which were allocated to the virtual machines. Each server’s onboard NICs were allocated for VMotion and the service console (COS). All servers utilized one QLogic 2462 dual-port FC HBA connected to the SAN.

SAN

3 x EMC CX3-20 disk arrays, each with 45 10k RPM 146GB disks

Each array had seven 4-disk RAID0 LUNs, each hosting a VMmark tile.

Clients

16 x HP DL360G5, 1 x Intel Xeon X5355 @ 2.66GHz, 4GB
4 x HP DL385G1, 2 x AMD 2218 @ 2.6GHz, 4GB

Scoring Methodology

VMmark is a single-server virtualization benchmark. We cannot directly use the standard VMmark score as a metric since we are not strictly following the Run and Reporting Rules for the benchmark. However, the rules do allow for academic-style studies as long as the results are not reported as a VMmark score. (SPEC has similar rules.) Since we are primarily interested in performance scaling as we ramp up the number of tiles (each tile being a set of six workload VMs) on the cluster, we can simply normalize our throughput with respect to the throughput achieved by a single tile. Before we enabled the cluster, we ran a single tile on the HP DL380G5 (which happened to have the fastest CPUs) to generate a reference score. All cluster measurements are then divided by this reference score to obtain a scaling metric.
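
The normalization described above amounts to a single division. Here is a minimal sketch in Python, assuming each run's throughput is available as a plain number. The function name and the sample values are hypothetical placeholders, not measurements from these experiments:

```python
def scaling_metric(cluster_throughput: float, reference_throughput: float) -> float:
    """Normalize a cluster throughput by the single-tile reference throughput."""
    if reference_throughput <= 0:
        raise ValueError("reference throughput must be positive")
    return cluster_throughput / reference_throughput


# Placeholder numbers in arbitrary units, for illustration only.
reference = 2.5           # hypothetical single-tile throughput on one server
cluster_run = 10.0        # hypothetical throughput of a multi-tile cluster run
print(scaling_metric(cluster_run, reference))
```

A scaling of 1.0 means the cluster delivered exactly the single-tile reference throughput, and a value of N means it delivered N times that.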

Results

We started off by running a single VMmark tile across the 4-node cluster using VMware’s Distributed Resource Scheduler (DRS), which automatically and dynamically balances the load across the cluster. We placed the six VMs that form a tile onto one of the servers and then let DRS balance the load automatically across all four hosts. The throughput metric exactly matched the single-tile throughput on the single server. In both cases, there was a large excess of resources and the workloads were able to achieve the same results.

We then ran similar experiments using 4, 10, 15, 16, 17, 18, 19, and 20 tiles (this means running 24, 60, 90, 96, 102, 108, 114, and 120 workload VMs, respectively, on the cluster). All four servers became CPU-saturated at 17 tiles and beyond. We varied the DRS aggressiveness and discovered that a setting of “2 Stars”, which is slightly more aggressive than the default of “3 Stars”, provided the best results. The results are shown in the graph below:

When running a single tile, the throughput was identical to the single-server reference score, which resulted in a scaling of 1.0. The 4-tile experiment presents an equivalent load of one tile per server to the cluster and results in a linear 4x scaling over a single tile. By 16 tiles, the cluster is nearing saturation of the physical CPUs, leading to scaling of 15.2x. Scaling is 15.9x once the CPU saturation point is reached at 17 tiles. We continued to add tiles until we exhausted our supply of client systems in order to assess the robustness of VI3.5 when running in an overcommitted situation. For tile counts of 18, 19, and 20, the performance held steady and achieved roughly the same score as at the saturation point of 17 tiles. As expected of a true enterprise-class solution, VI3.5 performs in a stable and predictable fashion in this highly stressful regime of heavy CPU utilization.
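
One way to read these numbers is to divide each scaling figure by its tile count. The short Python sketch below does that for the four data points quoted in the text. The "per-tile efficiency" ratio and the function names are my own illustrative additions, not part of the VMmark scoring rules:

```python
VMS_PER_TILE = 6  # six VMs form one tile, as described above

# (tile count -> scaling) pairs quoted in the text; the remaining runs
# appear only in the graph and are deliberately left out here.
REPORTED_SCALING = {1: 1.0, 4: 4.0, 16: 15.2, 17: 15.9}


def workload_vms(tiles: int) -> int:
    """Number of workload VMs running for a given tile count."""
    return tiles * VMS_PER_TILE


def per_tile_efficiency(tiles: int, scaling: float) -> float:
    """Scaling delivered per tile; 1.0 means perfectly linear scaling."""
    return scaling / tiles


for tiles, scaling in REPORTED_SCALING.items():
    efficiency = per_tile_efficiency(tiles, scaling)
    print(f"{tiles:2d} tiles, {workload_vms(tiles):3d} VMs, "
          f"scaling {scaling:4.1f}, per-tile efficiency {efficiency:.3f}")
```

By this measure, scaling stays linear through 4 tiles and remains above 0.93 per tile at the 17-tile saturation point.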

Our final experiment compares the throughput achieved by the fully automated DRS solution at the initial 17-tile saturation point with the throughput achieved running the benchmark in perfectly balanced fashion using hand placement of the workload VMs. In this case, hand placement achieves scaling of 16.5x versus 15.9x using DRS. Although VMware continues to work on improving DRS performance, I believe most users would agree that automatically delivering 96% of the best-case performance is an excellent result.
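
The 96% figure follows directly from the two scaling numbers above. A quick check in Python, where the constant and function names are illustrative only:

```python
HAND_PLACED_SCALING = 16.5  # perfectly balanced hand placement at 17 tiles
DRS_SCALING = 15.9          # fully automated DRS placement at 17 tiles


def fraction_of_best_case(automated: float, best_case: float) -> float:
    """Share of the hand-tuned throughput that the automated run achieved."""
    return automated / best_case


ratio = fraction_of_best_case(DRS_SCALING, HAND_PLACED_SCALING)
print(f"DRS delivers {ratio:.1%} of the hand-placed throughput")
```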

The Big Picture

Let’s take a step back and talk about what has been accomplished on this relatively modest cluster by running 17 VMmark tiles (102 server VMs). That translates into simultaneously:

Supporting 17,000 Exchange 2003 mail users.
Sustaining more than 35,000 database transactions per minute using MySQL/SysBench.
Driving more than 350 MB/s of disk IO.
Serving more than 30,000 web pages each minute.
Running 17 Java middle-tier servers.

We then increased that load by more than 17% without degrading the overall throughput of the cluster. I suspect that supporting such an extreme configuration using only four dual-socket, quad-core servers is more than most customers will attempt. But I am certain that they will find it reassuring to know that VI3.5 is up to the task and should have no trouble meeting the needs of a typical small or medium business with a few servers, not to mention large enterprises with much larger datacenters.
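
The "more than 17%" figure comes from going from the 17-tile saturation point to the 20-tile maximum. A small Python check, assuming load is proportional to tile count (the names below are illustrative):

```python
SATURATION_TILES = 17  # tile count at which all four servers became CPU-saturated
MAX_TILES = 20         # largest tile count the client systems could drive


def relative_increase(base: int, new: int) -> float:
    """Fractional increase in load when going from base to new tiles."""
    return (new - base) / base


increase = relative_increase(SATURATION_TILES, MAX_TILES)
print(f"Load beyond saturation: +{increase:.1%}")
```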

Future Work

In our next installment, I will demonstrate the ability of VMware’s Virtual Infrastructure to dynamically relieve resource bottlenecks like the cluster overcommitment scenario encountered above. Stay tuned.