
Trying VMmark on Some New Hardware

I recently received a new HP DL380G5 server sporting Intel Woodcrest dual-core processors. I was finally ready to try it out on the Friday before Christmas, and I didn't want to be a Grinch by asking the lab support folks to connect a disk array before the long weekend, so I quickly set up a test using the built-in SAS controller, which supports up to eight SAS drives. Two of the drives were configured in RAID 1 and held the ESX installation. I configured the six remaining drives into a single RAID 0 LUN and then created three equally sized partitions holding three VMFS filesystems, one for each VMmark tile I expected to run. I then ran the benchmark with 1, 2, and 3 tiles, which consumed roughly 40%, 80%, and 100% of the CPU resources, respectively.
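
For concreteness, here is a minimal Python sketch of that setup and of the CPU numbers. The drive counts and utilization figures are the ones quoted above; the relative increases are derived from them by simple arithmetic rather than measured separately.

```python
# Disk layout on the built-in SAS controller (8 drive bays), as described above.
layout = {
    "RAID 1 (ESX installation)": 2,   # two mirrored drives
    "RAID 0 LUN for VMFS": 6,         # six striped drives, carved into three
                                      # equal partitions, one VMFS per tile
}

# Approximate host CPU utilization observed at each tile count.
cpu_util = {1: 0.40, 2: 0.80, 3: 1.00}

for role, drives in layout.items():
    print(f"{drives} drives: {role}")

for tiles in sorted(cpu_util):
    prev = cpu_util.get(tiles - 1)
    if prev is None:
        print(f"{tiles} tile:  {cpu_util[tiles]:.0%} CPU")
    else:
        rel = cpu_util[tiles] / prev - 1
        print(f"{tiles} tiles: {cpu_util[tiles]:.0%} CPU "
              f"(+{rel:.0%} over {tiles - 1} tile(s))")
```

The 25% relative jump in CPU utilization from two tiles to three is the same figure the mail server result is compared against below.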

VMmark was designed to mimic the uneven demands across workloads that are typical of multi-VM consolidation environments. In general, how each workload performs and scales depends on the capabilities of the subsystems it exercises, e.g., disk and CPU for the database server. Analyzing the variations in performance as the underlying hardware components become saturated helps to validate the system configuration as well as the proper behavior of the virtualization layer. Although the scaling at 3 tiles looked decent, I wanted to see why it wasn't better, so I examined the scores of the individual workloads.

The mail server scales perfectly from 1 to 2 tiles and then improves by another 25%, equal to the 25% increase in CPU utilization, when a third tile is added to fully saturate the CPU on the system. The java server also scales perfectly up to 3 tiles: thanks to the think-time constraints built into VMmark, each java server VM is allocated sufficient CPU shares to achieve full performance and excellent scaling even with 3 tiles.

Both the database server and the web server exhibit roughly linear scaling from 1 to 2 tiles. In a less than fully utilized system, both workloads tend to consume more than their guaranteed share of the CPU resources. As the system becomes saturated, they are unlikely to receive as many CPU resources as they could consume, leading to poorer individual scaling. In this test, the database server gets a reasonable but not perfect boost going to 3 tiles. The web server, which is typically the greedier of the two, actually sees a drop in its overall score with 3 tiles. So far, everything is behaving more or less as expected.

The poor file server scaling is less intuitive. The aggregate throughput peaks at roughly 35MB/s running a single tile. With two and three VMs, the aggregate throughput drops to 29MB/s and 23MB/s, respectively. I have seen this behavior before on other systems and even discussed it during my VMworld session on VMmark. Examining the esxtop disk statistics shows that the SCSI queue for the LUN is a source of contention among the multiple file server VMs. I know from experience that simply increasing the queue depth improves the situation somewhat, but a better solution is to put each tile on a separate LUN.
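
To put the file server numbers in perspective, this short sketch converts the aggregate throughputs quoted above into per-VM throughput and scaling efficiency; the MB/s values are the measurements from this run, the rest is arithmetic.

```python
# Aggregate file server throughput (MB/s) measured with 1, 2, and 3 tiles
# sharing the single 6-disk RAID 0 LUN.
aggregate_mb_per_s = {1: 35.0, 2: 29.0, 3: 23.0}

for tiles, agg in sorted(aggregate_mb_per_s.items()):
    per_vm = agg / tiles
    # Efficiency relative to perfect scaling of the 1-tile result.
    efficiency = agg / (aggregate_mb_per_s[1] * tiles)
    print(f"{tiles} tile(s): {agg:4.1f} MB/s aggregate, "
          f"{per_vm:4.1f} MB/s per VM, {efficiency:.0%} of ideal scaling")
```

Per VM, throughput falls from 35 MB/s to under 8 MB/s as the VMs contend for the single LUN's SCSI queue, which is exactly where the esxtop statistics pointed.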

In this scheme, the file server VMs are separated and cannot negatively affect each other when they fill a SCSI queue. Since I didn't have extra disks available, I deleted the existing 6-disk LUN, replaced it with three 2-disk LUNs, and then recreated the tiles, one on each LUN. Re-running the tests provided some interesting insight. Figure 3 shows the workload scores using the new disk layout, and Figure 4 compares the workload scores obtained with the two different disk layouts. In the 1-tile case, all of the workload scores except the file server's were about equal. However, the file server in the 2-disk LUN case only managed 12.5 MB/s, roughly 1/3 of the throughput of the 6-disk LUN. This should be no surprise since we are using 1/3 as many disks.
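
The 1/3-as-many-disks argument can be sanity-checked by normalizing each layout's 1-tile throughput to a per-disk figure; the two results come out roughly equal, which is what makes the drop unsurprising.

```python
# 1-tile file server throughput divided by the number of disks in its LUN.
per_disk_6 = 35.0 / 6    # shared 6-disk RAID 0 LUN:    ~5.8 MB/s per disk
per_disk_2 = 12.5 / 2    # dedicated 2-disk RAID 0 LUN: ~6.3 MB/s per disk
print(f"per-disk throughput: {per_disk_6:.1f} vs {per_disk_2:.1f} MB/s")
```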

For the two-tile case, using separate LUNs yields a slightly lower overall score, again largely due to the lower file server throughput of 26MB/s (vs. 29MB/s with a single LUN). The important thing to note is that the scaling is much better due to the separation of the file servers. The payoff comes when 3 tiles are run and the file server VMs continue to achieve excellent scaling. With 3 separate LUNs, the aggregate file server throughput is 38MB/s (vs. 23MB/s using a single LUN). The improved file server throughput does come at the cost of slightly lower throughput for the database and web server workloads, since the file server VMs are now able to fully utilize their allocated shares of the system. Figure 5 compares the overall VMmark scores of the single-LUN and multi-LUN configurations. We see that the multi-LUN layout is slower for one tile, roughly equal for two tiles, and better for three tiles due to the improved file server results.
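
Finally, a small sketch that tabulates the aggregate file server throughput of the two layouts side by side; the MB/s values are the measurements reported above.

```python
# Aggregate file server throughput (MB/s) by tile count for the two layouts.
single_lun = {1: 35.0, 2: 29.0, 3: 23.0}   # one shared 6-disk RAID 0 LUN
multi_lun  = {1: 12.5, 2: 26.0, 3: 38.0}   # three 2-disk RAID 0 LUNs, one per tile

print("tiles  single-LUN  multi-LUN")
for tiles in sorted(single_lun):
    print(f"{tiles:5d}  {single_lun[tiles]:10.1f}  {multi_lun[tiles]:9.1f}")
```

Laid out this way, it is also clear that the per-tile LUNs scale almost perfectly (12.5, 26, 38 MB/s) while the shared LUN regresses as file servers are added.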

These two disk configurations highlight some interesting tradeoffs and tuning opportunities exposed by VMmark. The single-LUN configuration utilizing all six disks provides high disk throughput for a single VM at the expense of scalability when multiple disk-intensive VMs are running. On the other hand, creating multiple LUNs provides both good predictability and excellent scaling, but limits the total throughput of any single VM by giving each one only a subset of the hardware resources. From a benchmarking perspective, the multi-LUN approach is clearly better since it results in a higher overall score. In practice, the proper approach depends upon the needs and goals of each user. I am excited by the ability VMmark gives us to study these types of performance tuning tradeoffs in a representative multi-VM environment. I feel that building performance tuning expertise in these complex situations and getting that information to our customers, along with the ability to evaluate hardware and software platforms for virtualization, should make VMmark an extremely valuable tool. Please stay tuned as we work to make that a reality.