
Monthly Archives: May 2008

Measuring Cluster Scaling with VMmark

As I mentioned on April 30th, we have released VMmark 1.1 to our partners and intend a general release in the near future. We have been extremely pleased with the virtualization community’s response to VMmark. It clearly addresses an important need: reliably measuring the performance of virtualization platforms in a representative and fair way. However, while we were planning VMmark 1.1, we were struck by the startlingly fast evolution of virtualization technologies. Single-system performance as measured by VMmark is quickly becoming only a portion of the performance equation in virtualized environments. Datacenters no longer contain a set of disconnected virtualized server silos but a fully dynamic set of cooperating systems enabled by the no-downtime movement of virtual servers among the underlying physical hosts using VMotion. Anything less fails to realize the full value of virtualization. With this new reality in mind, we have been experimenting with VMmark 1.0 in parallel with the development of VMmark 1.1 in order to understand the issues in creating a next-generation cluster-aware virtualization benchmark. The results have been very encouraging. Over the next few days, I will be sharing some of our early results with you.

When we set about designing a prototype of a cluster-based benchmark, we defined two requirements based on the way customers are using dynamic datacenters. The first requirement was that there be no downtime during server migrations, including no interruptions to service such as dropped connections. This is easy to ensure since the VMmark benchmark harness will detect these types of failures, resulting in a non-compliant benchmark run. The second requirement was that we disallow any initial distribution of the virtual servers across hosts, forcing the virtual infrastructure to rebalance the load. This ensures that the virtualization infrastructure is able to actively manage the load across physical hosts. We satisfied the no-initial-placement requirement by requiring that any benchmark test begin with all VMs running on a single physical host. Both goals were aided by the inherent flexibility of the VMmark harness, which communicates with the various server VMs without regard to the underlying physical resources. It was almost trivial to execute the benchmark while the server VMs were running on multiple physical hosts and VMotioning between them, even under extremely heavy loads.

Experimental Setup

Next, we went into our lab and pulled together a small set of machines with which to test our ideas. We installed VMware Virtual Infrastructure 3 version 3.5 (VI3.5) and configured the hosts as a cluster. Our test equipment is listed below:

Servers

  • Dell 2950, 2 x Intel Xeon X5365 @ 3.0GHz, 32GB
  • HP DL380G5, 2 x Intel Xeon X5460 @ 3.16GHz, 32GB
  • IBM 3650, 2 x Intel Xeon X5365 @ 3.0GHz, 48GB
  • Sun x4150, 2 x Intel Xeon X5355 @ 2.66GHz, 64GB

All servers contained two Intel e1000 dual-port NICs, which were allocated to the virtual machines. Each server’s onboard NICs were allocated for VMotion and COS. All servers utilized one Qlogic 2462 dual-port FC HBA connected to the SAN.

SAN

  • 3 x EMC CX3-20 disk arrays, each with 45 10k RPM 146GB disks

Each array had seven 4-disk RAID0 LUNs, each hosting a VMmark tile.

Clients

  • 16 x HP DL360G5, 1 x Intel Xeon X5355 @ 2.66GHz, 4GB
  • 4 x HP DL385G1, 2 x AMD 2218 @ 2.6 GHz, 4GB

Scoring Methodology

VMmark is a single-server virtualization benchmark. We cannot directly use the standard VMmark score as a metric since we are not strictly following the Run and Reporting Rules for the benchmark. However, the rules do allow for academic-style studies as long as the results are not reported as a VMmark score. (SPEC has similar rules.) Since we are primarily interested in performance scaling as we ramp up the number of tiles (each tile is a set of six workload VMs) on the cluster, we can simply normalize our throughput with respect to the throughput achieved by a single tile. Before we enabled the cluster, we ran a single tile on the HP DL380G5 (which happened to have the fastest CPUs) to generate a reference score. All cluster measurements are then divided by this reference score to obtain a scaling metric.
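
In other words, the metric is simply a throughput ratio. Here is a minimal sketch of the calculation, using placeholder numbers rather than measured results:

    # Scaling metric used in this study: cluster throughput normalized by the
    # throughput of a single tile run on the standalone HP DL380G5.
    # The values below are placeholders for illustration only.
    def scaling(cluster_throughput, reference_throughput):
        return cluster_throughput / reference_throughput

    reference = 1000.0                  # hypothetical single-tile reference throughput
    cluster = 15900.0                   # hypothetical aggregate cluster throughput
    print(scaling(cluster, reference))  # 15.9, i.e., 15.9x scaling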

Results

We started off by running a single VMmark tile across the 4-node cluster using VMware’s Distributed Resource Scheduler (DRS) which automatically and dynamically balances the load across the cluster. We placed the six VMs that form a tile onto one of the servers and then let DRS balance the load automatically across all four hosts. The throughput metric exactly matched the single-tile throughput on the single server. In both cases, there was a large excess of resources and the workloads were able to achieve the same results.

We then ran similar experiments using 4, 10, 15, 16, 17, 18, 19, and 20 tiles (this means running 24, 60, 90, 96, 102, 108, 114, and 120 workload VMs, respectively, on the cluster). All four servers became CPU-saturated at 17 tiles and beyond. We varied the DRS aggressiveness and discovered that a setting of “2 Stars,” which is slightly more aggressive than the default of “3 Stars,” provided the best results. The results are shown in the graph below:

[Figure: VMmark throughput scaling on the four-host cluster as the tile count increases]

When running a single tile, the throughput was identical to the single-server reference score, which resulted in a scaling of 1.0. The 4-tile experiment presents an equivalent load of one tile per server to the cluster and results in a linear 4x scaling over a single tile. By 16 tiles, the cluster is nearing saturation of the physical CPUs, leading to scaling of 15.2x. Scaling is 15.9x once the CPU saturation point is reached at 17 tiles. We continued to add tiles until we exhausted our supply of client systems in order to assess the robustness of VI3.5 when running in an overcommitted situation. For tile counts of 18, 19, and 20 the performance held steady and achieved roughly the same score as at the saturation point of 17 tiles. As expected of a true enterprise-class solution, VI3.5 performs in a stable and predictable fashion in this highly stressful regime of heavy CPU utilization.

Our final experiment compares the throughput achieved by the fully automated DRS solution at the initial 17-tile saturation point with the throughput achieved running the benchmark in perfectly balanced fashion using hand placement of the workload VMs. In this case, hand placement achieves scaling of 16.5x versus 15.9x using DRS. Although VMware continues to work on improving DRS performance, I believe most users would agree that automatically delivering 96% of the best-case performance is an excellent result.

The Big Picture

Let’s take a step back and talk about what has been accomplished on this relatively modest cluster by running 17 VMmark tiles (102 server VMs). That translates into simultaneously:

  • Supporting 17,000 Exchange 2003 mail users.
  • Sustaining more than 35,000 database transactions per minute using MySQL/SysBench.
  • Driving more than 350 MB/s of disk IO.
  • Serving more than 30,000 web pages each minute.
  • Running 17 Java middle-tier servers.

We then increased that load by more than 17% without degrading the overall throughput of the cluster. I suspect that supporting such an extreme configuration using only four dual-socket, quad-core servers is more than most customers will attempt. But I am certain that they will find it reassuring to know that VI3.5 is up to the task and should have no trouble meeting the needs of a typical small or medium business with a few servers, not to mention large enterprises with much larger datacenters.

Future Work

In our next installment, I will demonstrate the ability of VMware’s Virtual Infrastructure to dynamically relieve resource bottlenecks like the cluster overcommitment scenario encountered above. Stay tuned.

100,000 I/O Operations Per Second, One ESX Host

The performance of I/O is critical to achieving good overall performance for enterprise applications. Workloads like transaction processing systems, web applications, and mail servers are sensitive to the throughput and latency of the I/O subsystem. In order for VMware ESX to run these applications well, it needs to push large amounts of I/O without adding significant latencies.

To demonstrate the scalability of the ESX I/O stack, we decided to see if ESX could sustain 100,000 IOPS. Many enterprise applications access their data in relatively small I/O blocks placed throughout the dataset. So the metric we want to focus on is random I/O throughput, measured in I/O operations per second (IOPS), rather than raw bandwidth. We used a workload that was 100% random with a 50/50 read/write mix and an 8KB block size.

The next step was to get our hands on enough storage to run the experiments on a large scale. We went to the Midrange Partner Solutions Engineering team at EMC in Santa Clara, and they were kind enough to let us use the storage infrastructure in their lab. They loaned us three CLARiiON CX3-80 storage arrays, each with 165 15K RPM disks, for a total of 495 disks and 77TB of storage. Our experiments used the Iometer I/O stress tool running in virtual machines on a server running ESX 3.5 Update 1. The server was a quad-core, quad-socket (16 cores total) system with 32GB of physical memory.

We ramped up the I/O rate on the system while keeping a close eye on the I/O latencies. We managed to achieve over 100K IOPS before running out of disk bandwidth on the storage arrays. And we still had plenty of headroom to spare on the server running ESX. To put this into perspective, the 77TB of raw storage used in these experiments is enough to hold the entire printed Library of Congress. You’d need to run 200,000 Microsoft Exchange mailboxes (LoadGen heavy user profile) or 85 average 4-way database servers to generate an I/O rate of 100K IOPS.
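
Reading those comparisons backwards gives the per-workload I/O rates they imply; this is just simple division of the figures above, not an additional measurement:

    # Implied per-workload I/O rates from the comparison above.
    total_iops = 100_000
    print(total_iops / 200_000)  # ~0.5 IOPS per "heavy" LoadGen Exchange mailbox
    print(total_iops / 85)       # ~1,176 IOPS per average 4-way database server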

The sections below present a more detailed discussion of the set of experiments we ran and provide additional information on the experimental configuration.

Details of I/O Workload

We chose a workload that is representative of the most common transaction-oriented applications. We defined an I/O pattern with an 8KB block size, 50% read + 50% write, and 100% random access. Enterprise applications like Microsoft Exchange and transaction-oriented databases like Oracle and Microsoft SQL Server use similar I/O patterns. Figure 1 shows a screen shot of the Iometer access specification.


Figure 1. Iometer Access Specification
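
For readers who want to approximate a similar pattern without Iometer, the sketch below generates 8KB, 100% random, 50/50 read/write traffic against a test file. It is only a simplified stand-in for the tool used in these experiments: the file path, file size, and run length are arbitrary placeholders, it requires a Unix-like OS for os.pread/os.pwrite, and without O_DIRECT much of the I/O may be absorbed by the operating system’s page cache rather than reaching the disks.

    import os
    import random
    import time

    # Rough approximation of the access specification in Figure 1:
    # 8KB transfers, 100% random offsets, 50% reads / 50% writes.
    BLOCK_SIZE = 8 * 1024
    TEST_FILE = "iotest.dat"            # placeholder path
    FILE_SIZE = 1024 * 1024 * 1024      # placeholder 1GB test file
    DURATION_S = 10                     # placeholder run length

    write_buf = os.urandom(BLOCK_SIZE)
    fd = os.open(TEST_FILE, os.O_RDWR | os.O_CREAT)
    os.ftruncate(fd, FILE_SIZE)

    ops = 0
    deadline = time.time() + DURATION_S
    while time.time() < deadline:
        offset = random.randrange(FILE_SIZE // BLOCK_SIZE) * BLOCK_SIZE
        if random.random() < 0.5:
            os.pread(fd, BLOCK_SIZE, offset)    # 50% reads
        else:
            os.pwrite(fd, write_buf, offset)    # 50% writes
        ops += 1

    os.close(fd)
    print(f"approximate IOPS: {ops / DURATION_S:.0f}")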

Experiments and Results

Our first set of experiments was designed to show the I/O scalability of ESX. We started with two virtual machines and doubled the number of virtual machines each time while keeping the number of outstanding I/Os constant at eight. Figure 2 shows how IOPS and I/O latency vary as the number of virtual machines increases.


Figure 2. I/O Operations Per Second and Latency of I/O Operations vs. Number of Virtual Machines

As seen in Figure 2, IOPS scale well with the number of virtual machines, while the latency of each I/O access increases only marginally.

In another set of experiments, we wanted to demonstrate the capabilities of ESX to absorb the burstiness of I/O workloads running in multiple virtual machines. On our ESX host we powered on 16 virtual machines and ran a series of tests, gradually increasing the number of outstanding I/Os from 1 to 12.


Figure 3. IOPS and Latency of I/O Operations vs. Number of Outstanding I/Os per LUN

As seen in Figure 3, the number of IOPS increased in a nearly linear fashion with the number of outstanding I/Os, as did the latency of each I/O access. However, up to six outstanding I/Os, the number of IOPS grew faster than the latency did. Beyond six outstanding I/Os, the latency increased faster than the number of IOPS, probably due to queuing in the storage array as its components were operating close to saturation.
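
This behavior follows the usual queuing relationship (Little’s Law): sustained IOPS is roughly the total number of outstanding I/Os divided by the average latency, so once latency starts climbing, additional outstanding I/Os stop buying proportional throughput. The numbers in the sketch below are illustrative placeholders, not measurements from these tests:

    # Little's Law applied to storage: IOPS ~= outstanding I/Os / average latency.
    # Placeholder numbers for illustration only.
    def expected_iops(total_outstanding_ios, avg_latency_ms):
        return total_outstanding_ios / (avg_latency_ms / 1000.0)

    # For example, 600 I/Os kept in flight across all LUNs at ~6 ms average
    # latency works out to about 100,000 IOPS; if latency doubles while the
    # in-flight count stays fixed, throughput is roughly halved.
    print(expected_iops(total_outstanding_ios=600, avg_latency_ms=6.0))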

To confirm that the increase in latency was not due to ESX but was instead caused by queuing in the storage array, we used Navisphere Analyzer to measure the response time of each LUN backing the test virtual disks at the different outstanding I/O settings. Figure 4 compares the latencies seen in the guest (measured using PerfMon) with those measured at the storage array.


Figure 4. Disk Response Time

As seen from the graph, the I/O latency measured in the guest is very close to the latency measured at the storage array. This indicates that there is no queuing at any layer (in the guest, ESX, or the HBAs) other than in the storage, and that the response time of an I/O access seen in the guest is mostly due to the response time of the storage.

Our experiments show that ESX can easily scale to above 100,000 IOPS. We could have gone well beyond 100,000, but that was as far as we could stretch it with the number of disks we were able to get at short notice. Increasing the number of outstanding I/Os did not help further, as it only increased the latency. 100,000 IOPS is a very high I/O rate to be driven by just one ESX host with several virtual machines running on it. The I/O latencies were still within acceptable limits and were mainly due to storage response time.

This study would not have been possible without the help of Bob Ng and Kris Chau, who set up the storage infrastructure in a very short time. When we were in need of additional disks to drive more I/O, Kathleen Sharp quickly got us the third storage array which helped us to complete our experiments. I would like to thank all of them and acknowledge that without their support, this work wouldn’t have been possible.

Summary

As enterprises move toward virtualized data centers, more and more virtual servers are being deployed on fewer physical systems running ESX. To facilitate a smooth migration toward virtualization, ESX has to be capable of meeting the I/O demands of virtual servers running a wide variety of applications. In this study, we have shown that ESX can easily support 100,000 IOPS for a random access pattern with a 50/50 read/write mix and an 8KB block size. These high-watermark tests show that the I/O stack in VMware ESX is highly scalable and can be used with confidence in data centers running workloads with heavy I/O profiles in virtual machines.

Configuration Details

Hardware:

Server:

  • 4 Intel Tigerton processors (a total of 16 cores)
  • 32GB of physical memory
  • Two on-board gigabit Ethernet controllers, two Intel gigabit network adapters
  • Two dual-port QLogic 2462 HBAs (4Gbps) and two single-port QLogic 2460 HBAs (4Gbps)

Storage:

  • Three CX3-80 arrays, each with 165 15K RPM Fibre Channel disks
  • Flare OS: 03.26.080.3.086
  • Read cache: 1GB (per storage processor)
  • Write cache: 3GB (per array)

ESX:

  • ESX 3.5 update 1

Virtual Machines:

  • 1 virtual processor
  • 1GB virtual memory
  • 1 virtual NIC with Intel e1000 driver
  • Guest OS: Windows Server 2003 Enterprise Edition (64-bit) with Service Pack 2

I/O Stress Tool:

  • Iometer version 2006.07.27

Storage Layout

To drive over 100,000 IOPS, all the available disks in the storage systems were used. A total of 100 virtual disks, each 40GB in size, were created and distributed among the virtual machines. These resided on 100GB LUNs: 98 were created on five-disk RAID 0 groups, while the remaining two were hosted on separate single-disk LUNs. All LUNs were formatted with the VMFS3 file system. The 4TB of virtual disk space eliminated any read-caching effect from the storage array.

A three-disk RAID 0 group was created on one of the storage arrays, and a 400GB LUN was created in this RAID group and formatted with the VMFS3 file system. A 6GB virtual disk was created in this VMFS partition for each virtual machine and used to hold its guest operating system.

Each storage processor on each storage array was connected to one of the six QLogic HBA ports on the server.
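
As a quick sanity check on this layout, the disk counts add up to exactly the 495 disks available, and the test working set dwarfs the arrays’ read caches. The sketch below assumes that each of the 98 data LUNs sits on its own dedicated five-disk RAID 0 group and that each array has two storage processors, neither of which is stated explicitly above:

    # Back-of-the-envelope accounting for the storage layout described above.
    # Assumptions (not spelled out in the post): one data LUN per five-disk
    # RAID 0 group, and two storage processors per CX3-80 array.
    data_disks = 98 * 5 + 2          # five-disk RAID 0 LUNs plus two single-disk LUNs
    boot_disks = 3                   # three-disk RAID 0 group for the OS virtual disks
    print(data_disks + boot_disks)   # 495 = 3 arrays x 165 disks

    # The test working set is far larger than the read cache, so random reads
    # mostly miss the cache and go to the spindles.
    virtual_disk_space_gb = 100 * 40   # 4TB of 40GB test virtual disks
    read_cache_gb = 3 * 2 * 1          # 1GB per storage processor, two SPs per array
    print(virtual_disk_space_gb, read_cache_gb)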

Iometer Setup

We ran the Iometer console on a client machine and Dynamo in each of the virtual machines. This enabled us to control the I/O workload in each virtual machine through one console. The outstanding I/Os and I/O access specifications were identical for each virtual machine for a given test case.

Tuning for Performance

You might be wondering what parameters we tuned to obtain this kind of performance. The answer will surprise most people: we tuned only three parameters to obtain more than 100,000 IOPS.

  • We increased the VMFS3 max heap size from 16MB to 64MB (KB article # 1004424).
  • We changed the storage processor’s cache high/low watermark from 80/60 to 40/20. This was done to write the dirty pages in storage cache more often so that Iometer write operations do not wait for free memory buffers.
  • We increased the guest queue length to 100 to make sure that the guest could queue all of the I/O requests Iometer generated for the test disks (see the quick check below).
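
That queue length follows from the experiment’s own numbers. Here is a rough check, assuming the 100 test virtual disks were spread evenly across the 16 virtual machines (the post does not state the exact distribution):

    # Why a guest queue length of 100 is comfortably sufficient for these tests.
    # Assumption: the 100 test virtual disks are spread roughly evenly across
    # the 16 virtual machines, i.e., six to seven per VM.
    disks_per_vm = 100 / 16                          # ~6.25 test disks per VM
    max_outstanding_per_disk = 12                    # highest setting used in these tests
    print(disks_per_vm * max_outstanding_per_disk)   # ~75 outstanding I/Os per VM, under 100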

VMware Performance Tutorial at Usenix 2008

I’m starting a new tutorial this year at Usenix — all about performance and tuning of VMware ESX server. The session is on Tuesday, June 24, 2008 in Boston, and is an all-day class.

Who should attend:
Anyone who is involved in planning or deploying virtualization on VMware ESX and wants to understand the performance characteristics of applications in a virtualized environment.

We will walk through the implications for performance and capacity planning in a virtualized world and learn how to achieve the best performance in a VMware ESX environment.


Take back to work: How to plan, understand, characterize, diagnose, and tune for best application performance on VMware ESX.

Topics include:

    • Introduction to virtualization
    • Understanding different hardware acceleration techniques for virtualization
    • Diagnosing performance using VMware performance tools, including esxtop
    • Diagnosing performance using guest OS tools in a virtual environment
    • Practical limits and overheads for virtualization
    • Storage performance
    • Network throughput and options
    • Using Virtual-SMP
    • Guest Operating System Types
    • Understanding the characteristics of key applications, including Oracle, MS SQL Server, and MS Exchange
    • Capacity planning techniques

The cost for the class is $695, and there is an early bird registration discount price of $645. Sign-up is via the Usenix registration site.

Please comment if you have additional ideas and topics that you want to have covered, and I’ll do my best to incorporate them into the content.

I’ll post further updates on my blog.

Richard

Sun Uses VMmark to Measure Power Consumption

A representative and well-understood benchmark like VMmark can be used as the basis for more elaborate experiments. As a case in point, our partners at Sun have been measuring the power consumption of their Sun Fire X4450 server while running VMmark with 8 tiles (48 total VMs). They have been kind enough to share their data with me and I have graphed it in the figure below:

[Figure: Power consumption of the Sun Fire X4450 over the course of an 8-tile VMmark run]

The first thing one notices is the roughly 1-hour ramp-up of power consumption from about 600 Watts to an average of about 830 Watts. (The max was roughly 850 Watts.) This is followed by 3 hours of steady-state usage and a sharp decline in consumption at the end. This mirrors a typical VMmark run where the workload VMs are ramped up in a staggered fashion followed by a 3-hour measurement interval before the benchmark ends. Intuitively, the power consumption of the server should rise with the increasing work being done, and it does. I find that confirmation itself quite valuable.

Much is made of the power savings potential of server consolidation using virtualization. I daresay that one would have a difficult time running 48 physical servers, no matter how efficient, on an average of 830 Watts. A simplistic analysis would show that, on average, each server could consume only 17.5 Watts. I’d like to see that server. Kudos to the folks at Sun for their excellent work in demonstrating this potential. If you want more information on the Sun Fire X4450, please check out the VMmark results page.
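
As an aside, the 17.5-Watt figure comes from dividing the steady-state draw across the 48 consolidated servers, using the approximate power numbers quoted above:

    # Per-server power budget if 48 physical servers had to share the
    # steady-state draw measured during the 8-tile run.
    print(830 / 48)   # ~17.3 W per server at the ~830 W average
    print(850 / 48)   # ~17.7 W per server even at the ~850 W peak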

VirtualCenter 2.5 Database White Paper Posted

The VMware® VirtualCenter database stores metadata on the state of a VMware Infrastructure environment and is a key component of VirtualCenter performance. VirtualCenter 2.5 features a number of enhancements aimed at greatly improving the performance and scalability of the VirtualCenter database. This paper presents the results of benchmarks we conducted to validate these performance enhancements and provides best practices for configuring a VirtualCenter database. The paper also provides guidance for sizing the server you use to host the VirtualCenter database based on these results. Although the new features in VirtualCenter 2.5 benefit users with any of the supported databases, the examples and performance data presented in this study are specific to Microsoft SQL Server, and the paper assumes that you have a working knowledge of SQL Server.

http://www.vmware.com/files/pdf/vc_database_performance.pdf

Dell Publishes First AMD Quad-Core VMmark Results

Our partners at Dell have published the first VMmark results using the new AMD quad-core Barcelona processors. Both the 2-socket (8-core) R805 platform and the 4-socket (16-core) R905 platform have been tested. You can find all of the details on the VMmark results page. If you do the math, you will see that Dell achieved an excellent 1.8x throughput scaling from the 2-socket system to the 4-socket system. Another thing I’d like to point out is that some of the VMmark workloads utilized AMD’s Rapid Virtualization Indexing (RVI) technology to improve performance. VMware supports a wide range of virtualization techniques and is able to uniquely leverage both hardware and software virtualization technologies in order to provide optimal performance.

VMmark 1.1 Released to Partners

Just a quick note to announce that we have released version 1.1 of the VMmark benchmark to our hardware partners. As many of you know, VMmark 1.0 utilized only 32-bit workloads, which was a reasonable mix when the benchmark was first defined roughly three years ago. However, 64-bit applications and OSes are becoming much more prevalent and we need the ability to characterize this more complex reality. To address this, we have transitioned three of the VMmark workloads – Java server, database server, and web server – to 64-bit. In order to maintain comparability with the existing version 1.0 results, we have retained the underlying virtual hardware definitions and load levels for each workload. We need to tie up a few remaining loose ends, but we intend to make VMmark 1.1 generally available very soon. Please stay tuned.