The performance of I/O is critical to achieving good overall performance on
enterprise applications. Workloads like transaction processing systems, web
applications, and mail servers are sensitive to the throughput and latency of
the I/O subsystem. In order for VMware ESX
to run these applications well, it needs to push large amounts of I/O without
adding significant latencies.
To demonstrate the scalability of the ESX I/O stack, we decided to see if
ESX could sustain 100,000 IOPS. Many enterprise applications access their data
in relatively small I/O blocks placed throughout the dataset. So the metric we want to focus on is random
I/O throughput, measured in I/O operations per second (IOPS), rather than raw
bandwidth. We used a workload that was
100% random with a 50/50 read/write mix and an 8KB block size.
The next step was to get our hands on enough storage to run the experiments
on a large scale. We went to the Midrange Partner Solutions Engineering team at the
EMC lab. They loaned us three CLARiiON CX3-80 storage arrays, each with 165 15K RPM
disks, for a total of 495 disks and 77TB of storage. Our experiments used the
Iometer I/O stress tool running in virtual machines on a server equipped with
ESX 3.5 Update 1. The server was a quad-core, quad-socket (16 cores total)
system with 32GB of physical memory.
We ramped up the I/O rate on the system while keeping a close eye on the I/O
latencies. We managed to achieve over 100K IOPS before running out of disk
bandwidth on the storage arrays. And we still had plenty of headroom to spare
on the server running ESX. To put this
into perspective, the 77TB of raw storage used in these experiments is enough
to hold the entire printed Library of Congress. You’d need to run 200,000
Microsoft Exchange mailboxes (LoadGen heavy user profile) or 85 average 4-way
database servers to generate an I/O rate of 100K IOPS.
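These equivalents work out from simple division of the totals quoted above. The per-mailbox and per-server rates below are derived here as a sanity check, not measured in the study:

```python
# Back-of-the-envelope check of the workload equivalents above.
TOTAL_IOPS = 100_000

mailboxes = 200_000   # Microsoft Exchange, LoadGen heavy user profile
db_servers = 85       # average 4-way database servers

iops_per_mailbox = TOTAL_IOPS / mailboxes
iops_per_db = TOTAL_IOPS / db_servers

print(f"{iops_per_mailbox:.2f} IOPS per mailbox")        # 0.50
print(f"{iops_per_db:.0f} IOPS per database server")     # 1176
```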
The sections below present a more detailed discussion of the set of
experiments we ran and provide additional information on the experimental setup.
Details of I/O Workload
We chose a workload representative of common transaction-oriented applications: an I/O pattern with an 8KB block size, 50% reads and 50% writes, and 100% random access. Enterprise applications such as Microsoft Exchange and transaction-oriented databases like Oracle and Microsoft SQL Server exhibit similar I/O patterns. Figure 1 shows a screen shot of the Iometer access specification.
Figure 1. Iometer Access Specification
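The access specification in Figure 1 can be paraphrased in a few lines of Python. This is a minimal sketch of the I/O pattern, not Iometer itself; the 40GB dataset size is taken from the per-VM test virtual disks described in the setup section:

```python
import random

# Sketch of the study's access pattern: 8KB blocks, 50% reads / 50% writes,
# 100% random placement across the dataset.
BLOCK_SIZE = 8 * 1024        # 8KB block size
DATASET_SIZE = 40 * 2**30    # one 40GB test virtual disk
NUM_BLOCKS = DATASET_SIZE // BLOCK_SIZE

def next_io():
    """Return one (offset, op) pair matching the 50/50 random specification."""
    offset = random.randrange(NUM_BLOCKS) * BLOCK_SIZE   # 100% random, aligned
    op = "read" if random.random() < 0.5 else "write"    # 50/50 read/write mix
    return offset, op
```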
Experiments and Results
Our first set of experiments was to show the I/O scalability of ESX. We started with two virtual machines and doubled the number of virtual machines each time while keeping the outstanding I/Os constant at eight. Figure 2 shows the variation of I/O and latency as the number of virtual machines increases.
Figure 2. I/O Operations Per Second and Latency of I/O Operations vs. Number of Virtual Machines
As seen in Figure 2, IOPS scale well with the number of virtual machines, while the latency of each I/O access increases only marginally.
In another set of experiments, we wanted to demonstrate the capabilities of ESX to absorb the burstiness of I/O workloads running in multiple virtual machines. On our ESX host we powered on 16 virtual machines and ran a series of tests, gradually increasing the number of outstanding I/Os from 1 to 12.
Figure 3. IOPS and Latency of I/O Operations vs. Number of Outstanding I/Os per LUN
As seen in Figure 3, the number of IOPS increased nearly linearly with the number of outstanding I/Os, as did the latency of I/O access. However, IOPS increased faster than latency up to six outstanding I/Os. Beyond six outstanding I/Os, latency increased faster than IOPS, probably due to queuing in the storage array as its components approached saturation.
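The relationship in Figure 3 follows Little's Law: sustained IOPS equals outstanding I/Os divided by average latency. The numbers in this sketch are illustrative assumptions, not the measured values from the study:

```python
# Little's Law ties together the three quantities plotted in Figure 3.
def iops(outstanding_ios: float, latency_s: float) -> float:
    """Sustained I/O rate for a given queue depth and average latency."""
    return outstanding_ios / latency_s

# Below saturation: doubling the queue depth barely moves latency,
# so throughput keeps scaling.
low = iops(4, 0.004)    # ~1,000 IOPS per LUN at 4ms
mid = iops(8, 0.005)    # ~1,600 IOPS, still scaling
# Near saturation: extra queue depth mostly becomes extra latency.
high = iops(12, 0.011)  # ~1,090 IOPS, latency now grows faster than IOPS
print(low, mid, high)
```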
To confirm that the increase in latency was due to queuing in the storage array rather than in ESX, we used Navisphere Analyzer to measure the response time of each LUN used for the test virtual disks at different outstanding I/O counts. Figure 4 compares the latencies seen in the guest (measured using PerfMon) with those seen in the storage array.
Figure 4. Disk Response Time
As seen from the graph, the I/O latency measured in the guest is very close to the latency measured in the storage array. This indicates that there is no significant queuing at any other layer (the guest, ESX, or the HBAs), and that the response time of an I/O access seen in the guest is mostly due to the response time of the storage.
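The comparison in Figure 4 amounts to a subtraction: whatever latency the storage array does not account for must come from queuing in the guest, ESX, or the HBAs. The numbers below are illustrative assumptions, not the study's measurements:

```python
# Per-queue-depth latency as seen from two vantage points.
guest_ms = [2.1, 3.9, 5.6, 7.4]    # guest view (PerfMon)
array_ms = [2.0, 3.8, 5.4, 7.2]    # same LUNs from Navisphere Analyzer

# Residual latency attributable to layers between guest and array.
residual = [g - a for g, a in zip(guest_ms, array_ms)]
print(residual)   # small, flat residuals -> no queuing outside the array
```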
Our experiments show that ESX can easily scale above 100,000 IOPS. We could have gone well beyond that, but it was as far as we could stretch with the number of disks we were able to get at short notice; increasing outstanding I/Os further only increased latency. 100,000 IOPS is in itself a very high I/O throughput to be driven by a single ESX host running several virtual machines. The I/O latencies remained within acceptable limits and were mainly due to storage response time.
This study would not have been possible without the help of Bob Ng and Kris Chau, who set up the storage infrastructure on very short notice. When we needed additional disks to drive more I/O, Kathleen Sharp quickly got us the third storage array, which allowed us to complete our experiments. I would like to thank all of them for their support.
As enterprises move toward virtualized data centers, more and more virtual servers are being deployed on fewer physical systems running ESX. To facilitate a smooth migration to virtualization, ESX has to be capable of meeting the I/O demands of virtual servers running a wide variety of applications. In this study, we have shown that ESX can easily support 100,000 IOPS for a random access pattern with a mix of reads and writes and an 8KB block size. These high-watermark tests show that the I/O stack in VMware ESX is highly scalable and can be used with confidence in data centers running workloads with heavy I/O profiles in virtual machines.
Experimental Setup
Server Hardware:
- 4 Intel Tigerton processors (a total of 16 cores)
- 32GB of physical memory
- Two on-board gigabit Ethernet controllers, two Intel gigabit network adapters
- Two dual-port QLogic 2462 HBAs (4Gbps) and two single-port QLogic 2460 HBAs (4Gbps)
Storage Hardware:
- Three CX3-80 arrays, each with 165 15K RPM Fibre Channel disks
- FLARE OS: 03.26.080.3.086
- Read cache: 1GB (per storage processor)
- Write cache: 3GB (per array)
Hypervisor:
- ESX 3.5 Update 1
Virtual Machine Configuration:
- 1 virtual processor
- 1GB virtual memory
- 1 virtual NIC with Intel e1000 driver
- Guest OS: Windows Server 2003 Enterprise Edition (64-bit) with Service Pack 2
I/O Stress Tool:
- Iometer version 2006.07.27
To drive over 100,000 IOPS, all the available disks in the storage systems were used. A total of 100 virtual disks, each 40GB in size, were created and distributed among the virtual machines. These resided on 100GB LUNs: 98 were created on five-disk RAID 0 groups, while two were hosted on separate single disks. All LUNs were formatted with the VMFS3 file system. The 4TB of virtual disk space eliminated any read-caching effect in the storage array.
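A quick arithmetic check on this layout follows; the per-spindle rule of thumb in the last step is our assumption, not a figure from the study:

```python
# Sanity check on the storage layout described above.
test_luns_raid0 = 98     # LUNs on five-disk RAID 0 groups
disks_per_raid0 = 5
single_disk_luns = 2     # LUNs hosted on separate single disks

test_disks = test_luns_raid0 * disks_per_raid0 + single_disk_luns
print(test_disks)                   # 492 of the 495 disks carry test data

virtual_disks = 100
vdisk_gb = 40
print(virtual_disks * vdisk_gb)     # 4000GB: the 4TB that defeats read caching

# Per-disk load at 100K IOPS: roughly 200 random IOPS per spindle, near the
# commonly assumed limit for a 15K RPM Fibre Channel disk.
print(round(100_000 / test_disks))  # 203
```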
A three-disk RAID 0 group was created in one of the storage arrays and a 400GB LUN was created in this RAID group. The LUN was then formatted with the VMFS3 file system. A 6GB virtual disk was created for each virtual machine in this VMFS partition. These virtual disks were used to install the guest operating system for each virtual machine.
The two storage processors on each storage array were each connected to one of
the six QLogic HBA ports on the server.
We ran the Iometer console on a client machine and Dynamo in each of the
virtual machines. This enabled us to control the I/O workload in every
virtual machine from a single console. The outstanding I/Os and I/O
access specifications were identical for each virtual machine in a
given test case.
Tuning for Performance
You might be wondering what parameters we tuned to obtain this kind of performance. The answer may surprise you: we tuned only three parameters to exceed 100,000 IOPS.
- We increased the VMFS3 max heap size from 16MB to 64MB (KB article # 1004424).
- We changed the storage processors' cache high/low watermarks from 80/60 to 40/20. This flushes dirty pages from the storage cache more often, so that Iometer write operations do not wait for free cache buffers.
- We increased the guest queue length to 100 to make sure that the guest was capable of queuing all the I/O accesses generated by Iometer to the test disks.
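A back-of-the-envelope check shows why a guest queue length of 100 suffices, under the assumption (ours, not stated in the study) that the 100 test disks were spread roughly evenly across the 16 virtual machines:

```python
import math

# Worst-case number of test disks attached to any one virtual machine.
test_disks, vms = 100, 16
max_disks_per_vm = math.ceil(test_disks / vms)   # 7

# Highest outstanding-I/O setting used in the tests (per LUN).
max_oio_per_lun = 12

# Peak I/Os a single guest might have queued at once.
peak_queued = max_disks_per_vm * max_oio_per_lun
print(peak_queued)   # 84, comfortably within the queue length of 100
```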