VMware


May 22, 2008

100,000 I/O Operations Per Second, One ESX Host

The performance of I/O is critical to achieving good overall performance on enterprise applications. Workloads like transaction processing systems, web applications, and mail servers are sensitive to the throughput and latency of the I/O subsystem. In order for VMware ESX to run these applications well, it needs to push large amounts of I/O without adding significant latencies.

To demonstrate the scalability of the ESX I/O stack, we decided to see if ESX could sustain 100,000 IOPS. Many enterprise applications access their data in relatively small I/O blocks scattered throughout the dataset, so the metric we focus on is random I/O throughput, measured in I/O operations per second (IOPS), rather than raw bandwidth. We used a workload that was 100% random with a 50/50 read/write mix and an 8KB block size.
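As a rough sanity check on what this target means in bandwidth terms, the short Python sketch below converts an I/O rate at a fixed block size into bytes per second. The function name is ours, chosen for illustration, and the calculation assumes every I/O is exactly 8KB (8,192 bytes). It is back-of-the-envelope arithmetic, not a measurement from the experiments.

```python
def iops_to_bandwidth(iops: float, block_size_bytes: int) -> float:
    """Return bytes per second moved at a given I/O rate and block size."""
    return iops * block_size_bytes


BLOCK_SIZE = 8 * 1024   # 8KB blocks, as in the workload described above
TARGET_IOPS = 100_000   # the I/O rate we set out to sustain
READ_FRACTION = 0.5     # 50/50 read/write mix

total = iops_to_bandwidth(TARGET_IOPS, BLOCK_SIZE)
reads = total * READ_FRACTION
writes = total - reads

MIB = 2 ** 20
print(f"total : {total / MIB:.2f} MiB/s")
print(f"reads : {reads / MIB:.2f} MiB/s")
print(f"writes: {writes / MIB:.2f} MiB/s")
```

At 100,000 IOPS this works out to roughly 781 MiB/s in aggregate, split evenly between reads and writes. Because the access pattern is random, a disk spends most of each I/O positioning rather than transferring data, so the I/O rate, not this bandwidth figure, is the number to watch.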

The next step was to get our hands on enough storage to run the experiments on a large scale. We went to the Midrange Partner Solutions Engineering team at EMC, Santa Clara and they were kind enough to let us use the storage infrastructure in their lab. They loaned us three CLARiiON CX3-80 storage arrays, each with 165 15K RPM disks, for a total of 495 disks and 77TB of storage. Our experiments used the Iometer I/O stress tool running in virtual machines on a server equipped with ESX 3.5 Update 1. The server was a quad-core, quad-socket (16 cores total) system with 32GB of physical memory.

We ramped up the I/O rate on the system while keeping a close eye on the I/O latencies. We managed to achieve over 100K IOPS before running out of disk bandwidth on the storage arrays. And we still had plenty of headroom to spare on the server running ESX. To put this into perspective, the 77TB of raw storage used in these experiments is enough to hold the entire printed Library of Congress. You'd need to run 200,000 Microsoft Exchange mailboxes (LoadGen heavy user profile) or 85 average 4-way database servers to generate an I/O rate of 100K IOPS.
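The comparisons above imply a per-unit I/O rate, which can be recovered with simple division. The sketch below does that; the helper name is hypothetical, and the results are only what the stated equivalences imply, not independently measured per-mailbox or per-server rates.

```python
def per_unit_iops(total_iops: float, units: int) -> float:
    """I/O rate each unit must generate for `units` of them to reach total_iops."""
    return total_iops / units


TARGET_IOPS = 100_000

# Figures taken from the comparison in the text above.
EXCHANGE_MAILBOXES = 200_000   # LoadGen heavy user profile
DATABASE_SERVERS = 85          # "average 4-way database servers"

mailbox_rate = per_unit_iops(TARGET_IOPS, EXCHANGE_MAILBOXES)
server_rate = per_unit_iops(TARGET_IOPS, DATABASE_SERVERS)

print(f"implied IOPS per Exchange mailbox: {mailbox_rate:.2f}")
print(f"implied IOPS per database server : {server_rate:.0f}")
```

In other words, the comparison assumes about half an I/O per second per heavy mailbox and a little under 1,200 IOPS per database server.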

The sections below present a more detailed discussion of the set of experiments we ran and provide additional information on the experimental configuration.

Details of I/O Workload

We chose a workload that is representative of the most common transaction-oriented applications. We defined an I/O pattern with an 8KB block size, 50% reads and 50% writes, and 100% random access. Enterprise applications like Microsoft Exchange and transaction-oriented databases like Oracle and Microsoft SQL Server use similar I/O patterns. Figure 1 shows a screenshot of the Iometer access specification.


Figure 1. Iometer Access Specification

Experiments and Results

Our first set of experiments was designed to show the I/O scalability of ESX. We started with two virtual machines and doubled the number of virtual machines each time while keeping the outstanding I/Os constant at eight. Figure 2 shows how IOPS and latency vary as the number of virtual machines increases.


Figure 2. I/O Operations Per Second and Latency of I/O Operations vs. Number of Virtual Machines

As seen in Figure 2, IOPS scale well with the number of virtual machines, while the latency of each I/O access increases only marginally.

In another set of experiments, we wanted to demonstrate the capabilities of ESX to absorb the burstiness of I/O workloads running in multiple virtual machines. On our ESX host we powered on 16 virtual machines and ran a series of tests, gradually increasing the number of outstanding I/Os from 1 to 12.


Figure 3. IOPS and Latency of I/O Operations vs. Number of Outstanding I/Os per LUN

As seen in Figure 3, the number of IOPS increased in a nearly linear fashion as the number of outstanding I/Os increased, as did the latency of I/O access. However, up to six outstanding I/Os, IOPS rose faster than latency. Beyond six outstanding I/Os, latency rose faster than IOPS, probably due to queuing in the storage array as its components were operating close to saturation.
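The relationship between IOPS, latency, and outstanding I/Os is the one Little's Law describes: the average number of I/Os in flight equals the completion rate multiplied by the average latency. The sketch below is purely illustrative. The operating points are hypothetical numbers chosen to show the shape of the curve, not readings from Figure 3, and it assumes all 100 test LUNs are driven at the same queue depth.

```python
def latency_ms(outstanding_ios: float, iops: float) -> float:
    """Little's Law: average latency = I/Os in flight / completion rate."""
    return outstanding_ios / iops * 1000.0


TEST_LUNS = 100  # assumption: every test LUN is driven at the same queue depth

# Hypothetical operating points: (outstanding I/Os per LUN, aggregate IOPS).
# Made-up values for illustration only; they are not measured results.
points = [(2, 50_000), (6, 100_000), (12, 105_000)]

for oio_per_lun, iops in points:
    in_flight = oio_per_lun * TEST_LUNS
    print(f"{oio_per_lun:2d} outstanding per LUN, {iops:7,d} IOPS -> "
          f"{latency_ms(in_flight, iops):.1f} ms average latency")
```

The last point shows the saturation effect: once the array cannot complete I/Os any faster, adding more outstanding I/Os mostly lengthens the queue, so latency climbs while IOPS barely moves.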

To confirm that the increase in latency was caused by queuing in the storage array rather than by ESX, we used Navisphere Analyzer to measure the response time of each storage array LUN backing the test virtual disks at each outstanding I/O setting. Figure 4 compares the corresponding latencies seen in the guest (measured using PerfMon) with those seen in storage.


Figure 4. Disk Response Time

As seen from the graph, the I/O latency measured in the guest is very close to the latency measured in storage. This indicates that there is no queuing at any layer (in the guest, ESX, or the HBAs) other than storage and the response time of an I/O access seen in the guest is mostly due to the response time of the storage.

Our experiments show that ESX can easily scale to above 100,000 IOPS. We could have gone well beyond 100,000, but that was as far as we could stretch it with the number of disks we were able to get at short notice. Increasing outstanding I/Os did not help further as it only increased the latency. 100,000 IOPS in itself is a very high I/O throughput being driven by just one ESX host with several virtual machines running on it. The I/O latencies were still within acceptable limits and were mainly due to storage response time.

This study would not have been possible without the help of Bob Ng and Kris Chau, who set up the storage infrastructure in a very short time. When we were in need of additional disks to drive more I/O, Kathleen Sharp quickly got us the third storage array which helped us to complete our experiments. I would like to thank all of them and acknowledge that without their support, this work wouldn’t have been possible.

Summary

As enterprises move towards a virtualized data center, more and more virtual servers are being deployed on fewer physical systems running ESX. To facilitate a smooth migration towards virtualization, ESX has to be capable of meeting the I/O demands of virtual servers running a wide variety of applications. In this study, we have shown that ESX can easily support 100,000 IOPS for random access patterns with a mix of reads and writes and a block size of 8KB. These high-water-mark tests show that the I/O stack in VMware ESX is highly scalable and can be used with confidence in data centers running workloads with heavy I/O profiles in virtual machines.

Configuration Details

Hardware:

Server:

  • 4 Intel Tigerton processors (a total of 16 cores)
  • 32GB of physical memory
  • Two on-board gigabit Ethernet controllers, two Intel gigabit network adapters
  • Two dual-port QLogic 2462 HBAs (4Gbps) and two single-port QLogic 2460 HBAs (4Gbps)

Storage:

  • Three CX3-80 arrays, each with 165 15K RPM Fibre Channel disks
  • Flare OS: 03.26.080.3.086
  • Read cache: 1GB (per storage processor)
  • Write cache: 3GB (per array)

ESX:

  • ESX 3.5 Update 1

Virtual Machines:

  • 1 virtual processor
  • 1GB virtual memory
  • 1 virtual NIC with Intel e1000 driver
  • Guest OS: Windows Server 2003 Enterprise Edition (64-bit) with Service Pack 2

I/O Stress Tool:

  • Iometer version 2006.07.27

Storage Layout

To drive over 100,000 IOPS, all the available disks in the storage systems were used. A total of 100 virtual disks, each 40GB in size, were created and distributed among the virtual machines. Each virtual disk was placed on its own 100GB LUN: 98 of the LUNs were created on five-disk RAID 0 groups, while the other 2 were hosted on separate single disks. All LUNs were formatted with the VMFS3 file system. The 4TB of virtual disk space eliminated any read-caching effect from the storage array.

A three-disk RAID 0 group was created in one of the storage arrays and a 400GB LUN was created in this RAID group. The LUN was then formatted with the VMFS3 file system. A 6GB virtual disk was created for each virtual machine in this VMFS partition. These virtual disks were used to install the guest operating system for each virtual machine.

Each of the two storage processors on each storage array was connected to one of the six QLogic HBA ports on the server.
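The layout described above can be cross-checked with a little arithmetic. The Python sketch below tallies the disks, compares the test dataset with the array read cache, and estimates the per-disk I/O rate at the 100,000 IOPS target. Variable names are ours, and the per-disk figure assumes the load is spread evenly across all test disks, which is a simplification.

```python
ARRAYS = 3
DISKS_PER_ARRAY = 165
SPS_PER_ARRAY = 2            # two storage processors per array
READ_CACHE_GB_PER_SP = 1     # from the configuration details

total_disks = ARRAYS * DISKS_PER_ARRAY

# 98 LUNs on five-disk RAID 0 groups plus 2 LUNs on single disks.
test_disks = 98 * 5 + 2 * 1
# One three-disk RAID 0 group holds the guest operating system disks.
os_disks = 3

virtual_disk_gb = 100 * 40   # 100 virtual disks of 40GB each
read_cache_gb = ARRAYS * SPS_PER_ARRAY * READ_CACHE_GB_PER_SP

# Assumption: I/O is spread evenly over every test disk.
iops_per_test_disk = 100_000 / test_disks

print(f"disks accounted for : {test_disks + os_disks} of {total_disks}")
print(f"test dataset        : {virtual_disk_gb} GB")
print(f"read cache / dataset: {read_cache_gb / virtual_disk_gb:.2%}")
print(f"IOPS per test disk  : {iops_per_test_disk:.0f}")
```

The disk count adds up to all 495 drives, and the combined read cache is a small fraction of one percent of the test dataset, which is consistent with the claim that read caching had no meaningful effect on the results.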

Iometer Setup

We ran Iometer console on a client machine and Dynamo in each of the virtual machines. This enabled us to control the I/O workload in each virtual machine through one console. The outstanding I/Os and I/O access specifications were identical for each virtual machine for a given test case.

Tuning for Performance

You might be wondering what parameters we tuned to obtain this kind of performance. The answer will surprise most: we tuned only three parameters to obtain 100,000+ IOPS.

  • We increased the VMFS3 max heap size from 16MB to 64MB (KB article # 1004424).
  • We changed the storage processor’s cache high/low watermark from 80/60 to 40/20. This was done to write the dirty pages in storage cache more often so that Iometer write operations do not wait for free memory buffers.
  • We increased the guest queue length to 100 to make sure that the guest was capable of queuing all the I/O accesses generated by Iometer to the test disks.


Comments

Shawn

Could someone put those numbers into context?
For example, what would the I/O per second be for a typical 500-user Exchange server, or any other common server workload?
Thanks.

chethan

To put the storage size and throughput into perspective:
- 77TB of disk space is enough to hold the entire printed Library of Congress
- 200,000 Microsoft Exchange mailboxes (heavy loadgen user)
- 150 average 4-way database servers

Chethan

Correction to my previous comment: number of average 4-way database servers that can generate 100K IOPs is about 85, not 150 as mentioned before.

Todd Muirhead

Great work! I wouldn't have thought that you could get that many IOPS from a single ESX Server. How did you guys come up with the idea to do the test?

I did some testing with Exchange 2007 in VMs and found that storage performance was very similar to what you would see with a physical server based Exchange 2007, but it wasn't anywhere near this level of IOPS.

Todd

Chad Sakac

Disclosure: I'm an EMC employee.

The idea for the test came from VMware - they often need to deal with the "VMware IO doesn't scale" uncertainty around large scale apps in VMware (and sometimes competitive FUD) in the field.

They approached us (EMC) and asked "what do you guys say?", answer: we're always game for something fun like this!!!! (BTW - VMware is always fair and even handed - the offer is open to the other storage vendors also)

Crazy thing is we're just getting started. Here we saturated our mid-range arrays. We just shipped a high-end array with Enterprise Solid-State disks to the same lab, and are ready for round 2. I bet we'll eventually hit 200K or even more.

Now, these are a bit ridiculous numbers, but they make the point - VMware scales to a wide, wide range of workloads and IO profiles (I have yet to find one where done right it doesn't work).

You can get great results even with low-cost, low-end configurations. But when you need to scale, VMware is ready for you.

question

Why do you need to scale the number of VMs in order to scale the number of IOPS? (as shown in first bar chart)

If I want to run a single SQL Server database it will have to reside inside a single VM.

Can you run/post information on the max IOPS I can get from a single VM?

Doug

In my experience, people want the network, storage and virtualization layer to handle application architecture bottlenecks. Throwing more power at a problem may 'fix' the problem, but normally just masks it with brute force.

I think this experiment does a great job demonstrating that many of the complaints people have regarding I/O performance in VMs are not tied to virtualization overhead in the disk I/O stack.

To me, an in-depth look at the application architecture is warranted, but that can be a lot of work and I'm not sure there is broad understanding in the app/dev community regarding best practices for coding apps that will run in virtualized environments.

Eric Schoenfeld

Why is it that VMware is publishing tests on a single ESX host?

Do VMware customers run a single ESX server?

Show us the same results in a cluster implementation with 4-5 ESX nodes and VMFS and while you're at it, please take your measurements while doing typical administrative things, like creating VMs, extending VMDKs, VMotion etc. Last time I checked VMware users do these things. No?

OzzyJohn

Based on the comment

"A total of 100 virtual disks, each 40GB in size, were created and distributed among the virtual machines. These were on 100GB LUNs, 98 of which were created on five-disk RAID 0 groups while 2 of them were on LUNs hosted on separate single disks."

It would appear that each vmdk used in the performance test was on its own datastore, none of the datastores had any competing traffic from other ESX servers (eg. scsi-reserve/release)

If I were going to propose such a configuration I'd probably use RDMs.

Interesting, but hardly what I would call a useful benchmark that enables efficient architecture/design decisions, so it would appear this is more of a marketing than a technical exercise.

It's also notable that there was no RAID protection (RAID-0, not RAID-10), but that's more of an array performance matter and, again, hardly a good indication of the performance of the array either.

I wonder what would have happened if the vmdks had been on a smaller number of shared 300-500Gb datastores across three or four esx hosts which is far more typical.

Of course one takeaway from this is that in order to get the best performance you shouldn't put more than one vmdk per vmfs datastore, or have I misinterpreted the results?

For me, what was and was not tested raises as many questions as answers

Chad Sakac

The last few comments, frankly, are right on the money. This exercise was NOT designed to produce a best-practices recommendation, or even to reflect a common customer configuration.

Common configurations, as people point out, are usually ESX clusters (which wouldn't have a material effect on the outcome), use some level of RAID protection (10, 5/50, 6), and have VMFS volume contention. There is some misunderstanding, though, about when SCSI reservations are actually used: they are used not during most IO, but during VMFS metadata update operations such as creating/deleting a virtual disk, extending a VMFS volume, or creating/deleting snapshots (this is detailed here: http://blogs.vmware.com/performance/2008/02/scalable-storag.html). In general, RDMs and VMFS have similar performance envelopes given no contention (less to do with SCSI reservations and more to do with backend spindle contention), but agreed that it's rare to have a 1:1 mapping of VMs to VMFS datastores.

The purpose wasn't purely marketing either - it's a real question that people raise often, and common misperception: "I hear that the ESX I/O subsystem doesn't scale/perform" - not true.

This was absolutely an effort to apply unreasonable brute force to show that even at the most extreme conditions, that misconception is that - a misconception.

Does this answer more questions about the right way to do it? Yes - but those questions are answered in the Best Practices guides from VMware and the I/O subsystem vendors (trying to be open and fair - I certainly can provide the EMC ones, and I'm sure others have some that are similar), and the best practices for specific large I/O use cases (we publish joint Exchange 2007, SQL Server 2005, Oracle in VI3.5 configuration best practice guides and applied technology guides - we just published one for 16K Exchange 2007 users in 500 user increments as an example).

There was only one question that this was designed to answer: "how does the I/O subsystem of an ESX server scale, with all other limits removed?", and we have an answer - beyond 100K IOPs with the given IO workload (up to the limit of the I/O backend we were able to provide for the test).

Joe

"We managed to achieve over 100K IOPS before running out of disk bandwidth on the storage arrays. And we still had plenty of headroom to spare on the server running ESX. "

Can you comment on the measured CPU utilization of the 16-core server as the IO load was scaled up?

Joe Moore

I wonder what the latency and aggregate throughput would be if this were a dedicated OS (i.e. running Linux or Windows directly on the bare metal, rather than on top of ESX)

In other words, what's the added overhead of the virtualized I/O system?

--Joe

Kaushik Banerjee

The added overhead in terms of latency by ESX is extremely small. This is apparent from Fig 4. which shows the latency from the array using monitoring within the array and the latency as seen by perfmon within the guest OS on ESX. As can be seen, those numbers are pretty close.

Travis Wood

Where was the guest queue length increased? Was this the Disk.SchedNumReqOutstanding setting or a queue length within Windows?

Chethan

Guest queue length was increased in the Windows guest (through registry settings) to increase the maximum number of guaranteed concurrent I/Os.

JR

"Two dual-port QLogic 2462 HBAs (4GBps) and two single port QLogic 2460 HBAs (4GBps)
"

The maximum number of hw iscsi initiators in ESX 3.5 is 2. How did you manage 6??? I've found that I can maximize my I/O by alternating initiator preference across high demand volumes. If IO is spread between 2 or more volumes via a single hw initiator, it runs into the 1Gbs limit of the hba. More than 2 initiators would provide greater IO to 3 or more volumes per ESX server and therefore per VM.

6 initiators would fit nicely on an ESX server hosting a vm that could use those greater IO's per volume. MSSQL server, for example, would be able to utilize that additional IO for 4 volumes. (OS, data, logs, tempdb)

Is there a way to use more than 2 hw initiators or did I misinterpret the config?

Chethan

JR,

You are referring to the maximum number of hardware iSCSI initiators a single ESX host (3.5) can support. We used Fibre Channel HBAs in our experiments (QLogic 2462 and 2460). A single ESX Server (3.5) can have a maximum of 16 HBAs (the 4 HBAs in our experiments are well within this limit).

IT_Architect

>The purpose wasn't purely marketing either - it's a real question that people raise often, and common misperception: "I hear that the ESX I/O subsystem doesn't scale/perform" - not true.<

Thank you for that test, and yes LOTS of people run a single ESXi box. I realize the site it is coming from, but I do need something before I start my own testing with a local array.

Mike Ault

As to whether or not 100,000 IOPS can be reached in the real world, using a standard set of TPC-H queries that simulate a large data warehouse (300GB or larger) you can easily reach 100,000 IOPS if your storage subsystem can support it. Between index, data and temporary space activity in Oracle, for example, you can easily peak at over 100,000 IOPS if your latency will allow it.

Dave Mc

I'm dealing with a disk IO issue. We are a VM hosting provider with multiple customers, 100+VMs, 4 ESX 3.5 hosts and a 4GB FC SAN. The problem is we have individual VMs which are monopolizing our IO to the point that the whole storage system is slowed, impacting all customers. This is a result of some of the architecture decisions we made but I’m hoping to find a solution. I can't come up with a way to limit an individual VM's ability to use all the storage IO and create a slow down system wide.

We have a FC SAN with two large 30 drive RAID arrays which we then hand to our storage virtualization layer in 2TB LUNS. Our storage virtualization tool then hands LUNS to our ESX cluster with each customer getting their own virtual LUN (each customer may have 1-12VMs). The virtualization tool can combine storage from different SAN LUNS into new LUNS for the ESX cluster. Some of the virtual LUNS are combined from SAN LUNS and a couple of them cross arrays.

Solution needed: method to limit the IO/Throughput of an individual VM so that it doesn't use all/most of the IO available. The traditional "shares" solution won't work because it only limits the proportion of IO that the VM gets within its resource pool. It doesn't limit the total IO that a VM can use. Therefore, if my storage is capable of 7000IOPS, the VM will still push the storage to 7000IOPS it might just have to share some time with other VMs in the pool. Another alternative is to put these problem VMs on dedicated spindles but this is expensive and not very scalable (it is also reactionary).

Would the new Adaptive Queue depth algorithm do me any good (I don’t have 3Par)? If I understand the article, this would slow down all the IO on one host, thereby allowing the other hosts to have more IO but it doesn’t allow me to get granular to a VM level. Therefore, I don’t think it is the solution I need. http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008113

Other suggestions?
