100,000 I/O Operations Per Second, One ESX Host

The performance of I/O is critical to achieving good overall performance on
enterprise applications. Workloads like transaction processing systems, web
applications, and mail servers are sensitive to the throughput and latency of
the I/O subsystem. In order for VMware ESX
to run these applications well, it needs to push large amounts of I/O without
adding significant latencies.

To demonstrate the scalability of the ESX I/O stack, we decided to see if
ESX could sustain 100,000 IOPS. Many enterprise applications access their data
in relatively small I/O blocks placed throughout the dataset. So the metric we want to focus on is random
I/O throughput, measured in I/O operations per second (IOPS), rather than raw
bandwidth.  We used a workload that was
100% random with a 50/50 read/write mix and an 8KB block size.

The next step was to get our hands on enough storage to run the experiments
on a large scale. We went to the Midrange Partner Solutions Engineering team at
EMC in Santa Clara, and they were kind enough to let us use the storage infrastructure in their
lab. They loaned us three CLARiiON CX3-80 storage arrays, each with 165 15K RPM
disks, for a total of 495 disks and 77TB of storage. Our experiments used the
Iometer I/O stress tool running in virtual machines on a server equipped with
ESX 3.5 Update 1. The server was a quad-core, quad-socket (16 cores total)
system with 32GB of physical memory.

We ramped up the I/O rate on the system while keeping a close eye on the I/O
latencies. We managed to achieve over 100K IOPS before running out of disk
bandwidth on the storage arrays. And we still had plenty of headroom to spare
on the server running ESX. To put this
into perspective, the 77TB of raw storage used in these experiments is enough
to hold the entire printed Library of Congress. You’d need to run 200,000
Microsoft Exchange mailboxes (LoadGen heavy user profile) or 85 average 4-way
database servers to generate an I/O rate of 100K IOPS.
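
To sanity-check those comparisons, the arithmetic implied by the numbers above is
straightforward. The short Python sketch below simply restates the figures quoted in the
text; it is illustrative only, not an additional measurement.

    # Back-of-the-envelope check of the "perspective" figures quoted above.
    total_iops = 100_000

    # 200,000 Exchange mailboxes (LoadGen heavy profile) generating 100K IOPS
    # works out to roughly 0.5 IOPS per mailbox.
    print(total_iops / 200_000)    # 0.5

    # 85 average 4-way database servers generating 100K IOPS works out to
    # roughly 1,200 IOPS per server.
    print(round(total_iops / 85))  # 1176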

The sections below present a more detailed discussion of the set of
experiments we ran and provide additional information on the experimental
configuration.

Details of I/O Workload

We chose a workload representative of common transaction-oriented applications: an I/O pattern with an 8KB block size, a 50/50 read/write mix, and 100% random access. Enterprise applications like Microsoft Exchange and transaction-oriented databases like Oracle and Microsoft SQL Server use similar I/O patterns. Figure 1 shows a screenshot of the Iometer access specification.

Figure 1. Iometer Access Specification
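
For readers who do not have Iometer handy, the sketch below shows in Python what that access specification amounts to: every request targets a random, block-aligned 8KB offset in the test disk, and each request is a read or a write with equal probability. The file path and request count are placeholders, and this is only an illustration of the pattern; it is not how Iometer itself generates load (Iometer keeps a fixed number of requests outstanding per worker, which is covered in the experiments below).

    import os
    import random

    BLOCK_SIZE = 8 * 1024        # 8KB requests, matching the Iometer spec
    READ_FRACTION = 0.5          # 50% reads, 50% writes
    TEST_FILE = "testdisk.dat"   # placeholder path, not from the study

    def run_random_io(path, num_requests):
        """Issue 100% random 8KB I/O with a 50/50 read/write mix."""
        num_blocks = os.path.getsize(path) // BLOCK_SIZE
        payload = os.urandom(BLOCK_SIZE)
        with open(path, "r+b", buffering=0) as f:
            for _ in range(num_requests):
                # Pick a random block-aligned offset anywhere in the file.
                f.seek(random.randrange(num_blocks) * BLOCK_SIZE)
                if random.random() < READ_FRACTION:
                    f.read(BLOCK_SIZE)
                else:
                    f.write(payload)

    # run_random_io(TEST_FILE, 10_000)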

Experiments and Results

Our first set of experiments was designed to show the I/O scalability of ESX. We started with two virtual machines and doubled the number of virtual machines each time while keeping the outstanding I/Os constant at eight. Figure 2 shows the variation of IOPS and latency as the number of virtual machines increases.

Figure 2. I/O Operations Per Second and Latency of I/O Operations vs. Number of Virtual Machines

As seen in Figure 2, IOPS scale well with the number of virtual machines, while the latency of each I/O access increases only marginally.

In another set of experiments, we wanted to demonstrate the capabilities of ESX to absorb the burstiness of I/O workloads running in multiple virtual machines. On our ESX host we powered on 16 virtual machines and ran a series of tests, gradually increasing the number of outstanding I/Os from 1 to 12.

Figure 3. IOPS and Latency of I/O Operations vs. Number of Outstanding I/Os per LUN

As seen in Figure 3, the number of IOPS increased in a nearly linear fashion with the number of outstanding I/Os, as did the latency of I/O access. However, IOPS grew faster than latency up to six outstanding I/Os. Beyond six outstanding I/Os, latency increased faster than IOPS, probably due to queuing in the storage array as its components approached saturation.
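
One way to reason about the shape of these curves is Little's Law: with a fixed number of requests kept outstanding against each LUN, throughput and latency are tied together roughly as IOPS per LUN = outstanding I/Os per LUN / average latency. While the array has headroom, additional outstanding I/Os mostly become additional IOPS; once it saturates, they mostly become additional latency. The sketch below uses made-up latency values purely to illustrate that relationship; it is not the measured data behind Figure 3.

    # Little's Law for a closed-loop load generator:
    #   IOPS per LUN ~= outstanding I/Os per LUN / average latency (seconds)
    # The latencies below are illustrative placeholders, not measured values.

    def iops_per_lun(outstanding_per_lun, latency_ms):
        return outstanding_per_lun / (latency_ms / 1000.0)

    num_test_luns = 100  # see "Storage Layout" below

    illustrative_latency_ms = {1: 5.0, 2: 5.2, 4: 5.6, 6: 6.2, 8: 7.5, 12: 11.5}

    for oio, latency in illustrative_latency_ms.items():
        aggregate = num_test_luns * iops_per_lun(oio, latency)
        print(f"{oio:2d} outstanding I/Os per LUN, {latency:4.1f} ms "
              f"-> ~{aggregate:,.0f} aggregate IOPS")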

To confirm that the increase in latency was due to queuing in the storage array rather than in ESX, we used Navisphere Analyzer to measure the response time of each LUN backing the test virtual disks at different numbers of outstanding I/Os. Figure 4 compares the latencies seen in the guest (measured using PerfMon) with those reported by the storage array.

Figure 4. Disk Response Time

As seen from the graph, the I/O latency measured in the guest is very close to the latency measured at the storage array. This indicates that there is no significant queuing at any layer other than storage (in the guest, ESX, or the HBAs) and that the response time of an I/O access seen in the guest is dominated by the response time of the storage.
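
The comparison in Figure 4 amounts to a simple subtraction: any latency contributed by the guest, ESX, or the HBAs appears as the difference between what PerfMon reports inside the guest and what Navisphere Analyzer reports at the array. The numbers below are placeholders used only to show how such a comparison is read; they are not the values plotted in Figure 4.

    # Placeholder latencies (ms), not the measured values from Figure 4.
    guest_latency_ms = {1: 5.1, 4: 6.0, 8: 9.3}   # PerfMon, inside the guest
    array_latency_ms = {1: 5.0, 4: 5.8, 8: 9.0}   # Navisphere Analyzer, at the array

    for oio in guest_latency_ms:
        overhead = guest_latency_ms[oio] - array_latency_ms[oio]
        print(f"{oio} outstanding I/Os: ~{overhead:.1f} ms added outside the array")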

Our experiments show that ESX can easily scale above 100,000 IOPS. We could have gone well beyond that, but this was as far as we could stretch the number of disks we were able to get on short notice, and increasing outstanding I/Os further only increased latency. 100,000 IOPS is a very high I/O throughput to be driven by a single ESX host running several virtual machines, and the I/O latencies remained within acceptable limits, attributable mainly to storage response time.

This study would not have been possible without the help of Bob Ng and Kris Chau, who set up the storage infrastructure on very short notice. When we needed additional disks to drive more I/O, Kathleen Sharp quickly got us the third storage array, which allowed us to complete our experiments. I would like to thank all of them for their support.

Summary

As enterprises move toward virtualized data centers, more and more virtual servers are being deployed on fewer physical systems running ESX. To make that migration smooth, ESX has to be capable of meeting the I/O demands of virtual servers running a wide variety of applications. In this study, we have shown that ESX can easily support 100,000 IOPS for a random access pattern with a mix of reads and writes and an 8KB block size. These high-watermark tests show that the I/O stack in VMware ESX is highly scalable and can be used with confidence in data centers running I/O-heavy workloads in virtual machines.

Configuration Details

Hardware:

Server:

  • 4 Intel Tigerton processors (a total of 16 cores)
  • 32GB of physical memory
  • Two on-board gigabit Ethernet controllers, two Intel gigabit network adapters
  • Two dual-port QLogic 2462 HBAs (4Gbps) and two single-port QLogic 2460 HBAs (4Gbps)

Storage:

  • Three CX3-80 arrays, each with 165 15K RPM Fibre Channel disks
  • Flare OS: 03.26.080.3.086
  • Read cache: 1GB (per storage processor)
  • Write cache: 3GB (per array)

ESX:

  • ESX 3.5 Update 1

Virtual Machines:

  • 1 virtual processor
  • 1GB virtual memory
  • 1 virtual NIC with Intel e1000 driver
  • Guest OS: Windows Server 2003 Enterprise Edition (64-bit) with Service Pack 2

I/O Stress Tool:

  • Iometer version 2006.07.27

Storage Layout

To drive over 100,000 IOPS, all the available disks in the storage systems were used. A total of 100 virtual disks, each 40GB in size, were created and distributed among the virtual machines. These resided on 100GB LUNs: 98 were created on five-disk RAID 0 groups, and 2 were hosted on separate single disks. All LUNs were formatted with the VMFS3 file system. The 4TB of virtual disk space eliminated any read-caching effect in the storage arrays.

A three-disk RAID 0 group was created in one of the storage arrays and a 400GB LUN was created in this RAID group. The LUN was then formatted with the VMFS3 file system. A 6GB virtual disk was created for each virtual machine in this VMFS partition. These virtual disks were used to install the guest operating system for each virtual machine.
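
As a cross-check, the layout above accounts for every one of the 495 available spindles and shows why the working set overwhelms the arrays' read caches. The short calculation below only restates numbers already given in this post.

    # Cross-check of the storage layout against the 495 available disks.
    test_luns_on_raid0 = 98 * 5        # 98 LUNs, each on a five-disk RAID 0 group
    test_luns_on_single_disks = 2 * 1  # 2 LUNs on separate single disks
    os_lun_disks = 3                   # one three-disk RAID 0 group for guest OS disks

    print(test_luns_on_raid0 + test_luns_on_single_disks + os_lun_disks)  # 495

    # Working set vs. array read cache (1GB per storage processor):
    print(100 * 40)  # 4,000GB (~4TB) of test data, far larger than the cache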

Each storage processor on each storage array was connected to one of the six
QLogic HBA ports on the server.

Iometer Setup

We ran the Iometer console on a client machine and Dynamo in each of the
virtual machines. This enabled us to control the I/O workload in each
virtual machine through one console. The outstanding I/Os and I/O
access specifications were identical for each virtual machine for a
given test case.

Tuning for Performance

You might be wondering what parameters we tuned to obtain this kind of performance. The answer will surprise most readers: we tuned only three parameters to obtain 100,000+ IOPS.

  • We increased the VMFS3 max heap size from 16MB to 64MB (KB article # 1004424).
  • We changed the storage processor’s cache high/low watermarks from 80/60 to 40/20. This makes the array flush dirty pages from its write cache more often, so Iometer write operations do not wait for free cache buffers (a simplified sketch of this flushing behavior follows this list).
  • We increased the guest queue length to 100 to make sure that the guest was capable of queuing all the I/O accesses generated by Iometer to the test disks.
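
To make that watermark change concrete: a storage processor's write cache starts flushing dirty pages once occupancy crosses the high watermark and stops when it reaches the low watermark, so lowering the pair from 80/60 to 40/20 makes flushing start earlier and keeps more free buffers available for incoming writes. The sketch below illustrates that general mechanism only; it is not the CLARiiON implementation.

    # Simplified high/low watermark flushing, to illustrate the cache tuning above.
    def accept_write(cache_pct, write_pct, high_wm, low_wm):
        """Add a write worth `write_pct` of the cache, flushing when needed."""
        cache_pct += write_pct
        if cache_pct >= high_wm:
            # Flush dirty pages to disk until occupancy falls to the low watermark.
            cache_pct = low_wm
        return cache_pct

    # With 80/60 watermarks, flushing begins only when the cache is nearly full,
    # so a bursty writer is more likely to stall waiting for free buffers.
    # With 40/20 watermarks, flushing starts much earlier in the burst.
    occupancy = 0.0
    for _ in range(20):
        occupancy = accept_write(occupancy, write_pct=5.0, high_wm=40.0, low_wm=20.0)
    print(f"cache occupancy after the burst: {occupancy:.0f}%")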

25 thoughts on “100,000 I/O Operations Per Second, One ESX Host”

  1. Shawn

    Could someone put those numbers into context?
    For example, what would the I/O per second be for a typical 500 user Exchange server, or any other common server workload.
    Thanks.

  2. chethan

    To put the storage size and throughput into perspective:
    - 77TB of disk space is enough to hold the entire printed Library of Congress
    - 200,000 Microsoft Exchange mailboxes (heavy LoadGen user profile)
    - 150 average 4-way database servers

  3. Chethan

    Correction to my previous comment: number of average 4-way database servers that can generate 100K IOPs is about 85, not 150 as mentioned before.

  4. Todd Muirhead

    Great work! I wouldn’t have thought that you could get that many IOPS from a single ESX Server. How did you guys come up with the idea to do the test?
    I did some testing with Exchange 2007 in VMs and found that storage performance was very similar to what you would see with a physical-server-based Exchange 2007, but it wasn’t anywhere near this level of IOPS.
    Todd

  5. Chad Sakac

    Disclosure: I’m an EMC employee.
    The idea for the test came from VMware – they often need to deal with the “VMware IO doesn’t scale” uncertainty around large-scale apps on VMware (and sometimes competitive FUD) in the field.
    They approached us (EMC) and asked “what do you guys say?”, answer: we’re always game for something fun like this!!!! (BTW – VMware is always fair and even handed – the offer is open to the other storage vendors also)
    Crazy thing is we’re just getting started. Here we saturated our mid-range arrays. We just shipped a high-end array with Enterprise Solid-State disks to the same lab, and are ready for round 2. I bet we’ll eventually hit 200K or even more.
    Now, these are a bit ridiculous numbers, but they make the point – VMware scales to a wide, wide range of workloads and IO profiles (I have yet to find one where done right it doesn’t work).
    You can get great results even with low-cost, low-end configurations. But when you need to scale, VMware is ready for you.

  6. question

    Why do you need to scale the number of VMs in order to scale the number of IOPS? (as shown in the first bar chart)
    If I want to run a single SQL Server database it will have to reside inside a single VM.
    Can you run/post information on the max IOPS I can get from a single VM?

  7. Doug

    In my experience, people want the network, storage and virtualization layer to handle application architecture bottlenecks. Throwing more power at a problem may ‘fix’ the problem, but normally just masks it with brute force.
    I think this experiment does a great job demonstrating that many of the complaints people have regarding I/O performance in VMs are not tied to virtualization overhead in the disk I/O stack.
    To me, an in-depth look at the application architecture is warranted, but that can be a lot of work and I’m not sure there is broad understanding in the app/dev community regarding best practices for coding apps that will run in virtualized environments.

  8. Eric Schoenfeld

    Why is it that VMware is publishing tests on a single ESX host?
    Do VMware customers run a single ESX server?
    Show us the same results in a cluster implementation with 4-5 ESX nodes and VMFS, and while you’re at it, please take your measurements while doing typical administrative things, like creating VMs, extending VMDKs, VMotion, etc. Last time I checked, VMware users do these things. No?

  9. OzzyJohn

    Based on the comment
    “A total of 100 virtual disks, each 40GB in size, were created and distributed among the virtual machines. These were on 100GB LUNs, 98 of which were created on five-disk RAID 0 groups while 2 of them were on LUNs hosted on separate single disks.”
    It would appear that each vmdk used in the performance test was on its own datastore, none of the datastores had any competing traffic from other ESX servers (eg. scsi-reserve/release)
    If I were going to propose such a configuration I’d probably use RDMs.
    Interesting, but hardly what I would call a useful benchmark that enables efficient architecture/design decisions, so it would appear this is more of a marketing than a technical exercise.
    It’s also notable that there was no RAID protection (RAID 0, not RAID 10); that’s more of an array consideration, but again, hardly a good indication of the performance of the array either.
    I wonder what would have happened if the vmdks had been on a smaller number of shared 300-500GB datastores across three or four ESX hosts, which is far more typical.
    Of course one takeaway from this is that in order to get the best performance you shouldn’t put more than one vmdk per VMFS datastore, or have I misinterpreted the results?
    For me, what was and was not tested raises as many questions as answers.

  10. Chad Sakac

    The last few comments, frankly, are right on the money. This exercise was NOT designed to produce a best-practices recommendation, or even to reflect a common customer configuration.
    Common configurations have the characteristics people point out: they are usually ESX clusters (which wouldn’t have a material effect on the outcome), use some level of RAID protection (10, 5/50, 6), and have VMFS volume contention (although there is misunderstanding about when SCSI reservations are actually used – they are used not during most I/O, but during VMFS metadata update operations such as creating/deleting a virtual disk, extending a VMFS volume, or creating/deleting snapshots; this is detailed here: http://blogs.vmware.com/performance/2008/02/scalable-storag.html). In general, RDMs and VMFS have similar performance envelopes given no contention (less to do with SCSI reservations and more to do with backend spindle contention), but agreed that it’s rare to have a 1:1 mapping of VMs to VMFS datastores.
    The purpose wasn’t purely marketing either – it’s a real question that people raise often, and common misperception: “I hear that the ESX I/O subsystem doesn’t scale/perform” – not true.
    This was absolutely an effort to apply unreasonable brute force to show that even under the most extreme conditions, that misconception is just that – a misconception.
    Does this answer more questions about the right way to do it? Yes – but those questions are answered in the Best Practices guides from VMware and the I/O subsystem vendors (trying to be open and fair – I certainly can provide the EMC ones, and I’m sure others have some that are similar), and the best practices for specific large I/O use cases (we publish joint Exchange 2007, SQL Server 2005, Oracle in VI3.5 configuration best practice guides and applied technology guides – we just published one for 16K Exchange 2007 users in 500 user increments as an example).
    There was only one question that this was designed to answer: “how does the I/O subsystem of an ESX server scale, with all other limits removed?”, and we have an answer – beyond 100K IOPS with the given I/O workload (up to the limit of the I/O backend we were able to provide for the test).

  11. Joe

    “We managed to achieve over 100K IOPS before running out of disk bandwidth on the storage arrays. And we still had plenty of headroom to spare on the server running ESX. ”
    Can you comment on the measured CPU utilization of the 16-core server as the IO load was scaled up?

  12. Joe Moore

    I wonder what the latency and aggregate throughput would be if this were a dedicated OS (i.e. running Linux or Windows directly on the bare metal, rather than on top of ESX)
    In other words, what’s the added overhead of the virtualized I/O system?
    –Joe

  13. Kaushik Banerjee

    The added overhead in terms of latency from ESX is extremely small. This is apparent from Figure 4, which shows the latency from the array using monitoring within the array and the latency as seen by PerfMon within the guest OS on ESX. As can be seen, those numbers are pretty close.

  14. Travis Wood

    Where was the guest queue length increased? Was this the Disk.SchedNumReqOutstanding setting or a queue length within Windows?

  15. Chethan

    Guest queue length was increased in the Windows guest (through registry settings) to increase the maximum number of guaranteed concurrent I/Os.

  16. JR

    “Two dual-port QLogic 2462 HBAs (4Gbps) and two single port QLogic 2460 HBAs (4Gbps)”

    The maximum number of hw iSCSI initiators in ESX 3.5 is 2. How did you manage 6??? I’ve found that I can maximize my I/O by alternating initiator preference across high-demand volumes. If IO is spread between 2 or more volumes via a single hw initiator, it runs into the 1Gb/s limit of the HBA. More than 2 initiators would provide greater IO to 3 or more volumes per ESX server and therefore per VM.
    6 initiators would fit nicely on an ESX server hosting a VM that could use those greater IOs per volume. MS SQL Server, for example, would be able to utilize that additional IO for 4 volumes (OS, data, logs, tempdb).
    Is there a way to use more than 2 hw initiators or did I misinterpret the config?

  17. Chethan

    JR,
    You are referring to the maximum number of hardware iSCSI initiators a single ESX host (3.5) can support. We used Fibre Channel HBAs in our experiments (QLogic 2462 and 2460). A single ESX Server (3.5) can have a maximum of 16 HBAs (the 4 HBAs in our experiments are well within this limit).

  18. IT_Architect

    >The purpose wasn’t purely marketing either – it’s a real question that people raise often, and common misperception: “I hear that the ESX I/O subsystem doesn’t scale/perform” – not true.<
    Thank you for that test, and yes LOTS of people run a single ESXi box. I realize the site it is coming from, but I do need something before I start my own testing with a local array.

  19. Mike Ault

    As to whether or not 100,000 IOPS can be reached in the real world, using a standard set of TPC-H queries that simulate a large data warehouse (300GB or larger) you can easily reach 100,000 IOPS if your storage subsystem can support it. Between index, data and temporary space activity in Oracle, for example, you can easily peak at over 100,000 IOPS if your latency will allow it.

  20. Dave Mc

    I’m dealing with a disk IO issue. We are a VM hosting provider with multiple customers, 100+ VMs, 4 ESX 3.5 hosts and a 4GB FC SAN. The problem is we have individual VMs which are monopolizing our IO to the point that the whole storage system is slowed, impacting all customers. This is a result of some of the architecture decisions we made, but I’m hoping to find a solution. I can’t come up with a way to limit an individual VM’s ability to use all the storage IO and create a slowdown system wide.
    We have a FC SAN with two large 30-drive RAID arrays which we then hand to our storage virtualization layer in 2TB LUNs. Our storage virtualization tool then hands LUNs to our ESX cluster with each customer getting their own virtual LUN (each customer may have 1-12 VMs). The virtualization tool can combine storage from different SAN LUNs into new LUNs for the ESX cluster. Some of the virtual LUNs are combined from SAN LUNs and a couple of them cross arrays.
    Solution needed: a method to limit the IO/throughput of an individual VM so that it doesn’t use all/most of the IO available. The traditional “shares” solution won’t work because it only limits the proportion of IO that the VM gets within its resource pool. It doesn’t limit the total IO that a VM can use. Therefore, if my storage is capable of 7000 IOPS, the VM will still push the storage to 7000 IOPS; it might just have to share some time with other VMs in the pool. Another alternative is to put these problem VMs on dedicated spindles, but this is expensive and not very scalable (it is also reactionary).
    Would the new adaptive queue depth algorithm do me any good (I don’t have 3Par)? If I understand the article, this would slow down all the IO on one host, thereby allowing the other hosts to have more IO, but it doesn’t allow me to get granular to a VM level. Therefore, I don’t think it is the solution I need. http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1008113
    Other suggestions?

  21. Dpironet

    @Dave Mc
    You could tweak down the VMs monopolizing storage I/O by setting a registry key that will decrease the queue depth.
    i.e. for LSI FC add the following reg key:
    HKLM\SYSTEM\CurrentControlSet\Services\Lsi_fc\Parameters\Device\ DWORD ,0x40 (64 decimal)
    64 is the default, test with 32 for instance…
    DWORD is a common one but not recognized by all vendors…
    Rgds,
    Didier

  22. Dpironet

    Oops, looks like this blog doesn’t like brackets…
    @Dave Mc
    You could tweak down the VMs monopolizing storage I/O by setting a registry key that will decrease the queue depth.
    i.e. for LSI FC add the following reg key:
    HKLM\SYSTEM\CurrentControlSet\Services\Lsi_fc\Parameters\Device\MaximumTargetQueueDepth ,0x40 (64 decimal)
    64 is the default, test with 32 for instance…
    /MAXTAGS=nnn added to Driver Parameters is a common one but not recognized by all vendors…
    Rgds,
    Didier

  23. emax

    “To drive over 100,000 IOPS all the available disks in the storage systems were used. A total of 100 virtual disks, each 40GB in size, were created and distributed among the virtual machines. These were on 100GB LUNs, 98 of which were created on five-disk RAID 0 groups while 2 of them were on LUNs hosted on separate single disks”
    What significance do the 2 separate disks have? Were they just used separately because they were left over from the 5-disk RAID groups?
    How many virtual disks did each VM receive? The Iometer test was run ONLY on the 40GB virtual disks, correct? Not the OS virtual disks?
    Please advise. I am looking to setting up my own similar test in my lab. thank you in advance.

  24. ckumar

    Emax,
    >> What significance do the 2 separate disks have? Were they just used separately because they were left over from the 5-disk RAID groups?
    They couldn’t be put in a RAID group as there were not enough disks remaining, so they were used as standalone disks.
    >> How many virtual disks did each VM receive? The Iometer test was run ONLY on the 40GB virtual disks, correct? Not the OS virtual disks?
    Each VM had an equal number of disks, with 2 VMs having 1 extra virtual disk. The Iometer test was run ONLY on the 40GB virtual disks.
    I would suggest you look at more recent experiments conducted with vSphere and discussed in this paper:
    http://www.vmware.com/resources/techresources/10054

  25. Pingback: Ảo Hoá » Blog Archive » ESXi v.s Hyper-V
