Scalable Storage Performance with VMware ESX Server 3.5

We at VMware often get questions about how aggressively physical systems can be consolidated. Scalability on heavily consolidated systems is not just a nice feature of VMware ESX Server but a requirement for supporting demanding applications in modern datacenters. With the launch of VI3 with ESX Server 3.5, we’ve further improved the efficiency of our storage system. For non-clustered environments, we’ve already shown in this comparison paper that our system overheads are negligible compared to physical devices. In this article we’d like to cover the scalable performance of VMFS, our clustered file system.

ESX Server enables multiple hosts to reliably share the same physical storage through its highly optimized storage stack and the VMFS file system. There are many benefits to a shared storage infrastructure, such as consolidation and live migration, but people commonly wonder about performance. While it is always desirable to squeeze the most performance out of the storage system, care should be taken not to severely over-commit the available resources, which can lead to performance degradation. Specifically, the primary factors that affect the shared storage performance of an ESX Server cluster are as follows:

1. The number of outstanding SCSI commands going to a shared LUN

SCSI allows multiple commands to be active on a link, and SCSI drivers support a configurable parameter called “queue depth” to control this. The maximum supported value is most commonly 256. For an I/O group (ESX Server(s) – LUN), it is important that the number of active SCSI commands does not exceed this value; otherwise the excess commands are queued. Excessive queuing leads to increased latencies and potentially a drop in throughput. The number of commands queued per ESX Server host can be derived using the esxtop command.
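To make the arithmetic concrete, here is a minimal sketch that adds up the outstanding commands from every host sharing a LUN and reports how many would sit in a queue. The function name and the simple "everything above the limit is queued" model are assumptions made for illustration; they are not how the ESX Server storage stack is implemented.

```python
def lun_queue_pressure(outstanding_per_host, lun_queue_depth=256):
    """Return (active, queued) command counts for one shared LUN.

    outstanding_per_host: outstanding SCSI commands issued by each host.
    lun_queue_depth: assumed maximum number of commands the LUN keeps active
                     (256 is the commonly supported value cited above).
    """
    total = sum(outstanding_per_host)
    active = min(total, lun_queue_depth)
    queued = total - active  # commands that must wait for a free slot
    return active, queued


# Hypothetical cluster: 12 hosts, each with 32 outstanding commands.
hosts = [32] * 12
active, queued = lun_queue_pressure(hosts)
print(f"{len(hosts)} hosts: {active} active, {queued} queued")
```

In practice the per-host numbers would come from observation (for example, from esxtop) rather than from a fixed list.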

2. SCSI reservations

VMFS is a clustered file system and uses SCSI reservations to implement on-disk locks. Administrative operations, such as creating/deleting a virtual disk, extending a VMFS volume, or creating/deleting snapshots, result in metadata updates to the file system using locks, and hence result in SCSI reservations. A reservation causes the LUN to be available exclusively to a single ESX Server host for a brief period of time. It is therefore preferable that administrators perform the above-mentioned operations during off-peak hours, especially if there will be many of them.

3. Storage device capabilities

The capabilities of the storage array play a role in how well performance scales with multiple ESX Servers. The capabilities include the maximum LUN queue depth, the cache size, the number of sequential streams, and other vendor-specific enhancements. Our results have shown that most modern Fibre Channel storage arrays have enough capacity to provide good performance in an ESX Server cluster.

We’re glad to share with you some results from our storage scalability experiments. Our hardware setup includes 64 blades running VMware ESX Server 3.5, connected to a storage array via 2 Gbps Fibre Channel links. All hosts share a single VMFS volume, and virtual machines running IOmeter generate a heavy I/O load to that one volume. The queue depth for the Fibre Channel HBA is set to 32 on each ESX Server host, which matches the total number of outstanding commands that the virtual machines on a single host are configured to generate. We measure two things:

 Aggregate Throughput – the sum of the throughput across all virtual machines on all hosts

 Average Latency – the end-to-end average delay per command as seen by any virtual machine in the cluster
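As a rough illustration of how these two metrics could be computed from per-VM measurements, consider the sketch below. The `VMSample` record, its field names, and the sample values are hypothetical, and weighting latency by each VM's command rate is an assumption about what "average delay per command" means; the original test harness is not described here.

```python
from dataclasses import dataclass


@dataclass
class VMSample:
    host: str
    throughput_mb_s: float  # throughput of this VM in MB/s
    iops: float             # commands completed per second by this VM
    avg_latency_ms: float   # mean per-command latency seen by this VM


def aggregate_throughput(samples):
    """Sum of throughput across all VMs on all hosts."""
    return sum(s.throughput_mb_s for s in samples)


def average_latency(samples):
    """Per-command average latency, weighting each VM by its command rate."""
    total_iops = sum(s.iops for s in samples)
    if total_iops == 0:
        return 0.0
    return sum(s.avg_latency_ms * s.iops for s in samples) / total_iops


# Made-up sample values, for illustration only.
samples = [
    VMSample("host1", 40.0, 1000.0, 8.0),
    VMSample("host2", 20.0, 500.0, 14.0),
]
print(aggregate_throughput(samples), average_latency(samples))
```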



[Figure 1: aggregate throughput versus number of ESX Server hosts]

It is clear from Figure 1 that except for sequential read there is no drop in aggregate throughput as we scale the number of hosts. The reason sequential read drops is that the sequential streams coming in from different ESX Server hosts are no longer sequential when intermixed at the storage array, and thus become random. Writes generally do better than reads because they are absorbed by the write cache and flushed to disks in the background.
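The loss of sequentiality can be seen in a toy model. The sketch below assumes each host reads consecutive blocks from its own region of the LUN and that the array receives requests from the hosts in strict round-robin order; both are simplifying assumptions, and real arrays may detect and separate several sequential streams. It then measures what fraction of arriving requests immediately follow the previous one on disk.

```python
from itertools import chain


def interleave_streams(num_hosts, requests_per_host, region_blocks=1_000_000):
    """Round-robin merge of per-host sequential block streams.

    Each host reads consecutive blocks starting at its own region offset.
    requests_per_host is assumed to be smaller than region_blocks.
    """
    streams = [
        [host * region_blocks + i for i in range(requests_per_host)]
        for host in range(num_hosts)
    ]
    # zip(*streams) takes one request from every host per round.
    return list(chain.from_iterable(zip(*streams)))


def sequential_fraction(lbas):
    """Fraction of requests whose block address directly follows the previous one."""
    if len(lbas) < 2:
        return 1.0
    hits = sum(1 for prev, cur in zip(lbas, lbas[1:]) if cur == prev + 1)
    return hits / (len(lbas) - 1)


for n in (1, 2, 8):
    print(n, "host(s):", sequential_fraction(interleave_streams(n, 100)))
```

With a single host every request follows its predecessor; as soon as a second host is interleaved, none do in this model, which is why the array sees the combined workload as random.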


[Figure 2: average latency versus number of ESX Server hosts]

Figure 2 illustrates the effect of commands from all ESX Server hosts reaching the shared LUN on the storage array. Each ESX Server host generates 32 commands, hence at eight hosts we have reached the recommended maximum per LUN of 256. Beyond this point, latencies climb upwards of 100 msec, and could affect applications that are sensitive to latencies, although there is no drop in aggregate throughput.
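Rising latency with flat throughput is what Little's Law predicts for a system in steady state: average latency equals the number of outstanding commands divided by the completion rate. The sketch below applies that relationship. The aggregate IOPS figure is a made-up placeholder, not a measurement from these experiments, so only the trend is meaningful.

```python
def average_latency_ms(outstanding_commands, aggregate_iops):
    """Little's Law: latency = outstanding commands / completion rate."""
    if aggregate_iops <= 0:
        raise ValueError("aggregate_iops must be positive")
    return outstanding_commands / aggregate_iops * 1000.0


COMMANDS_PER_HOST = 32   # per-host outstanding commands, as in the setup above
ASSUMED_IOPS = 5000.0    # hypothetical constant array throughput

for hosts in (1, 2, 4, 8, 16, 32, 64):
    outstanding = hosts * COMMANDS_PER_HOST
    latency = average_latency_ms(outstanding, ASSUMED_IOPS)
    print(f"{hosts:2d} hosts, {outstanding:4d} outstanding: {latency:.1f} ms")
```

If the array's throughput stays constant, doubling the number of hosts doubles the outstanding commands and therefore doubles the average latency each command sees.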

These experiments represent a specific configuration with an aggressive I/O rate. Virtual machines deployed in typical customer environments may not have as high a rate and therefore may be able to scale further. In general, because of varying block sizes, access patterns, and number of outstanding commands, the results you see in your VMware environment will depend on the types of applications running. The results will also depend on the capabilities of your storage and whether it is tuned for the block sizes in your application. Also, processing very small commands adds some compute overhead in any system, be it virtualized or otherwise. Overall, the ESX Server storage stack is well tuned to run a majority of applications. If you are using iSCSI or NFS, this comparison paper nicely outlines how ESX Server can efficiently make use of the full Ethernet link speed for most block sizes.

We’re always pleased to show the scalability of VMware Infrastructure 3, and the file system that supports the VI3 features is a good example. Look for more details on storage and VMFS performance in the form of whitepapers and presentations from VMware and its partners in the coming weeks.


5 comments have been added so far

  1. It would be interesting to see the VM counts per host for this data, and the impact on the increase in number of VMs per volume, per host, etc.

  2. Regarding your statement “The maximum supported value is most commonly 256. For an I/O group (ESX Server(s) – LUN), it is important that the number of active SCSI commands does not exceed this value”: what does it mean in terms of VMDKs? How many VMDKs should I place on a LUN so that this value is not exceeded? My understanding of queue depth is that I can change it at the host level to match that of the frame and still push that many I/Os from the host. For example, if the queue depth of the frame is around 1000 and the host is set to around 540, you can get into potential problems. I can see some SCSI aborts in my vmkernel logs. To fix this, I can change the queue depth at the host, and that takes care of the error.
    I would also like to know what the block size of the VMDK was and whether it was aligned during your test.

  3. As far as I can tell, the topics discussed here are specific to VMFS filesystems.
    I wonder how this relates to NFS, where things like Queue Depth do not apply.

  4. Luke – interesting question.
    Disclosure – I’m an EMC employee.
    I can’t speak for this test (which was indeed focused on VMFS which by definition is block).
    What I can say is that we did find even with NFS that there was some scaling with multiple mounts. While the concept of queue depth at the ESX server doesn’t apply (it applies at the NFS server, like an EMC Celerra or a NetApp FAS platform) – there are analogous ESX host parameters. Having multiple NFS mounts to multiple NFS export IP aliases increases the number of TCP sessions.
    There are also a bunch of important NFS export options –
    If this is a topic that needs more info – I’m happy to write a post on it here – http://virtualgeek.typepad.com
