We at VMware often get questions about how aggressively physical systems can be consolidated. Scalability on heavily consolidated systems is not just a nice feature of VMware ESX Server; it is a requirement for supporting demanding applications in modern datacenters. With the launch of VI3 with ESX Server 3.5, we’ve further improved the efficiency of our storage system. For non-clustered environments, we’ve already shown in this comparison paper that our storage overheads are negligible compared to physical systems. In this article we’d like to cover the scalable performance of VMFS, our clustered file system.
ESX Server enables multiple hosts to reliably share the same physical storage through its highly optimized storage stack and the VMFS file system. A shared storage infrastructure has many benefits, such as consolidation and live migration, but people commonly wonder about its performance. While it is always desirable to squeeze the most performance out of the storage system, care should be taken not to severely overcommit the available resources, since doing so degrades performance. Specifically, the primary factors that affect the shared storage performance of an ESX Server cluster are as follows:
1. The number of outstanding SCSI commands going to a shared LUN
SCSI allows multiple commands to be active on a link, and SCSI drivers expose a configurable parameter called “queue depth” to control how many. The maximum supported value per LUN is most commonly 256. For a given I/O group (one or more ESX Server hosts issuing commands to a shared LUN), it is important that the number of active SCSI commands stays below this value; otherwise the commands get queued, and excessive queuing leads to increased latencies and potentially a drop in throughput. The number of commands queued per ESX Server host can be observed with the esxtop command. A back-of-envelope sketch of this arithmetic follows the list below.
2. SCSI reservations
VMFS is a clustered file system and uses SCSI reservations to implement its on-disk locks. Administrative operations, such as creating or deleting a virtual disk, extending a VMFS volume, or creating or deleting snapshots, require metadata updates to the file system, which are protected by these locks and hence trigger SCSI reservations. A reservation makes the LUN available exclusively to a single ESX Server host for a brief period of time. It is therefore preferable that administrators perform the above operations during off-peak hours, especially when many of them are needed.
3. Storage device capabilities
The capabilities of the storage array play a role in how well performance scales with multiple ESX Server hosts. These capabilities include the maximum LUN queue depth, the cache size, the number of concurrent sequential streams the array can handle, and other vendor-specific enhancements. Our results have shown that most modern Fibre Channel storage arrays have enough capacity to provide good performance in an ESX Server cluster.
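To make the queue-depth arithmetic from point 1 concrete, here is a minimal sketch in Python. The constants are illustrative assumptions (a 256-command per-LUN maximum and a per-host HBA queue depth of 32, matching our experiments below), not values read from any particular array:

# Minimal sketch: estimate whether a shared LUN's queue is oversubscribed.
# The constants below are illustrative assumptions, not values from a
# specific storage array.

LUN_MAX_OUTSTANDING = 256   # common per-LUN maximum supported by arrays
PER_HOST_QUEUE_DEPTH = 32   # HBA queue depth configured on each ESX Server host

def outstanding_commands(num_hosts, per_host=PER_HOST_QUEUE_DEPTH):
    """Worst-case number of active commands hitting the shared LUN."""
    return num_hosts * per_host

for hosts in (4, 8, 16, 32, 64):
    active = outstanding_commands(hosts)
    status = "OK" if active <= LUN_MAX_OUTSTANDING else "commands queue at the array"
    print(f"{hosts:2d} hosts -> {active:4d} outstanding commands: {status}")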
We’re glad to share some results from our storage scalability experiments. Our hardware setup consists of 64 blades running VMware ESX Server 3.5, connected to a storage array via 2Gbps Fibre Channel links. All hosts share a single VMFS volume, and virtual machines running IOmeter generate a heavy I/O load against that one volume. The queue depth for the Fibre Channel HBA is set to 32 on each ESX Server host, which matches the total number of outstanding commands that the virtual machines on a single host are configured to generate. We measure two things:
• Aggregate Throughput – the sum of the throughput across all virtual machines on all hosts
• Average Latency – the end-to-end average delay per command as seen by any virtual machine in the cluster
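As a minimal sketch of how these two metrics are derived, the snippet below aggregates per-VM readings; the numbers are hypothetical sample data standing in for IOmeter results collected on each host:

# Sketch of deriving the two cluster-wide metrics from per-VM results.
# The per_vm list is made-up sample data for illustration only.

per_vm = [
    # (throughput in MB/s, average latency in ms) reported by one VM
    (42.0, 12.5),
    (39.5, 13.1),
    (41.2, 12.8),
]

aggregate_throughput = sum(mbps for mbps, _ in per_vm)          # sum across all VMs
average_latency = sum(lat for _, lat in per_vm) / len(per_vm)   # unweighted mean;
# a command-count-weighted mean would be more precise if VMs issue unequal loads

print(f"Aggregate throughput: {aggregate_throughput:.1f} MB/s")
print(f"Average latency:      {average_latency:.1f} ms")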
Figure 1
It is clear from Figure 1 that, except for sequential read, there is no drop in aggregate throughput as we scale the number of hosts. Sequential read drops because the sequential streams coming in from different ESX Server hosts are no longer sequential when intermixed at the storage array, and thus become effectively random. Writes generally do better than reads because they are absorbed by the array’s write cache and flushed to disk in the background.
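The toy example below illustrates that intermixing effect; the host names and block addresses are, of course, made up:

# Toy illustration of why per-host sequential streams look random at the
# array: each host reads consecutive blocks of its own virtual disk, but
# the array sees the interleaved stream. All addresses are fabricated.

host_streams = {
    "host1": [1000, 1001, 1002, 1003],   # sequential within this host
    "host2": [5000, 5001, 5002, 5003],
    "host3": [9000, 9001, 9002, 9003],
}

# Round-robin interleave, a crude stand-in for how requests arrive at the LUN.
arrived = [blocks[i] for i in range(4) for blocks in host_streams.values()]
print(arrived)
# [1000, 5000, 9000, 1001, 5001, 9001, ...] -- no longer sequential, so the
# array's sequential-read optimizations (prefetching, streaming) stop helping.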
Figure 2 illustrates the effect of the commands from all ESX Server hosts reaching the shared LUN on the storage array. Each ESX Server host generates 32 commands, so at eight hosts we reach the recommended maximum of 256 per LUN. Beyond this point latencies climb upwards of 100 msec, which could affect latency-sensitive applications, although there is no drop in aggregate throughput.
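This behavior follows from Little’s law: once the array saturates, throughput stays flat, so latency grows linearly with the number of outstanding commands. Here is a back-of-envelope sketch; the saturation IOPS figure is an assumption chosen only to show the shape of the curve, not a measurement from our array:

# Back-of-envelope sketch using Little's law: outstanding = IOPS * latency.
# Once the array caps IOPS at its saturation point, each extra outstanding
# command only adds latency. SATURATION_IOPS is an assumed figure.

SATURATION_IOPS = 20_000          # assumed array limit for this workload
PER_HOST_QUEUE_DEPTH = 32

for hosts in (2, 4, 8, 16, 32, 64):
    outstanding = hosts * PER_HOST_QUEUE_DEPTH
    latency_ms = outstanding / SATURATION_IOPS * 1000   # Little's law, rearranged
    print(f"{hosts:2d} hosts: {outstanding:4d} outstanding -> ~{latency_ms:5.1f} ms per command")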
These experiments represent a specific configuration with an aggressive I/O rate. Virtual machines deployed in typical customer environments may not generate I/O at such a high rate and therefore may be able to scale further. In general, because of varying block sizes, access patterns, and numbers of outstanding commands, the results you see in your VMware environment will depend on the types of applications you are running. They will also depend on the capabilities of your storage array and whether it is tuned for your application’s block sizes. Also, processing very small commands adds some compute overhead in any system, virtualized or otherwise. Overall, the ESX Server storage stack is well tuned to run the majority of applications. If you are using iSCSI or NFS, this comparison paper nicely outlines how ESX Server can efficiently make use of the full Ethernet link speed for most block sizes.

We’re always pleased to show the scalability of VMware Infrastructure 3, and the file system that supports the VI3 features is a good example. Look for more details on storage and VMFS performance in whitepapers and presentations from VMware and its partners in the coming weeks.