Scaling Performance for VAIO in vSphere 6.0 U1

by Chien-Chia Chen

vSphere APIs for I/O Filtering (VAIO) is a framework that enables third-party software developers to implement data services, such as caching and replication, to vSphere. Figure 1 below shows the general architecture of VAIO. Once I/O filter libraries are installed to a virtual disk (VMDK), every I/O request generated from the guest operating system to the VMDK will first be intercepted by the VAIO framework at the file device layer. The VAIO framework then hands over the I/O request to the user space I/O filter libraries, where a series of third party data service operations can be performed against the I/O. After processing the I/O, user space I/O filter libraries return the I/O back to the VAIO framework, which continues the rest of the issuing path. Similarly, upon completion, the I/O will first be processed by the user space I/O filter libraries before continuing its original completion path.

There have been questions around the overhead of the VAIO framework due to its extra user-to-kernel communication. In this blog post, we evaluate the performance of vSphere APIs for I/O Filtering using a null I/O filter and demonstrate how VAIO scales with respect to the number of virtual machines and outstanding I/Os (OIOs). The null I/O filter accepts each I/O request and immediately returns it.

Figure 1. vSphere APIs for I/O Filtering Architecture

System Configuration

The configuration of our systems is as follows:

One ESXi host
- Machine: Dell R720 server running vSphere 6.0 Update 1
- CPU: 16-core, 2-socket (32 hyper-threads) Intel® Xeon® E5-2665 @ 2.4 GHz
- Memory: 128GB memory
- Physical Disk: One Intel® S3700 400GB SATA SSD on LSI MegaRAID SAS controller
- VM: Up to 32 link-cloned I/O Analyzer 1.6.2 VMs (SUSE Linux Enterprise 11 SP2; 1 virtual CPU (VCPU) and 1GB memory each). Each virtual machine has 1 PVSCSI controller hosting two 1GB VMDKs—one has no I/O filter and another has a null filter, both think-provisioned.
Workload: Iometer 4K sequential read (4K-aligned) with various number of OIOs

Methodology

We conduct two sets of tests separately—one against VMDK without an I/O filter (referred to as “default”) and another against the null-filter VMDK (referred to as “iofilter”). In each set of tests, every virtual machine has one Iometer disk worker to generate 4K sequential read I/Os to the VMDK under test. We have a 2-minute warm-up time and measure I/Os per second (IOPS), normalized CPU cost, and read latency over the next 2-minute test duration. The latency is the median of the average read latencies reported by all Iometer workers.

Note that I/O sizes and access patterns do not affect the performance of VAIO since it does no additional data copying, maintains the original access patterns, and incurs no extra access to physical disks.

Results

VM Scaling

Figures 2 and 3 below show the IOPS, CPU cost per 1K IOPS, and latency with a different number of virtual machines at 128 OIOs. Except for the single virtual machine test, results show that VAIO achieves similar IOPS and has similar latency compared to the default VMDK. However, VAIO introduces 10%-20% higher CPU overhead per 1K IOPS. The single virtual machine IOPS with iofilter is 80% higher than the default VMDK. This is because, in the default case, the VCPU performs the majority of synchronous I/O work; whereas, in the iofilter case, VAIO contexts take over a big portion of the work and unblock the VCPU from generating more I/Os. With additional VCPUs and Iometer disk workers to mitigate the single core bottleneck, the default VMDK is also able to drive over 70K IOPS.

Figure 2. IOPS and CPU Cost vs. Number of VMs (128 Outstanding I/Os)

Figure 3. Iometer Read Latency vs. Number of VMs (128 Outstanding I/Os)

OIO Scaling

Figures 4 and 5 below show the IOPS, CPU cost per 1K IOPS, and latency with a different number of OIOs at 16 virtual machines. A similar trend again holds that VAIO achieves the same IOPS and has the same latency compared to the default VMDK while it incurs 10%-20% higher CPU overhead per 1K IOPS.

Figure 4. Percent of a Core per 1 Thousand IOPS vs. Outstanding I/Os (16 VMs)

Figure 5. Iometer Read Latency vs. Outstanding I/Os (16 VMs)

Conclusion

Based on our evaluation, VAIO achieves comparable throughput and latency performance at a cost of 10%-20% more CPU cycles. From our experience, when using the VAIO framework, we recommend the following general best practices:

Reduce CPU over-commitment. The VAIO framework introduces at least one additional context per VMDK with an active filter. Over-committing CPU can result in intensive CPU contention, thus much worse virtualization efficiency.

Avoid blocking when developing I/O filter libraries. Keep in mind that an I/O will be blocked until the user space I/O filter finishes processing. Thus additional processing time will result in higher end-to-end latency.

Increase concurrency wisely when developing I/O filter libraries. The user space I/O filter can potentially serve I/Os from all VMDKs. Thus, when developing I/O filter libraries, it is important to be flexible in terms of concurrency to avoid a single core CPU bottleneck and meanwhile without introducing too many unnecessary active contexts that cause higher CPU contention.