Home > Blogs > VMware VROOM! Blog > Tag Archives: benchmarking

Tag Archives: benchmarking

How to correctly test the performance of Virtual SAN 6.2 deduplication feature

In VMware Virtual SAN 6.2, we introduced several features highly requested by customers, such as deduplication and compression. An overview of this feature can be found in the blog: Virtual SAN 6.2 – Deduplication And Compression Deep Dive.

The deduplication feature adds the most benefit to an all-flash Virtual SAN environment because, while SSDs are more expensive than spinning disks, the cost is amortized because more workloads can fit on the smaller SSDs. Therefore, our performance testing is performed on an all-flash Virtual SAN cluster with deduplication enabled.

When testing the performance of the deduplication feature for Virtual SAN, we observed the following:

  • Unexpected deduplication ratio
  • High device read latency in the capacity tier, even though the SSD is perfectly fine

In this blog, we discuss the reason behind these two issues and share our testing experience.

When we tested the performance of Virtual SAN, we decided to use two common tools, IOBlazer and Iometer:

  1. We used IOBlazer to populate the disks. We configured IOBlazer to run 100% large sequential writes. This was to make sure all the blocks were allocated before testing any read-related workload. Some people prefer to zero out all the blocks using the dd command, which has a similar effect.
  2. We then ran an Iometer workload. We set the read percentage, randomness, I/O size, number of outstanding I/Os, and so on.

We found, however, that there were two issues with the above procedure when testing the deduplication feature:

  • Iometer did not support configuring I/O content. In other words, we could not use Iometer to generate I/Os with various deduplication ratios in step 2.
  • We should not have populated the disks using IOBlazer or dd in step 1 because each utility pollutes the disks with random data or zeros, both of which yielded the wrong deduplication ratio for later tests.

To address these issues, we decided to use the Flexible I/O (FIO) benchmark to both populate the disks and run the tests. FIO allowed us to specify the deduplication ratio. By following these steps, we were able to successfully test the deduplication feature in Virtual SAN 6.2:

  1. Run FIO with 100% 4KB sequential write with the given deduplication and compression ratio. This will populate the disks with the desired deduplication and compression ratio.
  2. Run FIO with the specified read/write percentage, I/O size, randomness, number of outstanding I/Os, and deduplication and compression ratio.

Below is a sample configuration file for FIO. We modified the parameters for different tests.

ioengine=libaio; async I/O engine for Linux
thread ; use thread rather than process
; Test name: 4K_rd70_rand100_dedup0_compr0

[job 1]
[job 2]

If steps 1 and 2 were not performed properly, the results could be unexpected. To further illustrate that, we take two issues we encountered as examples.

Issue #1: The SSD showed high read latency, but the SSD hardware had no issues

We observed a high device read latency issue with FIO micro-benchmarks. The high read latency occurred because we were issuing a large amount of concurrent I/O (outstanding I/O, also known as OIO) to the same Logic Block Address/LBA (or a small range of LBAs) on the SSD. This is more likely to happen with any type of deduplication solution, regardless of the storage vendor.

To resolve this issue, we first performed a test to learn the behavior of the SSD device. Below shows the read latency to one address with an increasing amount of outstanding I/Os.

4KB read from the same LBA:
1 OIO Latency: 0.12 ms
16 OIOs Latency: 1.51 ms
32 OIOs Latency: 3.06 ms
64 OIOs Latency: 6.07 ms
128 OIOs Latency: 12.08 ms
256 OIOs Latency: 12.68 ms

When we issued multiple OIOs to a single 4KB block, those I/Os were serialized to one single channel inside the SSD device that was connected to that offset. In other words, we lost the benefits of the SSD’s internal parallelism (from multiple channels). The device latency rose as we increased the number of OIOs. High OIO to the same LBA (or a smaller range of LBAs) caused high device read latency.

In the extreme case where we prepared the disk by zeroing out all the blocks, all the data was deduplicated to one block. As a result, the upcoming read I/O was issued to the same device address, which caused high device read latency as discussed above.

Figures 1 and 2 (stats from Virtual SAN observer tool) show sample results from our test. (Even though the screenshots show HDDs, our test was on an all-flash Virtual SAN cluster. The HDDs in the graph actually mean SSDs used as capacity tier devices.) As can be seen, inside one disk group, one capacity tier SSD (naa.55cd2e404ba2ce71 in Figure 1 or naa.55cd2e404ba535b7 in Figure 2) always shows higher than usual read latency (than in the other capacity tier SSDs). This is because we zeroed out the data blocks before running the test. Later in the test, a large amount of outstanding read I/Os were issued to a single address on that SSD.

Note: In Figures 1 and 2, where there are no units specified, the unit is milliseconds. Where an “m” is specified, the unit is microseconds. Where “k” is specified, the unit is thousands.


Figure 1. Sample test 1 showed the first capacity-tier SSD (“HDD”)  to have up to 3 milliseconds latency, which is much higher than the next two capacity-tier SSDs (also labelled “HDD”), which show just below 150 microseconds of latency.


Figure 2. Sample test 2 is similar to sample test 1. The bottom capacity-tier SSD (“HDD”) shows up to 6 milliseconds of latency, whereas the first two show slightly over 100 microseconds.

Issue #2: Deduplication ratio was not what we set

Because Virtual SAN distributes multiple virtual disks across its datastore, it is hard to determine the exact deduplication ratio of the data that the workload generated. In the FIO configuration file, we changed the dedupe_percentage to a desired value. However, in the testing system, there were a couple factors that affected the actual deduplication ratio reported by Virtual SAN.

  • I/Os from other virtual disks (vmdk files) can have the duplicated data. In the FIO configuration file, if the randrepeat parameter is set to 1, FIO will use the same random seed for all the disks. Although the data pattern obeys the dedupe_percentage set by the user for each vmdk, there will be high deduplicated data across vmdks. Note that those vmdks are placed on the same Virtual SAN datastore, which means that datastore will see more duplicated data than specified.
  • I/O size when preparing disks will affect the deduplication ratio. Currently, Virtual SAN uses 4KB chunk size as the unit to calculate the deduplication ratio. If the user uses non 4KB IO size to prepare the disk, Virtual SAN could see a different deduplication ratio. Meanwhile, if the IO is not aligned to 4KB (blockalign parameter), Virtual SAN could also observe a different deduplication ratio.

Figure 3 (below) shows a sample test in which we ran FIO with a 0% deduplication ratio. Due to the issues described above, Virtual SAN erroneously reports about an 80% deduplication ratio (shown in blue).


Figure 3. The blue line shows a deduplication percentage of about 80%, even though we set deduplication to be 0%.

To avoid these problems, we suggest performance testers use a 4KB I/O size (and aligned) and set randrepeat to 0 to prepare the disk in order to get the desired deduplication ratio. Note that Virtual SAN can properly handle any type of I/O configuration. The purpose of this blog is to explain the possible discrepancy between the FIO-specified dedupe_percentage and the Virtual SAN reported deduplication ratio if performance testers use different I/O configurations to evaluate the Virtual SAN datastore.


Figure 4. No blue line is shown, indicating the correct deduplication of 0%.



Fault Tolerance Performance in vSphere 6

VMware has published a technical white paper about vSphere 6 Fault Tolerance architecture and performance. The paper describes which types of applications work best in virtual machines with vSphere FT enabled.

VMware vSphere Fault Tolerance (FT) provides continuous availability to virtual machines that require a high amount of uptime. If the virtual machine fails, another virtual machine is ready to take over the job.  vSphere achieves FT by maintaining primary and secondary virtual machines using a new technology named Fast Checkpointing. This technology is similar to Storage vMotion, which copies the virtual machine state (storage, memory, and networking) to the secondary ESXi host. Fast Checkpointing keeps the primary and secondary virtual machines in sync.

vSphere FT works with (and requires) vSphere HA—when an administrator enables FT, vSphere HA selects the secondary VM (admins can vMotion the VM to another server if needed). vSphere HA also creates a new secondary if the primary fails—the original secondary becomes the new primary, and vSphere HA selects an available virtual machine to use as the new secondary.

vSphere 6 FT supports applications with up to 4 vCPUs and 64GB memory on the ESXi host. The performance study shows results for various workloads run on virtual machines with 1, 2, and 4 vCPUs.

The workloads—which tax the virtual machine’s CPU, disk, and network—include:

  • Kernel compile – loads the CPU at 100%
  • Netperf-  measures network throughput and latency
  • Iometer- characterizes the storage I/O of a Microsoft Windows virtual machine
  • Swingbench- drives an OLTP load on a virtual machine running Oracle 11g
  • DVD Store –  drives an OLTP load on a virtual machine running Microsoft SQL Server 2012
  • A brokerage workload – simulates an OLTP load of a brokerage firm
  • vCenterServer workload – simulates actions performed in vCenter Server

Testing shows that vSphere FT can successfully protect a number of workloads like CPU-bound workloads, I/O-bound workloads, servers, and complex database workloads; however, admins should not use vSphere FT to protect highly latency-sensitive applications like voice-over-IP (VOIP) or high-frequency trading (HFT).

For the results of these tests, read the paper. Also useful is the VMware Fault Tolerance FAQ.

Virtualizing Performance-Critical Database Applications in VMware vSphere 6.0

by Priti Mishra

Performance studies have previously shown that there is no doubt virtualized servers can run a variety of applications near, or in some cases even above, that of software running natively (on bare metal). In a new white paper, we raise the bar higher and test “monster” vSphere virtual machines loaded with CPU and running the most taxing databases and transaction processing applications.

The benchmark workload, which we call Order-Entry, is based on an industry-standard online transaction processing (OLTP) benchmark called TPC-C. Both rigorous and demanding, the Order-Entry workload pushes virtual machine performance.

Note: The Order Entry benchmark is derived from the TPC-C workload, but is not compliant with the TPC-C specification, and its results are not comparable to TPC-C results.

The white paper quantifies the:

  • Performance differential between ESXi 6.0 and native
  • Performance differential between ESXi 6.0 and ESXi 5.1
  • Performance gains due to enhancements built into ESXi 6.0

Results from these experiments show that even the most demanding applications can be run, with excellent performance, in a virtualized environment with ESXi 6.0.  For example, our test results show that ESXi 6.0 virtual machines run out of the box at 90% of the performance of native systems. In addition, a 64-vCPU, 475GB VM processes 59.5K DBMS transactions per second while issuing 155K IOPS, capabilities well above even the high-end Oracle database installations. Even for applications that may require 64 or 128 vCPUs, the high-end performance boost of ESXi 6.0 over ESXi 5.1 makes ESXi 6.0 the best platform for virtualizing databases such as Oracle.

ESXi 6.0 Performance Relative to Native

With a 64-vCPU VM running on a 72-pCPU ESXi host, throughput was 90% of native throughput on the same hardware platform. Statistics which give an indication of the load placed on the system in the native and virtual machine configurations are summarized in Table 1.

Metric Native VM
Throughput in transactions per second 66.5K 59.5K
Average CPU utilization of 72 logical CPUs 84.7% 85.1%
Disk IOPS 173K 155K
Disk Megabytes/second 929MB/s 831MB/s
Network packets/second 71K/s receive
71K/s send
63K/s receive
64K/s send
Network Megabytes/second 15MB/s receive
36MB/s send
13MB/s receive
32MB/s send

Table 1. Comparison of Native and Virtual Machine Benchmark Load Profiles


The corresponding guest statistics in Table 2 provide another perspective on the resource-intensive nature of the workload. These common Linux performance metrics show that while the benchmark workload was heavy in terms of raw CPU demands, it also placed a heavy load on the operating system, interrupt handling, and the storage subsystem, areas that have traditionally been associated with high virtualization overheads.


Metric Amount
Interrupts per second 327K
Disk IOPS 155K
Context switches per second 287K
Load average 231

Table 2. Guest OS Statistics

ESXi 6.0 Performance Relative to ESXi 5.1

Experimental data comparing ESXi 6.0 with ESXi 5.1 (see Figures 1 and 2) show that high-end scale-up with ESXi 6.0 mirrors that of native systems.


Figure 1. Absolute throughput values

With ESXi 5.1, the Order-Entry benchmark throughput of a 64-vCPU VM on a 4-socket, 32 core/64 thread E7- 4870 (Westmere) server was 70% of the throughput of the same server in native mode when both servers were running at 77% CPU utilization (the native server reached a maximum CPU utilization of 88% and throughput of 54.8 transactions per second).


Figure 2. Relative throughput ratios

vSphere has the capability to handle loads far larger than that demanded by most Oracle database applications in production. Support for monster VMs with up to 128 vCPUs, throughput which is 90% of native and a significant performance boost over ESXi 5.1, make ESXi 6.0 an excellent platform for virtualizing very high end Oracle databases.

For details regarding experiments and the performance enhancements in vSphere, please read the paper.

VMware vCloud Air Database Performance Scalability with SQL Server

Previous posts have shown vSphere can easily handle running Microsoft SQL Server on four-socket servers with large numbers of cores—with vSphere 5.5 on Westmere-EX and more recently with vSphere 6 on Ivy Bridge-EX.  We recently ran similar tests on vCloud Air to measure how these enterprise databases with mission critical performance requirements perform in a cloud environment. The tests show that SQL Server databases scale very well on vCloud Air with a variety of virtual machine (VM) counts and virtual CPU (vCPU) sizes.

The benchmark tests were run with vCloud Air using their Virtual Private Cloud (VPC) subscription-based service.  This is a very compelling hybrid cloud service that allows for an on-premises vSphere infrastructure to be expanded into the public cloud in a secure and scalable way. The underlying host hardware consisted of two 8-core CPUs for a total of 16 physical cores, which meant that the maximum number of vCPUs was 16 (although additional processors were available via Hyper-Threading, they were not utilized).

Windows Server 2012 R2 was the guest OS, and SQL Server 2012 Standard edition was the database engine used for all the VMs.  All databases were placed on an SSD Accelerated storage tier for maximum disk I/O performance.  The test configurations are summarized below:

# VMs, # vCPUs, Memory configurations tested

DVD Store 2.1 (an open-source OLTP database stress tool) was the workload used to stress the VMs.  The first experiment was to scale up the number of 4 vCPU VMs.  The graph below shows that as the number of VMs is increased from 1 to 4, the aggregate performance (measured in orders per minute, or OPM) increases correspondingly:
When the size of each VM was doubled from 4 to 8 virtual CPUs, the OPM also approximately doubles for the same number of VMs as shown in the chart below.vCA_SQL_8vCPU

This final chart includes a test run with one large 16 vCPU VM.  As expected, the 16 vCPU performance was similar to the four 4vCPU VMs and eight 2vCPU VM test cases.  The slight drop can be attributed to spanning multiple physical processors and thus multiple NUMA nodes within a single VM.


In summary, SQL Server was found to perform and scale extremely well running on vCloud Air with 4, 8, and 16 vCPU VMs.  In the future, look for more benchmarks in the cloud as it continues to evolve!

For more information on vCloud Air, check out these third-party studies from Principled Technologies that compare it to competitive offerings, namely Microsoft Azure and Amazon Web Services (AWS):

Scaling Performance for VAIO in vSphere 6.0 U1

by Chien-Chia Chen

vSphere APIs for I/O Filtering (VAIO) is a framework that enables third-party software developers to implement data services, such as caching and replication, to vSphere. Figure 1 below shows the general architecture of VAIO. Once I/O filter libraries are installed to a virtual disk (VMDK), every I/O request generated from the guest operating system to the VMDK will first be intercepted by the VAIO framework at the file device layer. The VAIO framework then hands over the I/O request to the user space I/O filter libraries, where a series of third party data service operations can be performed against the I/O. After processing the I/O, user space I/O filter libraries return the I/O back to the VAIO framework, which continues the rest of the issuing path. Similarly, upon completion, the I/O will first be processed by the user space I/O filter libraries before continuing its original completion path.

There have been questions around the overhead of the VAIO framework due to its extra user-to-kernel communication. In this blog post, we evaluate the performance of vSphere APIs for I/O Filtering using a null I/O filter and demonstrate how VAIO scales with respect to the number of virtual machines and outstanding I/Os (OIOs). The null I/O filter accepts each I/O request and immediately returns it.


Figure 1. vSphere APIs for I/O Filtering Architecture

System Configuration

The configuration of our systems is as follows:

  • One ESXi host
    • Machine: Dell R720 server running vSphere 6.0 Update 1
    • CPU: 16-core, 2-socket (32 hyper-threads) Intel® Xeon® E5-2665 @ 2.4 GHz
    • Memory: 128GB memory
    • Physical Disk: One Intel® S3700 400GB SATA SSD on LSI MegaRAID SAS controller
    • VM: Up to 32 link-cloned I/O Analyzer 1.6.2 VMs (SUSE Linux Enterprise 11 SP2; 1 virtual CPU (VCPU) and 1GB memory each). Each virtual machine has 1 PVSCSI controller hosting two 1GB VMDKs—one has no I/O filter and another has a null filter, both think-provisioned.
  • Workload: Iometer 4K sequential read (4K-aligned) with various number of OIOs


We conduct two sets of tests separately—one against VMDK without an I/O filter (referred to as “default”) and another against the null-filter VMDK (referred to as “iofilter”). In each set of tests, every virtual machine has one Iometer disk worker to generate 4K sequential read I/Os to the VMDK under test. We have a 2-minute warm-up time and measure I/Os per second (IOPS), normalized CPU cost, and read latency over the next 2-minute test duration. The latency is the median of the average read latencies reported by all Iometer workers.

Note that I/O sizes and access patterns do not affect the performance of VAIO since it does no additional data copying, maintains the original access patterns, and incurs no extra access to physical disks.


VM Scaling

Figures 2 and 3 below show the IOPS, CPU cost per 1K IOPS, and latency with a different number of virtual machines at 128 OIOs. Except for the single virtual machine test, results show that VAIO achieves similar IOPS and has similar latency compared to the default VMDK. However, VAIO introduces 10%-20% higher CPU overhead per 1K IOPS. The single virtual machine IOPS with iofilter is 80% higher than the default VMDK. This is because, in the default case, the VCPU performs the majority of synchronous I/O work; whereas, in the iofilter case, VAIO contexts take over a big portion of the work and unblock the VCPU from generating more I/Os. With additional VCPUs and Iometer disk workers to mitigate the single core bottleneck, the default VMDK is also able to drive over 70K IOPS.

Figure 2. IOPS and CPU Cost vs. Number of VMs (128 Outstanding I/Os)


Figure 3. Iometer Read Latency vs. Number of VMs (128 Outstanding I/Os)


OIO Scaling

Figures 4 and 5 below show the IOPS, CPU cost per 1K IOPS, and latency with a different number of OIOs at 16 virtual machines. A similar trend again holds that VAIO achieves the same IOPS and has the same latency compared to the default VMDK while it incurs 10%-20% higher CPU overhead per 1K IOPS.


Figure 4. Percent of a Core per 1 Thousand IOPS vs. Outstanding I/Os (16 VMs)


Figure 5. Iometer Read Latency vs. Outstanding I/Os (16 VMs)


Based on our evaluation, VAIO achieves comparable throughput and latency performance at a cost of 10%-20% more CPU cycles. From our experience, when using the VAIO framework, we recommend the following general best practices:

  • Reduce CPU over-commitment. The VAIO framework introduces at least one additional context per VMDK with an active filter. Over-committing CPU can result in intensive CPU contention, thus much worse virtualization efficiency.
  • Avoid blocking when developing I/O filter libraries. Keep in mind that an I/O will be blocked until the user space I/O filter finishes processing. Thus additional processing time will result in higher end-to-end latency.
  • Increase concurrency wisely when developing I/O filter libraries. The user space I/O filter can potentially serve I/Os from all VMDKs. Thus, when developing I/O filter libraries, it is important to be flexible in terms of concurrency to avoid a single core CPU bottleneck and meanwhile without introducing too many unnecessary active contexts that cause higher CPU contention.


Dynamic Host-Wide Performance Tuning in VMware vSphere 6.0

by Chien-Chia Chen


The networking stack of vSphere is, by default, tuned to balance the tradeoffs between CPU cost and latency to provide good performance across a wide variety of applications. However, there are some cases where using a tunable provides better performance. An example is Web-farm workloads, or any circumstance where a high consolidation ratio (lots of VMs on a single ESXi host) is preferred over extremely low end-to-end latency. VMware vSphere 6.0 introduces the Dynamic Host-Wide Performance Tuning  feature (also known as dense mode), which provides a single configuration option to dynamically optimize individual ESXi hosts for high consolidation scenarios under certain use cases. Later in this blog, we define those use cases. Right now, we take a look at how dense mode works from an internal viewpoint.

Mitigating Virtualization Inefficiency under High Consolidation Scenarios

Figure 1 shows an example of the thread contexts within a high consolidation environment. In addition to the Virtual CPUs (each labeled VCPU) of the VMs, there are per-VM vmkernel threads (device-emulation, labeled “Dev Emu”, threads in the figure) and multiple vmkernel threads for each Physical NIC (PNIC) executing physical device virtualization code and virtual switching code. One major source of virtualization inefficiency is the frequent context switches among all these threads. While context switches occur due to a variety of reasons, the predominant networking-related reason is Virtual NIC (VNIC) Interrupt Coalescing, namely, how frequently does the vmkernel interrupt the guest for new receive packets (or vice versa for transmit packets). More frequent interruptions are likely to result in lower per-packet latency while increasing virtualization overhead. At very high consolidation ratios, the overhead from increased interrupts hurts performance.

Dense mode uses two techniques to reduce the number of context switches:

  • The VNIC coalescing scheme will be changed to a less aggressive scheme called static coalescing.
    With static coalescing, a fixed number of requests are delivered in each batch of communication between the Virtual Machine Monitor (VMM) and vmkernel. This, in general, reduces the frequency of communication, thus fewer context switches, resulting in better virtualization efficiency.
  • The device emulation vmkernel thread wakeup opportunities are greatly reduced.
    The device-emulation threads now will only be executed either periodically with a longer timer or when the corresponding VCPUs are halted. This optimization largely reduces the frequency that device emulation threads being waken up, so frequency of context switch is also lowered.


Figure 1. High Consolidation Example

Enabling Dense Mode

Dense mode is disabled by default in vSphere 6.0. To enable it, change Net.NetTuneHostMode in the ESXi host’s Advanced System Settings (shown below in Figure 2) to dense.


Figure 2. Enabling Dynamic Host-Wide Performance Tuning
“default” is disabled; “dense” is enabled

Once dense mode is enabled, the system periodically checks the load of the ESXi host (every 60 seconds by default) based on the following three thresholds:

  • Number of VMs ≥ number of PCPUs
  • Number of VCPUs ≥ number of 2 * PCPUs
  • Total PCPU utilization ≥ 50%

When the system load exceeds the above thresholds, these optimizations will be in effect for all regular VMs that carry default settings. When the system load drops below any of the thresholds, those optimizations will be automatically removed from all affected VMs such that the ESXi host performs identical to when dense mode is disabled.

Applicable Workloads

Enabling dense mode can potentially impact performance negatively for some applications. So, before enabling, carefully profile the applications to determine whether or not the workload will benefit from this feature. Generally speaking, the feature improves the VM consolidation ratio on an ESXi host running medium network throughput applications with some latency tolerance and is CPU bounded. A good use case is Web-farm workload, which needs CPU to process Web requests while only generating a medium level of network traffic and having a few milliseconds of tolerance to end-to-end latency. In contrast, if the bottleneck is not at CPU, enabling this feature results in hurting network latency only due to less frequent context switching. For example, the following workloads are NOT good use cases of the feature:

  • X Throughput-intensive workload: Since network is the bottleneck, reducing the CPU cost would not necessarily improve network throughput.
  • X Little or no network traffic: If there is too little network traffic, all the dense mode optimizations barely have any effect.
  • X Latency-sensitive workload: When running latency-sensitive workloads, another set of optimizations is needed and is documented in the “Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5” performance white paper.


To evaluate this feature, we implement a lightweight Web benchmark, which has two lightweight clients and a large number of lightweight Web server VMs. The clients send HTTP requests to all Web servers at a given request rate, wait for responses, and report the response time. The request is for static content and it includes multiple text and JPEG files totaling around 100KB in size. The Web server has memory caching enabled and therefore serves all the content from memory. Two different request rates are used in the evaluation:

  1. Medium request rate: 25 requests per second per server
  2. High request rate: 50 requests per second per server

In both cases, the total packet rate on the ESXi host is around 400 Kilo-Packets/Second (KPPS) to 700 KPPS in each direction, where the receiving packet rate is slightly higher than the transmitting packet rate.

System Configuration

We configured our systems as follows:

  • One ESXi host (running Web server VMs)
    • Machine: HP DL580 G7 server running vSphere 6.0
    • CPU: Four 10-core Intel® Xeon® E7-4870 @ 2.4 GHz
    • Memory: 512 GB memory
    • Physical NIC: Two dual-port Intel X520 with a total of three active 10GbE ports
    • Virtual Switching: One virtual distributed switch (vDS) with three 10GbE uplinks using default teaming policy
    • VM: Red Hat Linux Enterprise Server 6.3 assigned one VCPU, 1GB memory, and one VMXNET3 VNIC
  • Two Clients (generating Web requests)
    • Machine: HP DL585 G7 server running Red Hat Linux Enterprise Server 6.3
    • CPU: Four 8-core AMD Opteron™ 6212 @ 2.6 GHz
    • Memory: 128 GB memory
    • Physical NIC: One dual-port Intel X520 with one active 10GbE port on each client


Medium Request Rate

We first present the evaluation results for medium request rate workloads. Figures 3 and 4 below show the 95th-percentile response time and total host CPU utilization as the number of VMs increase, respectively. For the 95th-percentile response time, we consider 100ms as the preferred latency tolerance.

Figure 3 shows that at 100ms, default mode consolidates only about 470 Web server VMs, whereas dense mode consolidates more than 510 VMs, which is an over 10% improvement. For CPU utilization, we consider 90% is the desired maximum utilization.


Figure 3. Medium Request Rate 95-Percentile Response Time
(Latency Tolerance 100ms)

Figure 4 shows that at 90% utilization, default mode consolidates around 465 Web server VMs, whereas dense mode consolidates about 495 Web server VMs, which is still a nearly 10% improvement in consolidation ratio. We also notice that dense mode, in fact, also reduces response time. This is because the great reduction in context switching improves virtualization efficiency, which compensates the increase in latency due to more aggressive batching.


Figure 4. Medium Request Rate Host Utilization
(Desired Maximum Utilization 90%)

High Request Rate

Figures 5 and 6 below show the 95th-percentile response time and total host CPU utilization for a high request rate as the number of VMs increase, respectively. Because the request rate is doubled, we reduce the number of Web server VMs consolidated on the ESXi host. Figure 5 first shows that at 100ms response time, dense mode only consolidates about 5% more VMs in a medium request rate case (from ~280 VMs to ~290 VMs). However, if we look at the CPU utilization as shown in Figure 6, at 90% desired maximum load, dense mode still consolidates about 10% more VMs (from ~ 240 VMs to ~260 VMs). Considering both response time and utilization metrics, because there are a fewer number of active contexts under the high request rate workload, the benefit of reducing context switches will be less significant compared to a medium request rate case.


Figure 5. High Request Rate 95-Percentile Response Time
(Latency Tolerance 100ms)


Figure 6. High Request Rate Host Utilization
(Desired Maximum Utilization at 90%)


We presented the Dynamic Host-Wide Performance Tuning feature, also known as dense mode. We proved a Web-farm-like workload achieves up to 10% higher consolidation ratio while still meeting 100ms latency tolerance and 90% maximum host utilization. We emphasized that the improvements do not apply to every kind of application. Because of this, you should carefully profile the workloads before enabling dense mode.

VMware Virtual SAN Stretched Cluster Best Practices White Paper

VMware Virtual SAN 6.1 introduced the concept of a stretched cluster which allows the Virtual SAN customer to configure two geographically located sites, while synchronously replicating data between the two sites. A technical white paper about the Virtual SAN stretched cluster performance has now been published. This paper provides guidelines on how to get the best performance for applications deployed on a Virtual SAN stretched cluster environment.

The chart below, borrowed from the white paper, compares the performance of the Virtual SAN 6.1 stretched cluster deployment against the regular Virtual SAN cluster without any fault domains. A nine- node Virtual SAN stretched cluster is considered with two different configurations of inter-site latency: 1ms and 5ms. The DVD Store benchmark is executed on four virtual machines on each host of the nine-node Virtual SAN stretched cluster. The DVD Store performance metrics of cumulated orders per minute in the cluster, read/write IOPs, and average latency are compared with a similar workload on the regular Virtual SAN cluster. The orders per minute (OPM) is lower by 3% and 6% for the 1ms and 5ms inter-site latency stretched cluster compared to the regular Virtual SAN cluster.

Figure 1a.  DVD Store orders per minute in the cluster and guest IOPS comparison

Guest read/write IOPS and latency were also monitored. The read/write mix ratio for the DVD Store workload is roughly at 1/3 read and 2/3 write. Write latency shows an obvious increase trend when the inter-site latency is higher, while the read latency is only marginally impacted. As a result, the average latency increases from 2.4ms to 2.7ms, and 5.1ms for 1ms and 5ms inter-site latency configuration.

Figure 1b.  DVD Store latency comparison

These results demonstrate that the inter-site latency in a Virtual SAN stretched cluster deployment has a marginal performance impact on a commercial workload like DVD Store. More results are available in the white paper.

Measuring Cloud Scalability Using the Weathervane Benchmark

Cloud-based deployments continue to be a hot topic in many of today’s corporations.  Often the discussion revolves around workload portability, ease of migration, and service pricing differences.  In an effort to bring performance into the discussion we decided to leverage VMware’s new benchmark, Weathervane.  As a follow-on to Harold Rosenberg’s introductory Weathervane post we decided to showcase some of the flexibility and scalability of our new large-scale benchmark.  Previously, Harold presented some initial scalability data running on three local vSphere 6 hosts.  For this article, we decided to extend this further by demonstrating Weathervane’s ability to run within a non-VMware cloud environment and scaling up the number of app servers.

Weathervane is a new web-application benchmark architected to simulate modern-day web applications.  It consists of a benchmark application and a workload driver.  Combined, they simulate the behavior of everyday users attending a real-time auction.  For more details on Weathervane I encourage you to review the introductory post.

Environment Configuration:
Cloud Environment: Amazon AWS, US West.
Instance Types: M3.XLarge, M3.Large, C3.Large.
Instance Notes: Database instances utilized an additional 300GB io1 tier data disk.
Instance Operating System: Centos 6.5 x64.
Application: Weathervane Internal Build 084.

Testing Methodology:
All instances were run within the same cloud environment to reduce network-induced latencies.  We started with a base configuration consisting of eight instances.  We then  scaled out the number of workload drivers and application servers in an effort to identify how a cloud environment scaled as application workload needs increased.  We used Weathervane’s FindMax functionality which runs a series of tests to determine the maximum number of users the configuration can sustain while still meeting QoS requirements.  It should be noted that the early experimentation allowed us to identify the maximum needs for the other services beyond the workload drivers and application servers to reduce the likelihood of bottlenecks in these services.  Below is a block diagram of the configurations used for the scaled-out Weathervane deployment.


For our analysis of Weathervane cloud scaling we ran multiple iterations for each scale load level and selected the average.  We automated the process to ensure consistency.  Our results show both the number of users sustained as well as the http requests per second as reported by the benchmark harness.


As you can see in the above graph, for our cloud environment running Weathervane, scaling the number of applications servers yielded nearly linear scaling up to five application servers. The delta in scaling between the number of users and the http requests per second sustained was less than 1%.  Due to time constraints we were unable to test beyond five application servers but we expect that the scaling would have continued upwards well beyond the load levels presented.

Although just a small sample of what Weathervane and cloud environments can scale to, this brief article highlights both the benchmark and cloud environment scaling.  Though Weathervane hasn’t been released publicly yet, it’s easy to see how this type of controlled, scalable benchmark will assist in performance evaluations of a diverse set of environments.  Look for more Weathervane based cloud performance analysis in the future.


Virtualized Storage Performance: RAID Groups versus Storage pools

RAID, a redundant array of independent disks, has traditionally been the foundation of enterprise storage. Grouping multiple disks into one logical unit can vastly increase the availability and performance of storage by protecting against disk failure, allowing greater I/O parallelism, and pooling capacity. Storage pools similarly increase the capacity and performance of storage, but are easier to configure and manage than RAID groups.

RAID groups have traditionally been regarded as offering better and more predictable performance than storage pools. Although both technologies were developed for magnetic hard disk drives (HDDs), solid-state drives (SSDs), which use flash memory, have become prevalent. Virtualized environments are also common and tend to create highly randomized I/O given the fact that multiple workloads are run simultaneously.

We set out to see how the performance of RAID group and storage pool provisioning methods compare in today’s virtualized environments.

First, let’s take a closer look at each storage provisioning type.

RAID Groups

A RAID group unifies a number of disks into one logical unit and distributes data across multiple drives. RAID groups can be configured with a particular protection level depending on the performance, capacity, and redundancy needs of the environment. LUNs are then allocated from the RAID group. RAID groups typically contain only identical drives, and the maximum number of disks in a RAID group varies by system model but is generally below fifty. Because drives typically have well defined performance characteristics, the overall RAID group performance can be calculated as the performance of all drives in the group minus the RAID overhead. To provide consistent performance, workloads with different I/O profiles (e.g., sequential vs. random I/O) or different performance needs should be physically isolated in different RAID groups so they do not share disks.

Storage Pools

Storage pools, or simply ‘pools’, are very similar to RAID groups in some ways. Implementation varies by vendor, but generally pools are made up of one or more private RAID groups, which are not visible to the user, or they are composed of user-configured RAID groups which are added manually to the pool. LUNs are then allocated from the pool. Storage pools can contain up to hundreds of drives, often all the drives in an array. As business needs grow, storage pools can be easily scaled up by adding drives or RAID groups and expanding LUN capacity. Storage pools can contain multiple types and sizes of drives and can spread workloads over more drives for a greater degree of parallelism.

Storage pools are usually required for array features like automated storage tiering, where faster SSDs can serve as a data cache among a larger group of HDDs, as well as other array-level data services like compression, deduplication, and thin provisioning. Because of their larger maximum size, storage pools, unlike RAID groups, can take advantage of vSphere 6 maximum LUN sizes of 64TB.

We used two benchmarks to compare the performance of RAID groups and storage pools: VMmark, which is a virtualization platform benchmark, and I/O Analyzer with Iometer, which is a storage microbenchmark.  VMmark is a multi-host virtualization benchmark that uses diverse application workloads as well as common platform level workloads to model the demands of the datacenter. VMs running a complete set of the application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview. Iometer places high levels of load on the disk, but does not stress any other system resources. Together, these benchmarks give us both a ‘real-world’ and a more focused perspective on storage performance.

VMmark Testing

Array Configuration

Testing was conducted on an EMC VNX5800 block storage SAN with Fibre Channel. This was one of the many storage solutions which offered both RAID group and storage pool technologies. Disks were 200GB single-level cell (SLC) SSDs. Storage configuration followed array best practices, including balancing LUNs across Storage Processors and ensuring that RAID groups and LUNs did not span the array bus. One way to optimize SSD performance is to leave up to 50% of the SSD capacity unutilized, also known as overprovisioning. To follow this best practice, 50% of the RAID group or storage pool was not allocated to any LUN. Since overprovisioning SSDs can be an expensive proposition, we also tested the same configuration with 100% of the storage pool or RAID group allocated.

RAID Group Configuration

Four RAID 5 groups were used, each composed of 15 SSDs. RAID 5 was selected for its suitability for general purpose workloads. RAID 5 provides tolerance against a single disk failure. For best performance and capacity, RAID 5 groups should be sized to multiples of five or nine drives, so this group maintains a multiple of the preferred five-drive count. One LUN was created in each of the four RAID groups. The LUN was sized to either 50% of the RAID group (Best Practices) or 100% (Fully Allocated). For testing, the capacity of each LUN was fully utilized by VMmark virtual machines and randomized data.

RAID Group Configuration VMmark Storage Comparison        VMmark Storage Pool Configuration Storage Comparison

Storage Pool Configuration

A single RAID 5 Storage Pool containing all 60 SSDs was used. Four thick LUNs were allocated from the pool, meaning that all of the storage space was reserved on the volume. LUNs were equivalent in size and consumed a total of either 50% (Best Practices) or 100% (Fully Allocated) of the pool capacity.

Storage Layout

Most of the VMmark storage load was created by two types of virtual machines: database (DVD Store) and mail server (Microsoft Exchange). These virtual machines were isolated on two different LUNs. The remaining virtual machines were spread across the remaining two LUNs. That is, in the RAID group case, storage-heavy workloads were physically isolated in different RAID groups, but in the storage pool case, all workloads shared the same pool.

Systems Under Test: Two Dell PowerEdge R720 servers
Configuration Per Server:  
     Virtualization Platform: VMware vSphere 6.0. VMs used virtual hardware version 11 and current VMware Tools.
     CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled.
     Memory: 256GB ECC DDR3 @ 1866MHz
     Host Bus Adapter: QLogic ISP2532 DualPort 8Gb Fibre Channel to PCI Express
     Network Controller: One Intel 82599EB dual-port 10 Gigabit PCIe Adapter, one Intel I350 Dual-Port Gigabit PCIe Adapter

Each configuration was tested at three different load points: 1 tile (the lowest load level), 7 tiles (an approximate mid-point), and 13 tiles, which was the maximum number of tiles that still met Quality of Service (QoS) requirements. All datapoints represent the mean of two tests of each configuration.

VMmark Results

RAID Group vs. Storage Pool Performance comparison using VMmark benchmark

Across all load levels tested, the VMmark performance score, which is a function of application throughput, was similar regardless of storage provisioning type. Neither the storage type used nor the capacity allocated affected throughput.

VMmark 2.5 performance scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. We wanted to see how storage configuration affected application latency as opposed to the VMmark score. All latencies are normalized to the lowest 1-tile results.

Storage configuration did not affect VMmark application latencies.

Application Latency in VMmark Storage Comparison RAID Group vs Storage Pool

Lastly, we measured read and write I/O latencies: esxtop Average Guest MilliSec/Write and Average Guest MilliSec/Read. This is the round trip I/O latency as seen by the Guest operating system.

VMmark Storage Latency Storage Comparison RAID Group vs Storage Pool

No differences emerged in I/O latencies.

I/O Analyzer with Iometer Testing

In the second set of experiments, we wanted to see if we would find similar results while testing storage using a synthetic microbenchmark. I/O Analyzer is a tool which uses Iometer to drive load on a Linux-based virtual machine then collates the performance results. The benefit of using a microbenchmark like Iometer is that it places heavy load on just the storage subsystem, ensuring that no other subsystem is the bottleneck.


Testing used a VNX5800 array and RAID 5 level as in the prior configuration, but all storage configurations spanned 9 SSDs, also a preferred drive count. In contrast to the prior test, the storage pool or RAID group spanned an identical number of disks, so that the number of disks per LUN was the same in both configurations. Testing used nine disks per LUN to achieve greater load on each disk.

The LUN was sized to either 50% or 100% of the storage group. The LUN capacity was fully occupied with the I/O Analyzer worker VM and randomized data.  The I/O Analyzer Controller VM, which initiates the benchmark, was located on a separate array and host.

Storage Configuration Iometer with Storage Pool and RAID Group

Testing used one I/O Analyzer worker VM. One Iometer worker thread drove storage load. The size of the VM’s virtual disk determines the size of the active dataset, so a 100GB thick-provisioned virtual disk on VMFS-5 was chosen to maximize I/O to the disk and minimize caching. We tested at a medium load level using a plausible datacenter I/O profile, understanding, however, that any static I/O profile will be a broad generalization of real-life workloads.

Iometer Configuration

  • 1 vCPU, 2GB memory
  • 70% read, 30% write
  • 100% random I/O to model the “I/O blender effect” in a virtualized environment
  • 4KB block size
  • I/O aligned to sector boundaries
  • 64 outstanding I/O
  • 60 minute warm up period, 60 minute measurement period
Systems Under Test: One Dell PowerEdge R720 server
Configuration Per Server:  
     Virtualization Platform: VMware vSphere 6.0. Worker VM used the I/O Analyzer default virtual hardware version 7.
     CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled.
     Memory: 256GB ECC DDR3 @ 1866MHz
     Host Bus Adapter: QLogic ISP2532 DualPort 8Gb Fibre Channel to PCI Express

Iometer results

Iometer Latency Results Storage Comparison RAID Group vs Storage PoolIometer Throughput Results Storage Comparison RAID Group vs Storage Pool

In Iometer testing, the storage pool showed slightly improved performance compared to the RAID group, and the amount of capacity allocated also did not affect performance.

In both our multi-workload and synthetic microbenchmark scenarios, we did not observe any performance penalty of choosing storage pools over RAID groups on an all-SSD array, even when disparate workloads shared the same storage pool. We also did not find any performance benefit at the application or I/O level from leaving unallocated capacity, or overprovisioning, SSD RAID groups or storage pools. Given the ease of management and feature-based benefits of storage pools, including automated storage tiering, compression, deduplication, and thin provisioning, storage pools are an excellent choice in today’s datacenters.

SQL Server VM Performance on VMware vSphere 6

Last October, I blogged about SQL Server performance with vSphere 5.5 using a four-socket Intel Xeon processor E7 based host.  Now that vSphere 6 is available, I’ve run an updated set of tests using this new release, on an even more powerful host, with Xeon E7 v2 processors.  A variety of virtual CPU (vCPU) and virtual machine (VM) quantities were tested to show that vSphere can handle hundreds of thousands of online transaction processing (OLTP) database operations per minute.

DVD Store 2.1, an open-source OLTP database stress tool, was the workload used to stress the VMs.  The first experiment in the paper was a generational performance comparison between the old and new setups; as you can see, there is a dramatic increase in throughput, even though the size of each VM has doubled from 8 vCPUs per VM to 16:

Generational performance improvement from old study to new study

There are also tests using CPU affinity to show the performance differences between physical cores and logical processors (Hyper-Threads), the benefit of “right-sizing” virtual machines, and measuring the impact of the advanced Latency Sensitivity setting. 

For more details and the test results, please download the whitepaper: Performance Characterization of Microsoft SQL Server on VMware vSphere 6.