Home > Blogs > VMware VROOM! Blog > Tag Archives: virtualization

Tag Archives: virtualization

Dynamic Host-Wide Performance Tuning in VMware vSphere 6.0

by Chien-Chia Chen


The networking stack of vSphere is, by default, tuned to balance the tradeoffs between CPU cost and latency to provide good performance across a wide variety of applications. However, there are some cases where using a tunable provides better performance. An example is Web-farm workloads, or any circumstance where a high consolidation ratio (lots of VMs on a single ESXi host) is preferred over extremely low end-to-end latency. VMware vSphere 6.0 introduces the Dynamic Host-Wide Performance Tuning  feature (also known as dense mode), which provides a single configuration option to dynamically optimize individual ESXi hosts for high consolidation scenarios under certain use cases. Later in this blog, we define those use cases. Right now, we take a look at how dense mode works from an internal viewpoint.

Mitigating Virtualization Inefficiency under High Consolidation Scenarios

Figure 1 shows an example of the thread contexts within a high consolidation environment. In addition to the Virtual CPUs (each labeled VCPU) of the VMs, there are per-VM vmkernel threads (device-emulation, labeled “Dev Emu”, threads in the figure) and multiple vmkernel threads for each Physical NIC (PNIC) executing physical device virtualization code and virtual switching code. One major source of virtualization inefficiency is the frequent context switches among all these threads. While context switches occur due to a variety of reasons, the predominant networking-related reason is Virtual NIC (VNIC) Interrupt Coalescing, namely, how frequently does the vmkernel interrupt the guest for new receive packets (or vice versa for transmit packets). More frequent interruptions are likely to result in lower per-packet latency while increasing virtualization overhead. At very high consolidation ratios, the overhead from increased interrupts hurts performance.

Dense mode uses two techniques to reduce the number of context switches:

  • The VNIC coalescing scheme will be changed to a less aggressive scheme called static coalescing.
    With static coalescing, a fixed number of requests are delivered in each batch of communication between the Virtual Machine Monitor (VMM) and vmkernel. This, in general, reduces the frequency of communication, thus fewer context switches, resulting in better virtualization efficiency.
  • The device emulation vmkernel thread wakeup opportunities are greatly reduced.
    The device-emulation threads now will only be executed either periodically with a longer timer or when the corresponding VCPUs are halted. This optimization largely reduces the frequency that device emulation threads being waken up, so frequency of context switch is also lowered.


Figure 1. High Consolidation Example

Enabling Dense Mode

Dense mode is disabled by default in vSphere 6.0. To enable it, change Net.NetTuneHostMode in the ESXi host’s Advanced System Settings (shown below in Figure 2) to dense.


Figure 2. Enabling Dynamic Host-Wide Performance Tuning
“default” is disabled; “dense” is enabled

Once dense mode is enabled, the system periodically checks the load of the ESXi host (every 60 seconds by default) based on the following three thresholds:

  • Number of VMs ≥ number of PCPUs
  • Number of VCPUs ≥ number of 2 * PCPUs
  • Total PCPU utilization ≥ 50%

When the system load exceeds the above thresholds, these optimizations will be in effect for all regular VMs that carry default settings. When the system load drops below any of the thresholds, those optimizations will be automatically removed from all affected VMs such that the ESXi host performs identical to when dense mode is disabled.

Applicable Workloads

Enabling dense mode can potentially impact performance negatively for some applications. So, before enabling, carefully profile the applications to determine whether or not the workload will benefit from this feature. Generally speaking, the feature improves the VM consolidation ratio on an ESXi host running medium network throughput applications with some latency tolerance and is CPU bounded. A good use case is Web-farm workload, which needs CPU to process Web requests while only generating a medium level of network traffic and having a few milliseconds of tolerance to end-to-end latency. In contrast, if the bottleneck is not at CPU, enabling this feature results in hurting network latency only due to less frequent context switching. For example, the following workloads are NOT good use cases of the feature:

  • X Throughput-intensive workload: Since network is the bottleneck, reducing the CPU cost would not necessarily improve network throughput.
  • X Little or no network traffic: If there is too little network traffic, all the dense mode optimizations barely have any effect.
  • X Latency-sensitive workload: When running latency-sensitive workloads, another set of optimizations is needed and is documented in the “Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5” performance white paper.


To evaluate this feature, we implement a lightweight Web benchmark, which has two lightweight clients and a large number of lightweight Web server VMs. The clients send HTTP requests to all Web servers at a given request rate, wait for responses, and report the response time. The request is for static content and it includes multiple text and JPEG files totaling around 100KB in size. The Web server has memory caching enabled and therefore serves all the content from memory. Two different request rates are used in the evaluation:

  1. Medium request rate: 25 requests per second per server
  2. High request rate: 50 requests per second per server

In both cases, the total packet rate on the ESXi host is around 400 Kilo-Packets/Second (KPPS) to 700 KPPS in each direction, where the receiving packet rate is slightly higher than the transmitting packet rate.

System Configuration

We configured our systems as follows:

  • One ESXi host (running Web server VMs)
    • Machine: HP DL580 G7 server running vSphere 6.0
    • CPU: Four 10-core Intel® Xeon® E7-4870 @ 2.4 GHz
    • Memory: 512 GB memory
    • Physical NIC: Two dual-port Intel X520 with a total of three active 10GbE ports
    • Virtual Switching: One virtual distributed switch (vDS) with three 10GbE uplinks using default teaming policy
    • VM: Red Hat Linux Enterprise Server 6.3 assigned one VCPU, 1GB memory, and one VMXNET3 VNIC
  • Two Clients (generating Web requests)
    • Machine: HP DL585 G7 server running Red Hat Linux Enterprise Server 6.3
    • CPU: Four 8-core AMD Opteron™ 6212 @ 2.6 GHz
    • Memory: 128 GB memory
    • Physical NIC: One dual-port Intel X520 with one active 10GbE port on each client


Medium Request Rate

We first present the evaluation results for medium request rate workloads. Figures 3 and 4 below show the 95th-percentile response time and total host CPU utilization as the number of VMs increase, respectively. For the 95th-percentile response time, we consider 100ms as the preferred latency tolerance.

Figure 3 shows that at 100ms, default mode consolidates only about 470 Web server VMs, whereas dense mode consolidates more than 510 VMs, which is an over 10% improvement. For CPU utilization, we consider 90% is the desired maximum utilization.


Figure 3. Medium Request Rate 95-Percentile Response Time
(Latency Tolerance 100ms)

Figure 4 shows that at 90% utilization, default mode consolidates around 465 Web server VMs, whereas dense mode consolidates about 495 Web server VMs, which is still a nearly 10% improvement in consolidation ratio. We also notice that dense mode, in fact, also reduces response time. This is because the great reduction in context switching improves virtualization efficiency, which compensates the increase in latency due to more aggressive batching.


Figure 4. Medium Request Rate Host Utilization
(Desired Maximum Utilization 90%)

High Request Rate

Figures 5 and 6 below show the 95th-percentile response time and total host CPU utilization for a high request rate as the number of VMs increase, respectively. Because the request rate is doubled, we reduce the number of Web server VMs consolidated on the ESXi host. Figure 5 first shows that at 100ms response time, dense mode only consolidates about 5% more VMs in a medium request rate case (from ~280 VMs to ~290 VMs). However, if we look at the CPU utilization as shown in Figure 6, at 90% desired maximum load, dense mode still consolidates about 10% more VMs (from ~ 240 VMs to ~260 VMs). Considering both response time and utilization metrics, because there are a fewer number of active contexts under the high request rate workload, the benefit of reducing context switches will be less significant compared to a medium request rate case.


Figure 5. High Request Rate 95-Percentile Response Time
(Latency Tolerance 100ms)


Figure 6. High Request Rate Host Utilization
(Desired Maximum Utilization at 90%)


We presented the Dynamic Host-Wide Performance Tuning feature, also known as dense mode. We proved a Web-farm-like workload achieves up to 10% higher consolidation ratio while still meeting 100ms latency tolerance and 90% maximum host utilization. We emphasized that the improvements do not apply to every kind of application. Because of this, you should carefully profile the workloads before enabling dense mode.

VMware Virtual SAN Stretched Cluster Best Practices White Paper

VMware Virtual SAN 6.1 introduced the concept of a stretched cluster which allows the Virtual SAN customer to configure two geographically located sites, while synchronously replicating data between the two sites. A technical white paper about the Virtual SAN stretched cluster performance has now been published. This paper provides guidelines on how to get the best performance for applications deployed on a Virtual SAN stretched cluster environment.

The chart below, borrowed from the white paper, compares the performance of the Virtual SAN 6.1 stretched cluster deployment against the regular Virtual SAN cluster without any fault domains. A nine- node Virtual SAN stretched cluster is considered with two different configurations of inter-site latency: 1ms and 5ms. The DVD Store benchmark is executed on four virtual machines on each host of the nine-node Virtual SAN stretched cluster. The DVD Store performance metrics of cumulated orders per minute in the cluster, read/write IOPs, and average latency are compared with a similar workload on the regular Virtual SAN cluster. The orders per minute (OPM) is lower by 3% and 6% for the 1ms and 5ms inter-site latency stretched cluster compared to the regular Virtual SAN cluster.

Figure 1a.  DVD Store orders per minute in the cluster and guest IOPS comparison

Guest read/write IOPS and latency were also monitored. The read/write mix ratio for the DVD Store workload is roughly at 1/3 read and 2/3 write. Write latency shows an obvious increase trend when the inter-site latency is higher, while the read latency is only marginally impacted. As a result, the average latency increases from 2.4ms to 2.7ms, and 5.1ms for 1ms and 5ms inter-site latency configuration.

Figure 1b.  DVD Store latency comparison

These results demonstrate that the inter-site latency in a Virtual SAN stretched cluster deployment has a marginal performance impact on a commercial workload like DVD Store. More results are available in the white paper.

Large Receive Offload (LRO) Support for VMXNET3 Adapters with Windows VMs on vSphere 6

Large Receive Offload (LRO) is a technique to reduce the CPU time for processing TCP packets that arrive from the network at a high rate. LRO reassembles incoming packets into larger ones (but fewer packets) to deliver them to the network stack of the system. LRO processes fewer packets, which reduces its CPU time for networking. Throughput can be improved accordingly since more CPU is available to deliver additional traffic. On Windows, LRO is also referred to as Receive Segment Coalescing (RSC).

LRO has been supported for Linux VMs with kernel 2.6.24 and later since vSphere 4. With the introduction of Windows Server 2012 and Windows 8 supporting LRO, vSphere 6 now adds support for LRO on a VMXNET3 adapter on Windows VMs. LRO is especially beneficial in the virtualized environment in which resources are shared by multiple VMs. This blog shows the performance benefits of using LRO for Windows VMs running on vSphere 6.

Test-Bed Setup

The test bed consists of a vSphere 6.0 host running VMs and a client machine that drives workload generation. Both machines have dual-socket, 6-core 2.9GHz Intel Xeon E5-2667 (Sandy Bridge) processors. The client machine is configured with native Red Hat Enterprise Linux 6 that generates TCP flows. The VMs in the vSphere host run Windows 2012 Server and are configured with 4 vCPUs and 2GB RAM. Both machines have an Intel 82599EB 10Gbps adapter installed, which are connected using a 10 Gigabit Ethernet (GbE) switch.

Performance Results

1. Native to VM Traffic

Figures 1 and 2 show a CPU efficiency and throughput comparison with and without LRO when TCP streams are generated from the client machine to a VM running on the vSphere host. Netperf is used to generate traffic. Three different message sizes are used: 256 bytes, 16KB, 64KB. The socket size is set to 8K, 64K, and 256K respectively. The message size is used to determine the number of bytes that Netperf delivers to the TCP stack in the client machine, which then determines the actual packet sizes. The NIC in the client machine splits packets with a size larger than the MTU (1500 is used for this blog) into smaller MTU-sized ones before sending them out. Once packets are received in the vSphere host, LRO aggregates packets and delivers larger packets (but smaller in number) to the receiving VM running in the host. This process is done in either hardware (that is, the physical NIC) or software (that is, the vSphere networking stack) depending on the NIC type and configuration. In this blog, LRO is performed in the vSphere networking stack before packets are delivered to the VM.

CPU efficiency is calculated by dividing the throughput by the number of CPU cores used for both the hypervisor and the VM, representing how much throughput in gigabits per second (Gbps) a single CPU core can receive. For example, a CPU efficiency of 5 means the system can handle 5Gbps with one core. Therefore, a higher CPU efficiency is desirable.

As shown in Figures 1 and 2, using LRO considerably improves both CPU efficiency and throughput with all three message sizes. Figure 1 shows that CPU efficiency improves by 86%, 25%, and 33% with 256 byte, 16KB, and 64KB messages respectively, when compared to the case without LRO. Figure 2 shows throughput improves 54% for 256 byte messages (0.6Gbps to 0.9Gbps) and 5% for 16KB messages (9.0Gbps to 9.4Gbps). There is not much difference for 64KB packets (9.5Gbps to 9.4Gbps, this is within a rage of normal variance). With 16KB and 64KB messages, throughput with LRO is already at line rate (that is, 10Gbps), which is why the improvement is not as significant as CPU efficiency.

Figure 3 compares the packet rate and the average packet size between the messages with LRO and without LRO. They are measured right before packets are delivered to the VM (but after LRO is performed for the LRO configuration). It is clearly shown that fewer but larger packets are delivered to the VM with LRO. For example, with 64KB messages, the packet rate delivered to the VM decreases from 815K packets per second (pps) to 113Kpps with LRO, while the packet size increases from 1.5KB to 10.5KB. The number of interrupts generated for the guest also becomes smaller accordingly, helping to improve overall CPU efficiency and throughput.

When the client machine generates packets, those with a size larger than the MTU are split into smaller MTU-sized packets before being sent out. With a larger message size, more MTU-sized packets are produced and the packet rate received by the vSphere host increases accordingly. This is why the packet rate becomes higher for 16KB and 64KB messages than 256 byte messages without LRO in Figure 3. LRO aggregates the received packets before they hit the VM so the packet rate remains low regardless of the message size in the figure.

Figure 1

Figure 1. CPU efficiency comparison with and without LRO with Native-VM traffic

Figure 2. Throughput comparison with and without LRO with Native-VM traffic

Figure 2. Throughput comparison with and without LRO with Native-VM traffic

Figure 3. Packet rate and size comparison with and without LRO with Native-VM traffic, when packets are delivered to the VM

Figure 3. Packet rate and size comparison with and without LRO with Native-VM traffic, when packets are delivered to the VM

2. VM to VM Local Traffic

LRO is also beneficial in VM-VM local traffic where VMs are located in the same host, communicating with each other through a virtual switch. Figures 4 and 5 depict CPU efficiency and throughput comparisons with and without LRO with two VMs on the same host sending and receiving TCP flows. The same message and socket sizes as the Native-VM tests above are used.

From Figure 4, LRO improves CPU efficiency by 15%, 92%, and 90% with 256 bytes, 16KB, and 64KB byte messages respectively, when compared to the case without LRO. Figure 5 shows throughput also improves by 20% with 256 byte messages (0.8Gbps to 1.0Gbps), by 103% with 16KB messages (9.0Gbps to 18.4Gbps), and by 142% with 64KB (11.7Gbps to 28.4Gbps).

Without LRO, big packets with a size larger than the MTU need to be split before delivered to the receiving VM, similar to Native-VM traffic. This is because the receiver cannot handle those big packets. LRO saves the time spent in both splitting packets and receiving smaller packets since packet splitting also happens on the vSphere host with VM-VM Local traffic. This explains why the improvement with 16KB and 64KB messages is higher than that of Native-VM traffic. The absolute CPU efficiency in VM-VM local traffic can become lower than that in Native-VM traffic since the CPU time of both sending and receiving VMs are included for this calculation.

As expected, the packet rate decreases while the average packet size increases with LRO as shown in Figure 6. For example, with 64KB messages, the packet rate delivered to the VM becomes reduced from 1009Kpps to 240Kpps, while the packet size gets increased from 1.6KB to 14.9KB.

The packet rate becomes higher for 256 byte messages with LRO, most likely because round-trip time (RTT) gets reduced due to the use of LRO. With the average packet size being similar to each other between LRO and without LRO, this effectively helps to improve throughput and correspondingly CPU efficiency, as seen in Figure 4 and 5.

Figure 4. CPU efficiency with and without LRO with VM-VM Local traffic

Figure 4. CPU efficiency with and without LRO with VM-VM Local traffic

Figure 5. Throughput comparison with and without LRO with VM-VM Local traffic

Figure 5. Throughput comparison with and without LRO with VM-VM Local traffic

Figure 6. Packet rate and size comparison with and without LRO with VM-VM Local traffic, when packets are delivered to the VM

Figure 6. Packet rate and size comparison with and without LRO with VM-VM Local traffic, when packets are delivered to the VM

Enable or Disable LRO on a VMXNET3 Adapter on a Windows VM

LRO is enabled by default for VMXNET3 adapters on vSphere 6.0 hosts, but you must set RSC to be enabled globally for Windows 8 VMs. For more information about configuring this, see the documentation.


This blog shows that enabling LRO for Windows Server 2012 and Windows 8 VMs on a vSphere host using VMXNET3 considerably enhances the CPU efficiency and correspondingly improves throughput for TCP traffic.

Virtualized Storage Performance: RAID Groups versus Storage pools

RAID, a redundant array of independent disks, has traditionally been the foundation of enterprise storage. Grouping multiple disks into one logical unit can vastly increase the availability and performance of storage by protecting against disk failure, allowing greater I/O parallelism, and pooling capacity. Storage pools similarly increase the capacity and performance of storage, but are easier to configure and manage than RAID groups.

RAID groups have traditionally been regarded as offering better and more predictable performance than storage pools. Although both technologies were developed for magnetic hard disk drives (HDDs), solid-state drives (SSDs), which use flash memory, have become prevalent. Virtualized environments are also common and tend to create highly randomized I/O given the fact that multiple workloads are run simultaneously.

We set out to see how the performance of RAID group and storage pool provisioning methods compare in today’s virtualized environments.

First, let’s take a closer look at each storage provisioning type.

RAID Groups

A RAID group unifies a number of disks into one logical unit and distributes data across multiple drives. RAID groups can be configured with a particular protection level depending on the performance, capacity, and redundancy needs of the environment. LUNs are then allocated from the RAID group. RAID groups typically contain only identical drives, and the maximum number of disks in a RAID group varies by system model but is generally below fifty. Because drives typically have well defined performance characteristics, the overall RAID group performance can be calculated as the performance of all drives in the group minus the RAID overhead. To provide consistent performance, workloads with different I/O profiles (e.g., sequential vs. random I/O) or different performance needs should be physically isolated in different RAID groups so they do not share disks.

Storage Pools

Storage pools, or simply ‘pools’, are very similar to RAID groups in some ways. Implementation varies by vendor, but generally pools are made up of one or more private RAID groups, which are not visible to the user, or they are composed of user-configured RAID groups which are added manually to the pool. LUNs are then allocated from the pool. Storage pools can contain up to hundreds of drives, often all the drives in an array. As business needs grow, storage pools can be easily scaled up by adding drives or RAID groups and expanding LUN capacity. Storage pools can contain multiple types and sizes of drives and can spread workloads over more drives for a greater degree of parallelism.

Storage pools are usually required for array features like automated storage tiering, where faster SSDs can serve as a data cache among a larger group of HDDs, as well as other array-level data services like compression, deduplication, and thin provisioning. Because of their larger maximum size, storage pools, unlike RAID groups, can take advantage of vSphere 6 maximum LUN sizes of 64TB.

We used two benchmarks to compare the performance of RAID groups and storage pools: VMmark, which is a virtualization platform benchmark, and I/O Analyzer with Iometer, which is a storage microbenchmark.  VMmark is a multi-host virtualization benchmark that uses diverse application workloads as well as common platform level workloads to model the demands of the datacenter. VMs running a complete set of the application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview. Iometer places high levels of load on the disk, but does not stress any other system resources. Together, these benchmarks give us both a ‘real-world’ and a more focused perspective on storage performance.

VMmark Testing

Array Configuration

Testing was conducted on an EMC VNX5800 block storage SAN with Fibre Channel. This was one of the many storage solutions which offered both RAID group and storage pool technologies. Disks were 200GB single-level cell (SLC) SSDs. Storage configuration followed array best practices, including balancing LUNs across Storage Processors and ensuring that RAID groups and LUNs did not span the array bus. One way to optimize SSD performance is to leave up to 50% of the SSD capacity unutilized, also known as overprovisioning. To follow this best practice, 50% of the RAID group or storage pool was not allocated to any LUN. Since overprovisioning SSDs can be an expensive proposition, we also tested the same configuration with 100% of the storage pool or RAID group allocated.

RAID Group Configuration

Four RAID 5 groups were used, each composed of 15 SSDs. RAID 5 was selected for its suitability for general purpose workloads. RAID 5 provides tolerance against a single disk failure. For best performance and capacity, RAID 5 groups should be sized to multiples of five or nine drives, so this group maintains a multiple of the preferred five-drive count. One LUN was created in each of the four RAID groups. The LUN was sized to either 50% of the RAID group (Best Practices) or 100% (Fully Allocated). For testing, the capacity of each LUN was fully utilized by VMmark virtual machines and randomized data.

            RAID Group Configuration VMmark Storage Comparison        VMmark Storage Pool Configuration Storage Comparison

Storage Pool Configuration

A single RAID 5 Storage Pool containing all 60 SSDs was used. Four thick LUNs were allocated from the pool, meaning that all of the storage space was reserved on the volume. LUNs were equivalent in size and consumed a total of either 50% (Best Practices) or 100% (Fully Allocated) of the pool capacity.

Storage Layout

Most of the VMmark storage load was created by two types of virtual machines: database (DVD Store) and mail server (Microsoft Exchange). These virtual machines were isolated on two different LUNs. The remaining virtual machines were spread across the remaining two LUNs. That is, in the RAID group case, storage-heavy workloads were physically isolated in different RAID groups, but in the storage pool case, all workloads shared the same pool.

Systems Under Test: Two Dell PowerEdge R720 servers
Configuration Per Server:  
     Virtualization Platform: VMware vSphere 6.0. VMs used virtual hardware version 11 and current VMware Tools.
     CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled.
     Memory: 256GB ECC DDR3 @ 1866MHz
     Host Bus Adapter: QLogic ISP2532 DualPort 8Gb Fibre Channel to PCI Express
     Network Controller: One Intel 82599EB dual-port 10 Gigabit PCIe Adapter, one Intel I350 Dual-Port Gigabit PCIe Adapter

Each configuration was tested at three different load points: 1 tile (the lowest load level), 7 tiles (an approximate mid-point), and 13 tiles, which was the maximum number of tiles that still met Quality of Service (QoS) requirements. All datapoints represent the mean of two tests of each configuration.

VMmark Results

RAID Group vs. Storage Pool Performance comparison using VMmark benchmark

Across all load levels tested, the VMmark performance score, which is a function of application throughput, was similar regardless of storage provisioning type. Neither the storage type used nor the capacity allocated affected throughput.

VMmark 2.5 performance scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. We wanted to see how storage configuration affected application latency as opposed to the VMmark score. All latencies are normalized to the lowest 1-tile results.

Storage configuration did not affect VMmark application latencies.

Application Latency in VMmark Storage Comparison RAID Group vs Storage Pool

Lastly, we measured read and write I/O latencies: esxtop Average Guest MilliSec/Write and Average Guest MilliSec/Read. This is the round trip I/O latency as seen by the Guest operating system.

VMmark Storage Latency Storage Comparison RAID Group vs Storage Pool

No differences emerged in I/O latencies.

I/O Analyzer with Iometer Testing

In the second set of experiments, we wanted to see if we would find similar results while testing storage using a synthetic microbenchmark. I/O Analyzer is a tool which uses Iometer to drive load on a Linux-based virtual machine then collates the performance results. The benefit of using a microbenchmark like Iometer is that it places heavy load on just the storage subsystem, ensuring that no other subsystem is the bottleneck.


Testing used a VNX5800 array and RAID 5 level as in the prior configuration, but all storage configurations spanned 9 SSDs, also a preferred drive count. In contrast to the prior test, the storage pool or RAID group spanned an identical number of disks, so that the number of disks per LUN was the same in both configurations. Testing used nine disks per LUN to achieve greater load on each disk.

The LUN was sized to either 50% or 100% of the storage group. The LUN capacity was fully occupied with the I/O Analyzer worker VM and randomized data.  The I/O Analyzer Controller VM, which initiates the benchmark, was located on a separate array and host.

Storage Configuration Iometer with Storage Pool and RAID Group

Testing used one I/O Analyzer worker VM. One Iometer worker thread drove storage load. The size of the VM’s virtual disk determines the size of the active dataset, so a 100GB thick-provisioned virtual disk on VMFS-5 was chosen to maximize I/O to the disk and minimize caching. We tested at a medium load level using a plausible datacenter I/O profile, understanding, however, that any static I/O profile will be a broad generalization of real-life workloads.

Iometer Configuration

  • 1 vCPU, 2GB memory
  • 70% read, 30% write
  • 100% random I/O to model the “I/O blender effect” in a virtualized environment
  • 4KB block size
  • I/O aligned to sector boundaries
  • 64 outstanding I/O
  • 60 minute warm up period, 60 minute measurement period
Systems Under Test: One Dell PowerEdge R720 server
Configuration Per Server:  
     Virtualization Platform: VMware vSphere 6.0. Worker VM used the I/O Analyzer default virtual hardware version 7.
     CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled.
     Memory: 256GB ECC DDR3 @ 1866MHz
     Host Bus Adapter: QLogic ISP2532 DualPort 8Gb Fibre Channel to PCI Express

Iometer results

Iometer Latency Results Storage Comparison RAID Group vs Storage PoolIometer Throughput Results Storage Comparison RAID Group vs Storage Pool

In Iometer testing, the storage pool showed slightly improved performance compared to the RAID group, and the amount of capacity allocated also did not affect performance.

In both our multi-workload and synthetic microbenchmark scenarios, we did not observe any performance penalty of choosing storage pools over RAID groups on an all-SSD array, even when disparate workloads shared the same storage pool. We also did not find any performance benefit at the application or I/O level from leaving unallocated capacity, or overprovisioning, SSD RAID groups or storage pools. Given the ease of management and feature-based benefits of storage pools, including automated storage tiering, compression, deduplication, and thin provisioning, storage pools are an excellent choice in today’s datacenters.

Scaling Web 2.0 Applications using Docker containers on vSphere 6.0

by Qasim Ali

In a previous VROOM post, we showed that running Redis inside Docker containers on vSphere adds little to no overhead and observed sizeable performance improvements when scaling out the application when compared to running containers directly on the native hardware. This post analyzes scaling Web 2.0 applications using Docker containers on vSphere and compares the performance of running Docker containers on native and vSphere. This study shows that Docker containers add negligible overhead when run on vSphere, and also that the performance using virtual machines is very close to native and, in certain cases, slightly better due to better vSphere scheduling and isolation.

Web 2.0 applications are an integral part of Enterprise and small business IT offerings. We use the CloudStone benchmark, which simulates a typical Web 2.0 technology use in the workplace for our study [1] [2]. It includes a Web 2.0 social-events application (Olio) and a client implemented using the Faban workload generator [3]. It is an open source benchmark that simulates activities related to social events. The benchmark consists of three main components: a Web server, a database backend, and a client to emulate real world accesses to the Web server. The overall architecture of CloudStone is depicted in Figure 1.

Figure 1: CloudStone architecture

The benchmark reports latency for various user actions. These metrics were compared against a fixed threshold. Studies indicate that users are less likely to visit a Web site if the response time is greater than 250 milliseconds [4]. This number can be used as an upper bound for latency for frequent operations (Home-Page, TagSearch, EventDetail, and Login). For the less frequent operations (AddEvent, AddPerson, and PersonDetail), a less restrictive threshold of 500 milliseconds can be used. Table 1 shows the exact mix/frequency of various operations.

Operation Number of Operations Mix
HomePage 141908 26.14%
Login 55473 10.22%
TagSearch 181126 33.37%
EventDetail 134144 24.71%
PersonDetail 14393 2.65%
AddPerson 4662 0.86%
AddEvent 11087 2.04%

Table 1: CloudStone operations frequency for 1500 users

Benchmark Components and Experimental Set up

The test system was installed with a CloudStone implementation of a MySQL database, NGINX Web server with PHP scripts, and a Tomcat application server provided by the Faban harness. The default configuration was used for the workload generator. All components of the application ran on a single host, and the client ran in a separate virtual machine on a separate host. Both hosts were connected using a direct link between a pair of 10Gbps NICs. One client-server pair provided a single, independent CloudStone instance. Scaling was achieved by running additional instances of CloudStone.

Deployment Scenarios

We used the following three deployment scenarios for this study:

  • Native-Docker: One or more CloudStone instances were run inside Docker containers (2 containers per CloudStone instance: one for the Web server and another for the database backend) running on the native OS.
  • VM: CloudStone instances were run inside one or more virtual machines running on vSphere 6.0; the guest OS is the same as the native scenario.
  • VM-Docker: CloudStone instances were run inside Docker containers that were running inside one or more virtual machines.

Hardware/Software/Workload Configuration

The following are the details about the hardware and software used in the various experiments discussed in the next section:

Server Host:

  • Dell PowerEdge R820
  • CPU: 4 x Intel® Xeon® CPU E5-4650 @ 2.30GHz (32 cores, 64 hyper-threads)
  • Memory: 512GB
  • Hardware configuration: Hyper-Threading (HT) ON, Turbo-boost ON, Power policy: Static High (that is, no power management)
  • Network: 10Gbps
  • Storage: 7 x 250GB 15K RPM 4Gb SAS  Disks

Client Host:

  • Dell PowerEdge R710
  • CPU: 2 x Intel® Xeon® CPU X5680 @ 3.33GHz (12 cores, 24 hyper-threads)
  • Memory: 144GB
  • Hardware configuration: HT ON, Turbo-boost ON, Power policy: Static High (that is, no power management)
  • Network: 10Gbps
  • Client VM: 2-vCPU 4GB vRAM

Host OS:

  • Ubuntu 14.04.1
  • Kernel 3.13

Docker Configuration:

  • Docker 1.2
  • Ubuntu 14.04.1 base image
  • Host volumes for database and images
  • Configured with host networking to avoid Docker NAT overhead
  • Device mapper as the storage backend driver


  • VMware vSphere 6.0 (pre-release build)

VM Configurations:

  • Single VM: An 8-vCPU 4GB VM ( Web server and database running in a single VM)
  • Two VMs: One 6-vCPU 2GB Web server VM and one 2-vCPU 2GB database VM (CloudStone instance running in two VMs)
  • Scale-out: Eight 8-vCPU 4GB VMs

Workload Configurations:

  • The NGINX Web server was configured with 4 worker processes and 4096 connections per worker.
  • PHP was configured with a maximum number of 16 child processes.
  • The Web server and the database were preconfigured with 1500 users per CloudStone instance.
  • A runtime of 30 minutes with a 5 minute ramp-up and ramp-down periods (less than 1% run-to-run  variation) was used.


First, we ran a single instance of CloudStone in the various configurations mentioned above. This was meant to determine the raw overhead of Docker containers running on vSphere vs. the native configuration, eliminating scheduling differences. Second, we picked the configuration that performed best in a single instance and scaled it out to run multiple instances.

Figure 2 shows the mean latency of the most frequent operations and Figure 3 shows the mean latency of less frequent operations.

Figure 2: Results of single instance CloudStone experiments for frequently used operations


Figure 3: Results of single instance CloudStone experiments for less frequently used

We configured the benchmark to use a single VM and deployed the Web server and database applications in it (configuration labelled VM-1VM in Figure 2 and 3). We then ran the same workload in Docker containers in a single VM (VMDocker-1VM). The latencies are slightly higher than native, which is expected due to some virtualization overhead. However, running Docker containers on a VM showed no additional overhead. In fact, it seems to be slightly better. We believe this might be due to the device mapper using twice as much page cache as the VM (device mapper uses a loopback device to mount the file system and, hence, data ends up being cached twice in the buffer cache). We also tried AuFS as a storage backend for our container images, but that seemed to add some CPU and latency overhead, and, for this reason, we switched to device mapper. We then configured the VM to use the vSphere Latency Sensitivity feature [5] (VM-1VM-lat and VMDocker-1VM-lat labels). As expected, this configuration reduced the latencies even further because each vCPU got exclusive access to a core and this reduced scheduling overhead.  However, this feature cannot be used when the VM (or VMs) has more vCPUs than the number of cores available on the system (that is, the physical CPUs are over-committed) because each vCPU needs exclusive access to a core.

Next, we configured the workload to use two VMs, one for the Web server and the other for the database application. This configuration ended up giving slightly higher latencies because the network packets have to traverse the virtualization layer from one VM to the other, while in the prior experiments they were confined within the same VM.

Finally, we scaled out the CloudStone workload with 12,000 users by using eight 8-vCPU VMs with 1500 users per instance. The VM configurations were the same as the VM-1VM and VMDocker-1VM cases above. The average system CPU core utilization was around 70-75%, which is the typical average CPU utilization for latency sensitive workloads because it allows for headroom to absorb traffic bursts. Figure 4 reports mean latencies of all operations (latencies were averaged across all eight instances of CloudStone for each operation), while Figure 5 reports the 90th percentile latencies (the benchmark reports these latencies in 20 millisecond granularity as evident from Figure 5.)


Figure 4: Scale-out experiments using eight instances of CloudStone (mean latency)


Figure 5: Scale-out experiments using eight instances of CloudStone (90th percentile latency)

The latencies shown in Figure 4 and 5 are well below the 250 millisecond threshold. We observed that the latencies on vSphere are very close to native or, in certain cases, slightly better than native (for example, Login, AddPerson and AddEvent operations). The latencies were better than native due to better vSphere scheduling and isolation, resulting in better cache/memory locality. We verified this by pinning container instances on specific sockets and making the native scheduler behavior similar to vSphere. After doing that, we observed that latencies in the native case got better and they were similar or slightly better than vSphere.

Note: Introducing artificial affinity between processes and cores is not a recommended practice because it is error-prone and can, in general, lead to unexpected or suboptimal results.


VMs and Docker containers are truly “better together.” The CloudStone scale-out system, using out-of-the-box VM and VM-Docker configurations, clearly achieves very close to, or slightly better than, native performance.


[1] W. Sobel, S. Subramanyam, A. Sucharitakul, J. Nguyen, H. Wong, A. Klepchukov, S. Patil, O. Fox and D. Patterson, “CloudStone: Multi-Platform, Multi-Languarge Benchmark and Measurement Tools for Web 2.8,” 2008.
[2] N. Grozev, “Automated CloudStone Setup in Ubuntu VMs / Advanced Automated CloudStone Setup in Ubuntu VMs [Part 2],” 2 June 2014. https://nikolaygrozev.wordpress.com/tag/cloudstone/.
[3] java.net, “Faban Harness and Benchmark Framework,” 11 May 2014. http://java.net/projects/faban/.
[4] S. Lohr, “For Impatient Web Users, an Eye Blink Is Just Too Long to Wait,” 29 February 2012. http://www.nytimes.com/2012/03/01/technology/impatient-web-usersflee-slow-loading-sites.html.
[5] J. Heo, “Deploying Extremely Latency-Sensitive Applications in VMware vSphere 5.5,” 18 September 2013.   http://blogs.vmware.com/performance/2013/09/deploying-extremely-latency-sensitive-applications-in-vmware-vsphere-5-5.html.




Introducing the Zephyr Benchmark

The ways in which we use, design, deploy, and evaluate the performance of large-scale web applications have changed significantly in recent years.  These changes have been driven by the increase in computing capacity and flexibility provided by virtualized and cloud-based computing infrastructures. The majority of these changes are not reflected in current web-application benchmarks.

Zephyr is a new web-application benchmark we have been developing as part of our work on optimizing the performance of VMware products for the next generation of cloud-scale applications. The goal of the Zephyr project has been to develop an application-level benchmark that captures the key characteristics of the workloads, design paradigms, deployment architectures, and performance metrics of the next generation of large-scale web applications. As we approach the initial release of Zephyr, we are starting to use it to understand performance across our product range.  In this post, we will give an overview of Zephyr that will provide context for the performance results that we will be writing about over the coming months.

Zephyr Motivation

There have been many changes in usage patterns and development practices for large-scale web applications.  The design and development of Zephyr has been driven by the goal of capturing these changes in a highly scalable benchmark that includes these key aspects:

  • The effect of increased user interactivity and rich web interfaces on workload patterns
  • New design patterns for decoupled and asynchronous services
  • The use of multiple data sources for data with varying scalability and consistency requirements
  • Flexible architectures that allow for deployment on a wide range of virtual and cloud-based infrastructures

The effect of increased user interactivity and rich web interfaces is one of the most important of these aspects. In current benchmarks, a user is represented by a single thread operating independently from other users. Contrast that to the way we interact with applications as diverse as social media and stock trading. Many user interactions, such as responding to a status update or selling shares of stock, are in direct response to the actions of other users.  In addition, the current generation of script-rich web interfaces performs many operations asynchronously without any action from, or even awareness by, the user.  Examples include web pages and rich client interfaces that update active friend lists, check for messages, or maintain stock tickers.  This leads to a very different model of user behavior than the traditional single-threaded, click-and-think design used by existing benchmarks.  As a result, one of the key design goals for Zephyr was to develop both a benchmark application and a workload generator that would allow us to capture the effect of these new workload patterns.

Zephyr Overview

An application-level benchmark typically consists of two main parts: the benchmark application and the workload driver.  The application is selected and designed to represent characteristics and technology choices that are typical of a certain class of applications.  The workload driver interacts with the benchmark application to simulate the behavior of typical users of the application.   It also captures the performance metrics that are used to quantify the performance of the application/infrastructure combination. Some benchmarks, including Zephyr, also provide a run harness that assists in the set-up and automation of benchmark runs.

Zephyr’s benchmark application is LiveAuction, which is a web application for managing and hosting real-time auctions. An auction hosted by LiveAuction consists of a number of items that will be placed up for bid in a set order.  Users are given only a limited time to bid before an item is sold and the next item is placed up for bid.  When an item is up for bid, all users attending the auction are presented with a description and image of the item.  Users see and respond to bids placed by other users. LiveAuction can support thousands of simultaneous auctions with large numbers of active users, with each user possibly attending multiple, simultaneous auctions.   The figure below shows the browser application used to interact with the LiveAuction application.  This figure shows the bidding screen for a user who is attending two auctions.  The current item, bid, and bid status for each auction are updated in real-time in response to bids placed by other users.

LiveAuctionScreenFigure 1. LiveAuction bidding screen

In addition to managing live auctions, LiveAuction provides auction and item search, profile management, historical data queries, image management, auction management, and other services that would be required by a user of the application.

LiveAuction uses a scalable architecture that allows deployments to be easily sized for a large range of user loads.  A full deployment of LiveAuction includes a wide variety of support services, such as load-balancing, caching, and messaging servers, as well as relational, NoSQL, and filesystem-based data stores supporting scalability for data with a variety of consistency requirements.  The figure below shows a full deployment of LiveAuction and the Zephyr workload driver.

logicalLayoutFullFigure 2. Logical layout for full Zephyr deployment

The following is a brief description of the role played by each tier.

Infrastructure Services

TCP Load Balancers: The simulated users on the workload driver address the application through a set of IP addresses mapped to the application’s external hostname.  The TCP load balancers jointly manage these IP addresses to ensure that all IP addresses remain available in the event of a failure. The TCP load balancers distribute the load across the web servers while maintaining SSL/TLS session affinity.

Messaging Servers: The application nodes use the messaging backbone to distribute work and state-change information regarding active auctions.

Application Services

Web Servers: The web servers terminate SSL, serve static content, act as load-balancing reverse proxies for the application servers, and provide a proxy cache for application content, such as images returned by the application servers.

Application Servers: The application servers run Java servlet containers in which the application services are deployed.  The LiveAuction application services use a stateless implementation with a RESTful interface that simplifies scaling.

Data Services

Relational Database: The relational database is used for all data that is involved in transactions.  This includes user account information, as well as auction, item, and high-bid data.

NoSQL Data Server:  The NoSQL Document Store is used to store image metadata as well as activity data such as auction attendance information and bid records. It can also be used to store uploaded images. Using the NoSQL store as an image store allows the application to take advantage of its sharding capabilities to easily scale the I/O capacity for image storage.

File Server: The file server is used exclusively to store item images uploaded by users.  Note that the file server is optional, as the images can be stored and served from the NoSQL document store.

Zephyr currently includes configuration support for deploying LiveAuction using the following services:

  • Virtual IP Address Management: Keepalived
  • TCP Load Balancer: HAProxy
  • Web Server: Apache Httpd and Nginx
  • Application Server:  Apache Tomcat with EHcache for in-memory caching
  • Messaging Server: RabbitMQ
  • Relational Database: MySQL and PostgreSQL
  • NoSQL Data Store: MongoDB
  • Network Filesystem: NFS

Additional implementations will be supported in future releases.

Zephyr can be deployed with different subsets of the infrastructure and application services.  For example, the figure below shows a minimal deployment of Zephyr with a single application server and the supporting data services.  In this configuration, the application server performs the tasks handled by the web server in a larger deployment.

logicalLayoutMinimalFigure 3. Logical layout for a minimal Zephyr deployment

The Zephyr workload driver has been developed to drive HTTP-based loads for modern scalable web applications.  It can simulate workloads for applications that incorporate asynchronous behaviors using embedded JavaScript, and those requiring complex data-driven behaviors, as in web applications with significant inter-user interaction.  The Zephyr workload driver uses an asynchronous design with a small number of threads supporting a large number of simulated users. Simulated users may have multiple active asynchronous activities which share state information, and complex workload patterns can be specified with control-flow decisions made based on retrieved state and operation history. These features allow us to efficiently simulate workloads that would be presented to web applications by rich web clients using asynchronous JavaScript operations.

The Zephyr workload driver also monitors quality-of-service (QoS) metrics for both the LiveAuction application and the overall workload. The application-level QoS requirements are based on the 99th percentile response-times for the individual operations.  An operation represents a single action performed by a user or embedded script, and may consist of multiple HTTP exchanges.  The workload-level QoS requirements define the required mix of operations that must be performed by the users during the workload’s steady state.  This mix must be consistent from run to run in order for the results to be comparable.  In order for a run of the benchmark to pass, all QoS requirements must be satisfied.

Zephyr also includes a run harness that automates most of the steps involved in configuring and running the benchmark.  The harness takes as input a configuration file that describes the deployment configuration, the user load, and many service-specific tuning parameters.  The harness is then able to power on virtual machines, configure and start the various software services, deploy the software components of LiveAuction, run the workload, and collect the results, as well as the log, configuration, and statistics files from all of the virtual machines and services.  The harness also manages the tasks involved in loading and preparing the data in the data services before each run.


Scaling to large deployments is a key goal of Zephyr.  Therefore, it will be useful to conclude with some initial scalability data to show how we are doing in achieving that goal. There are many possible ways to scale up a deployment of LiveAuction.  For the sake of providing a straightforward comparison, we will focus on scaling out the number of application server instances in an otherwise fixed deployment configuration.  The CPU utilization of the application server is typically the performance bottleneck in a well-balanced LiveAuction deployment.

The figure below shows the logical layout of the VMs and services in this deployment.  Physically, all VMs reside on the same network subnet on the vSphere hosts, which are connected by a 10Gb Ethernet switch.

Blog1LayoutFigure 4. Deployment configuration for scaling results

The VMs in the LiveAuction deployment were distributed across three VMware vSphere 6 hosts.  Table 1 gives the hardware details of the hosts.

Host Name Host Vendor/Model Processors Memory
Host1 Dell PowerEdge R720
2-Socket Server
Intel® Xeon® CPU E5-2690 @ 2.90GHz
8 Core, 16 Thread
Host2 Dell PowerEdge R720
2-Socket Server
Intel® Xeon® CPU E5-2690 @ 2.90GHz
8 Core, 16 Thread
Host3 Dell PowerEdge R720
2-Socket Server
Intel® Xeon® CPU E5-2680 @ 2.70GHz
8 Core, 16 Thread

Table 1. vSphere 6 hosts for LiveAuction deployment

Table 2 shows the configuration of the VMs, and their assignment to vSphere hosts.  As the goal of these tests was to examine the scalability of the LiveAuction application, and not the characteristics of vSphere 6, we chose the VM sizing and assignment in part to avoid using more virtual CPUs than physical cores. While we did some tuning of the overall configuration, we did not necessarily obtain the optimal tuning for each of the service configurations.  The configuration was chosen so that the application server was the bottleneck as far as possible within the restrictions of the available physical servers.  In future posts, we will examine the tuning of the individual services, tradeoffs in deployment configurations, and best practices for deploying LiveAuction-like applications on vSphere.

Service Host VM vCPUs (each) VM Memory
HAProxy 1 Host1 2 8GB
HAProxy 2 Host2 2 8GB
HAProxy 3 Host3 2 8GB
Nginx 1, 2, and 3 Host3 2 8GB
RabbitMQ 1 Host2 1 2GB
RabbitMQ 2 Host1 1 2GB
Tomcat 1, 3, 5, 7, and 9 Host1 2 8GB
Tomcat 2, 4, 6, 8, and 10 Host2 2 8GB
MongoDB 1 and 3 Host2 1 32GB
MongoDB 2 and 4 Host1 1 32GB
PostgreSQL Host3 6 32GB

Table 2. Virtual machine configuration

Figure 5 shows the peak load that can be supported by this deployment configuration as the number of application servers is scaled from one to ten.  The peak load supported by a configuration is the maximum load at which the configuration can satisfy all of the QoS requirements of the workload.  The dotted line shows linear scaling of the maximum load extrapolated from the single application server result.  The actual scaling is essentially linear up to six application-server VMs.  At that point, the overall utilization of the physical servers starts to affect the ability to maintain linear scaling.  With seven application servers, the web-server tier becomes a scalability bottleneck, but there are not sufficient CPU cores available to add additional web servers.

It would require additional infrastructure to determine how far the linear scaling could be extended.  However, the current results provide strong evidence that with sufficient resources, Zephyr will be able to scale to support very large loads representing large numbers of users.

scalabilityFigure 5. Maximum supported users for increasing number of application servers


The discussion in this post has focused on the use of Zephyr as a traditional single-application benchmark with a focus on throughput and response-time performance metrics.  However, that only scratches the surface of our future plans for Zephyr.  We are currently working on extending Zephyr to capture more cloud-centric performance metrics.  These fall into two broad categories that we call multi-tenancy metrics and elasticity metrics.  Multi-tenancy metrics capture the performance characteristics of a cloud-deployed application in the presence of other applications co-located on the same physical resources.  The relevant performance metrics include isolation and fairness along with the traditional throughput and response-time metrics.  Elasticity metrics capture the performance characteristics of self-scaling applications in the presence of changing loads.  It is also possible to study elasticity metrics in the context of multi-tenancy environments, thus examining the impact of shared resources on the ability of an application to scale in a timely manner to satisfy user demands.  These are all exciting new areas of application performance, and we will have more to say about these subjects as we approach Zephyr 1.0.

VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines

vSphere 6 introduces the ability to run virtual machines (VMs) with up to 128 virtual CPUs (vCPUs) and 4TB of RAM. This doubles the number of vCPUs supported from the previous version and increases the amount of RAM by four times. This new capability provides the potential for customers to run larger workloads than ever before in a virtual machine.

A series of tests were run with a virtual machine hosting Oracle 12c database instances. The DVD Store 2.1 open-source transactional workload was used to measure the performance of a large “Monster” VM on vSphere 6. The Oracle 12c database VM was scaled from 15 vCPUs all the way up to 120 vCPUs, and the maximum achieved throughput was measured. The full results and test details have been published in a white paper – VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines.

A four-socket Intel Xeon E7-4890 v2 processor based server with 1TB of memory was used to host the virtual machine for the tests.  Each Xeon E7-4890 v2 processor has 15 cores / 30 threads with Hyper Threading enabled for a total of 60 cores / 120 threads for the system. The diagram below shows the basic test configuration.



In all tests Hyper-Threading was enabled on the server, but in configurations where 60 vCPUs or less are assigned to the VM, Hyper-Threads are not used by the VM. This is a result of the default scheduling policy where the preference is for vCPUs to be scheduled on one thread per core before using the second thread of any core. This first set of results, shown below, is focused on the tests that scale up to 60 vCPUs. These tests show the scaling for the virtual machine without the use of Hyper-Threads


While vSphere 6 supports up to 128 vCPUs per VM, these tests were limited to 120 vCPUs due to the number of threads available on the server. The largest VM configuration used both hardware execution threads (Hyper-Threads) on all the processor cores in order to reach 120 vCPUs. In this case, there is one vCPU per execution thread.

Hyper-Threading doubles the number of execution threads, but it does not double performance. In order to measure the scale-up performance of the 120-vCPU VM, a 60-vCPU VM was configured with CPU affinity so that it was limited to only two of the server’s four sockets. In this configuration the 60-vCPU VM has one vCPU per execution thread, which is the same as the 120-vCPU VM.  Configuring a 60-vCPU VM in this way makes it easy to see the scale up performance at 120 vCPUs on this server with hyper-threads enabled.

The results of the scale-up testing using the 60-vCPU VM configured with CPU affinity to only 2 sockets and the 120-vCPU VM using all four sockets showed approximately linear scaling, as shown in the graph below.


For full test details and more test results please see the white paper that has was recently published.

The new larger “Monster” VM support in vSphere 6 allows for virtual machines that can support larger workloads than ever before with excellent performance. These tests show that large virtual machines running on vSphere 6 can scale up as needed to meet extreme performance demands.


Docker Containers Performance in VMware vSphere

By  Qasim Ali,  Banit Agrawal, and Davide Bergamasco


“Containers without compromise” – This was one of the key messages at VMworld 2014 USA in San Francisco. It was presented in the opening keynote, and then the advantages of running Docker containers inside of virtual machines were discussed in detail in several breakout sessions. These include security/isolation guarantees and also the existing rich set of management functionalities. But some may say, “These benefits don’t come for free: what about the performance overhead of running containers in a VM?”

A recent report compared the performance of a Docker container to a KVM VM and showed very poor performance in some micro-benchmarks and real-world use cases: up to 60% degradation. These results were somewhat surprising to those of us accustomed to near-native performance of virtual machines, so we set out to do similar experiments with VMware vSphere. Below, we present our findings of running Docker containers in a vSphere VM and  in a native configuration. Briefly,

  • We find that for most of these micro-benchmarks and Redis tests, vSphere delivered near-native performance with generally less than 5% overhead.
  • Running an application in a Docker container in a vSphere VM has very similar overhead of running containers on a native OS (directly on a physical server).

Next, we present the configuration and benchmark details as well as the performance results.

Deployment Scenarios

We compare four different scenarios as illustrated below:

  • Native: Linux OS running directly on hardware (Ubuntu, CentOS)
  • vSphere VM: Upcoming release of vSphere with the same guest OS as native
  • Native-Docker: Docker version 1.2 running on a native OS
  • VM-Docker: Docker version 1.2 running in guest VM on a vSphere host

In each configuration all the power management features are disabled in the BIOS and Ubuntu OS.

Test Scenarios

Figure 1: Different test scenarios


For this study, we used the micro-benchmarks listed below and also simulated a real-world use case.

–   Micro-benchmarks:

  • LINPACK: This benchmark solves a dense system of linear equations. For large problem sizes it has a large working set and does mostly floating point operations.
  • STREAM: This benchmark measures memory bandwidth across various configurations.
  • FIO: This benchmark is used for I/O benchmarking for block devices and file systems.
  • Netperf: This benchmark is used to measure network performance.

Real-world workload:

  • Redis: In this experiment, many clients perform continuous requests to the Redis server (key-value datastore).

For all of the tests, we run multiple iterations and report the average of multiple runs.

Performance Results


LINPACK solves a dense system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system of N equations, converts that time into a performance rate, and tests the results for accuracy. We used an optimized version of the LINPACK benchmark binary based on the Intel Math Kernel Library (MKL).

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 32 vCPU VM with 45K and 65K problem sizes


Figure 2: LINPACK performance for different test scenarios

We disabled HT for this run as recommended by the benchmark guidelines to get the best peak performance. For the 45K problem size, the benchmark consumed about 16GB memory. All memory was backed by transparent large pages. For VM results, large pages were used both in the guest (transparent large pages) and at the hypervisor level (default for vSphere hypervisor). There was 1-2% run-to-run variation for the 45K problem size. For 65K size, 33.8GB memory was consumed and there was less than 1% variation.

As shown in Figure 2, there is almost negligible virtualization overhead in the 45K problem size. For a bigger problem size, there is some inherent hardware virtualization overhead due to nested page table walk. This results in the 5% drop in performance observed in the VM case. There is no additional overhead of running the application in a Docker container in a VM compared to running the application directly in the VM.


We used a NUMA-aware  STREAM benchmark, which is the classical STREAM benchmark extended to take advantage of NUMA systems. This benchmark measures the memory bandwidth across four different operations: Copy, Scale, Add, and Triad.

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, HT enabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 64 vCPU VM (Hyper-Threading ON)


Figure 3: STREAM performance for different test scenarios

We used an array size of 2 billion, which used about 45GB of memory. We ran the benchmark with 64 threads both in the native and virtual cases. As shown in Figure 3, the VM added about 2-3% overhead across all four operations. The small 1-2% overhead of using a Docker container on a native platform is probably in the noise margin.


We used Flexible I/O (FIO) tool version 2.1.3 to compare the storage performance for the native and virtual configurations, with Docker containers running in both. We created a 10GB file in a 400GB local SSD drive and used direct I/O for all our tests so that there were no effects of buffer caching inside the OS. We used a 4k I/O size and tested three different I/O profiles: random 100% read, random 100% write, and a mixed case with random 70% read and 30% write. For the 100% random read and write tests, we selected 8 threads and an I/O depth of 16, whereas for the mixed test, we select an I/O depth of 32 and 8 threads. We use the taskset to set the CPU affinity on FIO threads in all configurations. All the details of the experimental setup are given below:

Hardware: 2 socket Intel Xeon E5-2660 2.2GHz with 392GB RAM, 16 total cores, Hyper-Threading enabled
Guest: 32-vCPU  14.04.1 Ubuntu 64-bit server with 256GB RAM, with a separate ext4 disk in the guest (on VMFS5 in vSphere run)
Benchmark:  FIO, Direct I/O, 10GB file
I/O Profile:  4k I/O, Random Read/Write: depth 16, jobs 8, Mixed: depth 32, jobs 8


Figure 4: FIO benchmark performance for different test scenarios

The figure above shows the normalized maximum IOPS achieved for different configurations and different I/O profiles. For random read in a VM, we see that there is about 2% reduction in maximum achievable IOPS when compared to the native case. However, for the random write and mixed tests, we observed almost the same performance (within the noise margin) compared to the native configuration.


Netperf is used to measure throughput and latency of networking operations. All the details of the experimental setup are given below:

Hardware (Server): 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Hardware (Client): 2 socket Intel Xeon X5570 2.93GHz with 64GB RAM, 8 cores total, Hyper-Threading disabled
Networking hardware: Broadcom Corporation NetXtreme II BCM57810
Software on server and Client: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 2 vCPU VM with 4GB RAM

The server machine for Native is configured to have only 2 CPUs online for fair comparison with a 2-vCPU VM. The client machine is also configured to have 2 CPUs online to reduce variability. We tested four configurations: directly on the physical hardware (Native), in a Docker container (Native-Docker), in a virtual machine (VM), and in a Docker container inside a VM (VM-Docker). For the two Docker deployment scenarios, we also studied the effect of using host networking as opposed to the Docker bridge mode (default operating mode), resulting in two additional configurations (Native-Docker-HostNet and VM-Docker-HostNet) making total six configurations.

We used TCP_STREAM and TCP_RR tests to measure the throughput and round-trip network latency between the server machine and the client machine using a direct 10Gbps Ethernet link between two NICs. We used standard network tuning like TCP window scaling and setting socket buffer sizes for the throughput tests.


Figure 5: Netperf Recieve performance for different test scenarios


Figure 6: Netperf transmit performance for different test scenarios

Figure 5 and Figure 6 shows the unidirectional throughput over a single TCP connection with standard 1500 byte MTU for both transmit and receive TCP_STREAM cases (We used multiple Streams in VM-Docker* transmit case to reduce the variability in runs due to Docker bridge overhead and get predictable results). Throughput numbers for all configurations are identical and equal to the maximum possible 9.40Gbps on a 10GbE NIC.


Figure 7: Netperf TCP_RR performance for different test scenarios (Lower is better)

For the latency tests, we used the latency sensitivity feature introduced in vSphere5.5 and applied the best practices for tuning latency in a VM as mentioned in this white paper. As shown in Figure 7, latency in a VM with VMXNET3 device is only 15 microseconds more than in the native case because of the hypervisor networking stack. If users wish to reduce the latency even further for extremely latency- sensitive workloads, pass-through mode or SR-IOV can be configured to allow the guest VM to bypass the hypervisor network stack. This configuration can achieve similar round-trip latency to native, as shown in Figure 8. The Native-Docker and VM-Docker configuration adds about 9-10 microseconds of overhead due to the Docker bridge NAT function. A Docker container (running natively or in a VM) when configured to use host networking achieves similar latencies compared to the latencies observed when not running the workload in a container (native or a VM).


Figure 8: Netperf TCP_RR performance for different test scenarios (VMs in pass-through mode)


We also wanted to take a look at how Docker in a virtualized environment performs with real world applications. We chose Redis because: (1) it is a very popular application in the Docker space (based on the number of pulls of the Redis image from the official Docker registry); and (2) it is very demanding on several subsystems at once (CPU, memory, network), which makes it very effective as a whole system benchmark.

Our test-bed comprised two hosts connected by a 10GbE network. One of the hosts ran the Redis server in different configurations as mentioned in the netperf section. The other host ran the standard Redis benchmark program, redis-benchmark, in a VM.

The details about the hardware and software used in the experiments are the following:

Hardware: HP ProLiant DL380e Gen8 2 socket Intel Xeon E5-2470 2.3GHz with 96GB RAM, 16 total cores, Hyper-Threading enabled
Guest OS: CentOS 7
VM: 16 vCPU, 93GB RAM
Application: Redis 2.8.13
Benchmark: redis-benchmark, 1000 clients, pipeline: 1 request, operations: SET 1 Byte
Software configuration: Redis thread pinned to CPU 0 and network interrupts pinned to CPU 1

Since Redis is a single-threaded application, we decided to pin it to one of the CPUs and pin the network interrupts to an adjacent CPU in order to maximize cache locality and avoid cross-NUMA node memory access.  The workload we used consists of 1000 clients with a pipeline of 1 outstanding request setting a 1 byte value with a randomly generated key in a space of 100 billion keys.  This workload is highly stressful to the system resources because: (1) every operation results in a memory allocation; (2) the payload size is as small as it gets, resulting in very large number of small network packets; (3) as a consequence of (2), the frequency of operations is extremely high, resulting in complete saturation of the CPU running Redis and a high load on the CPU handling the network interrupts.

We ran five experiments for each of the above-mentioned configurations, and we measured the average throughput (operations per second) achieved during each run.  The results of these experiments are summarized in the following chart.


Figure 9: Redis performance for different test scenarios

The results are reported as a ratio with respect to native of the mean throughput over the 5 runs (error bars show the range of variability over those runs).

Redis running in a VM has slightly lower performance than on a native OS because of the network virtualization overhead introduced by the hypervisor. When Redis is run in a Docker container on native, the throughput is significantly lower than native because of the overhead introduced by the Docker bridge NAT function. In the VM-Docker case, the performance drop compared to the Native-Docker case is almost exactly the same small amount as in the VM-Native comparison, again because of the network virtualization overhead.  However, when Docker runs using host networking instead of its own internal bridge, near-native performance is observed for both the Docker on native hardware and Docker in VM cases, reaching 98% and 96% of the maximum throughput respectively.

Based on the above results, we can conclude that virtualization introduces only a 2% to 4% performance penalty.  This makes it possible to run applications like Redis in a Docker container inside a VM and retain all the virtualization advantages (security and performance isolation, management infrastructure, and more) while paying only a small price in terms of performance.


In this blog, we showed that in addition to the well-known security, isolation, and manageability advantages of virtualization, running an application in a Docker container in a vSphere VM adds very little performance overhead compared to running the application in a Docker container on a native OS. Furthermore, we found that a container in a VM delivers near native performance for Redis and most of the micro-benchmark tests we ran.

In this post, we focused on the performance of running a single instance of an application in a container, VM, or native OS. We are currently exploring scale-out applications and the performance implications of deploying them on various combinations of containers, VMs, and native operating systems.  The results will be covered in the next installment of this series. Stay tuned!


Power Management and Performance in ESXi 5.1

Powering and cooling are a substantial portion of datacenter costs. Ideally, we could minimize these costs by optimizing the datacenter’s energy consumption without impacting performance. The Host Power Management feature, which has been enabled by default since ESXi 5.0, allows hosts to reduce power consumption while boosting energy efficiency by putting processors into a low-power state when not fully utilized.

Power management can be controlled by the either the BIOS or the operating system. In the BIOS, manufacturers provide several types of Host Power Management policies. Although they vary by vendor, most include “Performance,” which does not use any power saving techniques, “Balanced,” which claims to increase energy efficiency with minimal or no impact to performance, and “OS Controlled,” which passes power management control to the operating system. The “Balanced” policy is variably known as “Performance per Watt,” “Dynamic” and other labels; consult your vendor for details. If “OS Controlled” is enabled in the BIOS, ESXi will manage power using one of the policies “High performance,” “Balanced,” “Low power,” or “Custom.” We chose to study Balanced because it is the default setting.

But can the Balanced setting, whether controlled by the BIOS or ESXi, reduce performance relative to the Performance setting? We have received reports from customers who have had performance problems while using the BIOS-controlled Balanced setting. Without knowing the effect of Balanced on performance and energy efficiency, when performance is at a premium users might select the Performance policy to play it safe. To answer this question we tested the impact of power management policies on performance and energy efficiency using VMmark 2.5.

VMmark 2.5 is a multi-host virtualization benchmark that uses varied application workloads as well as common datacenter operations to model the demands of the datacenter. VMs running diverse application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview.

We tested three policies: the BIOS-controlled Performance setting, which uses no power management techniques, the ESXi-controlled Balanced setting (with the BIOS set to OS-Controlled mode), and the BIOS-controlled Balanced setting. The ESXi Balanced and BIOS-controlled Balanced settings cut power by reducing processor frequency and voltage among other power saving techniques.

We found that the ESXi Balanced setting did an excellent job of preserving performance, with no measurable performance impact at all levels of load. Not only was performance on par with expectations, but it did so while producing consistent improvements in energy efficiency, even while idle. By comparison, the BIOS Balanced setting aggressively saved power but created higher latencies and reduced performance. The following results detail our findings.

Testing Methodology
All tests were conducted on a four-node cluster running VMware vSphere 5.1. We compared performance and energy efficiency of VMmark between three power management policies: Performance, the ESXi-controlled Balanced setting, and the BIOS-controlled Balanced setting, also known as “Performance per Watt (Dell Active Power Controller).”

Systems Under Test: Four Dell PowerEdge R620 servers
CPUs (per server): One Eight-Core Intel® Xeon® E5-2665 @ 2.4 GHz, Hyper-Threading enabled
Memory (per server): 96GB DDR3 ECC @ 1067 MHz
Host Bus Adapter: Two QLogic QLE2562, Dual Port 8Gb Fibre Channel to PCI Express
Network Controller: One Intel Gigabit Quad Port I350 Adapter
Hypervisor: VMware ESXi 5.1.0
Storage Array: EMC VNX5700
62 Enterprise Flash Drives (SSDs), RAID 0, grouped as 3 x 8 SSD LUNs, 7 x 5 SSD LUNs, and 1 x 3 SSD LUN
Virtualization Management: VMware vCenter Server 5.1.0
VMmark version: 2.5
Power Meters: Three Yokogawa WT210

To determine the maximum VMmark load supported for each power management setting, we increased the number of VMmark tiles until the cluster reached saturation, which is defined as the largest number of tiles that still meet Quality of Service (QoS) requirements. All data points are the mean of three tests in each configuration and VMmark scores are normalized to the BIOS Balanced one-tile score.

Effects of Power Management on VMmark 2.5 score

The VMmark scores were equivalent between the Performance setting and the ESXi Balanced setting with less than a 1% difference at all load levels. However, running on the BIOS Balanced setting reduced the VMmark scores an average of 15%. On the BIOS Balanced setting, the environment was no longer able to support nine tiles and, even at low loads, on average, 31% of runs failed QoS requirements; only passing runs are pictured above.

We also compared the improvements in energy efficiency of the two Balanced settings against the Performance setting. The Performance per Kilowatt metric, which is new to VMmark 2.5, models energy efficiency as VMmark score per kilowatt of power consumed. More efficient results will have a higher Performance per Kilowatt.

Effects of Power Management on Energy Efficiency

Two trends are visible in this figure. As expected, the Performance setting showed the lowest energy efficiency. At every load level, ESXi Balanced was about 3% more energy efficient than the Performance setting, despite the fact that it delivered an equivalent score to Performance. The BIOS Balanced setting had the greatest energy efficiency, 20% average improvement over Performance.

Second, increase in load is correlated with greater energy efficiency. As the CPUs become busier, throughput increases at a faster rate than the required power. This can be understood by noting that an idle server will still consume power, but with no work to show for it. A highly utilized server is typically the most energy efficient per request completed, which is confirmed in our results. Higher energy efficiency creates cost savings in host energy consumption and in cooling costs.

The bursty nature of most environments leads them to sometimes idle, so we also measured each host’s idle power consumption. The Performance setting showed an average of 128 watts per host, while ESXi Balanced and BIOS Balanced consumed 85 watts per host. Although the Performance and ESXi Balanced settings performed very similarly under load, hosts using ESXi Balanced and BIOS Balanced power management consumed 33% less power while idle.

VMmark 2.5 scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. We wanted to see how power management policies affected application latency as opposed to the VMmark score. All latencies are normalized to the lowest results.

Effects of Power Management on VMmark 2.5 Latencies

Whereas the Performance and ESXi Balanced latencies tracked closely, BIOS Balanced latencies were significantly higher at all load levels. Furthermore, latencies were unpredictable even at low load levels, and for this reason, 31% of runs between one and eight tiles failed; these runs are omitted from the figure above. For example, half of the BIOS Balanced runs did not pass QoS requirements at four tiles. These higher latencies were the result of aggressive power saving by the BIOS Balanced policy.

Our tests showed that ESXi’s Balanced power management policy didn’t affect throughput or latency compared to the Performance policy, but did improve energy efficiency by 3%. While the BIOS-controlled Balanced policy improved power efficiency by an average of 20% over Performance, it was so aggressive in cutting power that it often caused VMmark to fail QoS requirements.

Overall, the BIOS controlled Balanced policy produced substantial efficiency gains but with unpredictable performance, failed runs, and reduced performance at all load levels. This policy may still be suitable for some workloads which can tolerate this unpredictability, but should be used with caution. On the other hand, the ESXi Balanced policy produced modest efficiency gains while doing an excellent job protecting performance across all load levels. These findings make us confident that the ESXi Balanced policy is a good choice for most types of virtualized applications.

Exploring FAST Cache Performance Using VMmark 2.1.1

A system’s performance is often limited by the access time of its hard disk drive (HDD). Solid-state drives (SSDs), also known as Enterprise Flash Drives (EFDs), tout a superior performance profile to HDDs. In our previous comparison of EFD and HDD technologies using VMmark 2.1, we showed that EFD reads were on average four times faster than HDD reads, while EFD and HDD write speeds were comparable. However, EFDs are more costly per gigabyte.

Many vendors have attempted to address this issue using tiered storage technologies. Here, we tested the performance benefits of EMC’s FAST Cache storage array feature, which merges the strengths of both technologies. FAST Cache is an EFD-based read/write storage cache that supplements the array’s DRAM cache by giving frequently accessed data priority on the high performing EFDs. We used VMmark 2, a multi-host virtualization benchmark, to quantify the performance benefits of FAST Cache. For more details, see the overview, release notes for VMmark 2.1, and release notes for 2.1.1. VMmark 2 is an ideal tool to test FAST Cache performance for virtualized datacenters in that its varied workloads and bursty I/O patterns model the demands of the datacenter. We found that FAST Cache produced remarkable improvements in datacenter capacity and storage access latencies. With the addition of FAST Cache, the system could support twice as much load while still meeting QoS requirements.

FAST Cache
FAST Cache is a feature of EMC’s storage systems that tracks frequently accessed data on disk, promotes the data into an array-wide EFD cache to take advantage of Flash I/O access speeds, then writes it back to disk when the data is superseded in importance. FAST Cache optimizes the use of EFD storage. In most workloads only a small percentage of data will be frequently accessed. This is referred to as the ‘working set.’ An EFD-based cache allows the data in the working set to take advantage of the performance characteristics of EFDs while the rest of the data stays on lower-cost HDDs. Relevant data is rapidly promoted into the cache in increments of 64 KB pages, and a least-recently-used algorithm is used to decide which data to write back to disk.

The benefit achieved with FAST Cache depends on the workload’s I/O profile. As with most caches, FAST Cache will show the most benefit for I/O with a high locality of reference, such as database indices and reference tables. FAST Cache will be least beneficial to workloads with sequential I/O patterns like log files or large I/O size access because these may not access the same 64 KB block multiple times and the FAST Cache would never become populated.

Systems Under Test: Four Dell PowerEdge R310 Servers
CPUs (per server): One Quad-Core Intel® Xeon® X3460 @ 2.8 GHz, Hyper-Threading enabled
Memory (per server): 32 GB DDR3 ECC @ 800 MHz
Storage Array: EMC VNX5500
FAST Cache configurations:
366 GB FAST Cache, 8 EFDs, RAID 1
92 GB FAST Cache, 2 EFDs, RAID 1
FAST Cache disabled
LUN configurations:
20 HDDs, 10K RPM, grouped into 3 LUNs of 8, 8, and 4 HDDs each
11 HDDs, 10K RPM, grouped into 3 LUNs of 4, 4, and 3 HDDs each
Hypervisor: ESXi 5.0.0
Virtualization Management: VMware vCenter Server 5.0
VMmark version: 2.1.1

We used VMmark 2 to investigate several different factors relating to FAST Cache. We wanted to measure the performance benefit afforded by adding FAST Cache into a VMmark 2 environment and we wanted to observe how the performance benefit of FAST Cache would scale as we changed the size of the cache. We tested with FAST Cache disabled and with two different FAST Cache sizes which were made from two EFDs and eight EFDs in RAID 1, creating a cache of 92 GB and 366 GB usable space, respectively. FAST Cache was configured according to best practices to ensure FAST Cache performance was not limited by array bus bandwidth. After the FAST Cache was created, it was warmed up by repeating VMmark 2 runs until scores showed less than 3% variability between runs.

We also wanted to examine whether FAST Cache could reduce the hardware requirements of our tests. As processors and other system hardware components have increased in capacity and speed, there has been greater and greater pressure for corresponding increasing performance from storage. RAID groups of HDDs have been one answer to these increasing performance demands, as RAID arrays provide performance and reliability benefits over individual disks. In typical RAID configurations, performance increases nearly linearly as disks are added to the RAID group. However, adding disks in order to increase storage access speed can result in underutilization of HDD space, which becomes far greater than required. FAST Cache should allow us to reduce the number of HDDs we require for RAID performance benefits, also reducing the cluster’s total power, cooling and space requirements, which results in lower cost. FAST Cache services the bulk of the workloads’ I/O operations at high speeds, so it is acceptable for us to service the remainder of operations at lower speeds and use only as many HDDs as needed for storage capacity rather than performance.

To test whether an environment with FAST Cache and a reduced number of disks could perform as well as an environment without FAST Cache, but with a larger number of disks, we tested performance with two different disk configurations. Workloads were tested on a set of 20 HDDs and then on a set of 11 HDDs, in both cases grouped into three LUNs. Each LUN was in a distinct RAID 0 group. Due to the performance characteristics of RAID 0, we expected the 20 HDD configuration to have better performance to than the 11 HDD configuration. The placement of workloads onto LUNs was meant to model a naïve environment with nonoptimal storage setup. Two LUNs held workload tile data, and the third smaller LUN served as the destination for VM Deploy and Storage vMotion workloads. The first LUN held VMs from the first and third tiles, and the second LUN held VMs from the second and fourth tiles. Running VMmark 2 with more than one tile per LUN was atypical of our best practices for the benchmark. It created a severe bottleneck for the disk, which was meant to simulate the types of storage performance issues we sometimes see in customer environments.

All VMmark 2 tests were conducted on a cluster of four identically configured entry-level Dell Power Edge R310 servers running ESXi 5.0. All components in the environment besides FAST Cache and number of HDDs remained unchanged during testing.

To characterize cluster performance at multiple load levels, we increased the number of tiles until the cluster reached saturation, defined as when the run failed to meet Quality of Service (QoS) requirements. Scaling out the number of tiles until saturation allows us to determine the maximum VMmark 2 load the cluster could support and to compare performance at each level of load for each cache and storage configuration. All data points are the mean of three tests in each configuration. Scaling data was generated by normalizing every score to the lowest passing score, which was 1 tile with FAST Cache disabled on 20 HDDs.

VMmark 2.1.1 Scaling With and Without FAST Cache

With FAST Cache disabled, the 20 HDD LUNs reached saturation at 2 tiles, and the 11 HDD LUNs were unable to support even 1 tile of load. Because all VMs for each tile were placed on the same LUN, a 1 tile run used one LUN, consisting of only four out of 11 HDDs or eight out of 20 HDDs. 4 HDDs were insufficient to provide the required QoS for even 1 tile. When FAST Cache was enabled, the 11 HDD and 20 HDD configurations supported 4 tiles. This is a remarkable improvement; with the addition of FAST Cache, the system could support twice as much load while still meeting QoS requirements. Even at lower load levels, the equivalent system with FAST Cache was allowing greater throughput and showed resulting increases in the VMmark score of 26% at 1 tile and 31% at 2 tiles. With FAST Cache enabled, the configuration with 11 HDDs performed equivalently to one with 20 HDDs until the system approached saturation.

With FAST Cache enabled, the system supported twice as much load on almost half as many disks. The results show that an environment with a 92 GB FAST Cache was able to greatly outperform a HDD-only environment that contains 82% more disks. At 4 tiles with FAST Cache enabled, the cluster’s CPU utilization was approaching saturation, reaching an average of 84%, but was not yet bottlenecked on storage.

In our tests, performance did not scale up very much as we increased FAST Cache size from 92 GB to 366 GB and the number of HDDs from 11 to 20.

VMmark 2.1.1 Scaling with FAST Cache

We can see that all configurations scaled very similarly from 1 to 3 tiles with only minor differences appearing, primarily between the 92 GB FAST Cache and 366 GB FAST Cache. Only at the highest load level did performance begin to diverge. Predictably, the largest cache configurations show the best performance at 4 tiles, followed by the smaller cache configurations. To determine whether this performance falloff was directly attributable to the cache size and number of HDDs, we needed to know whether FAST Cache was performing to capacity.

Below are the FAST Cache and DRAM cache hit percentages for read and write operations at the 4 tile load. On average, our VMmark testing had I/O operations of 24% reads and 76% writes.

Total Cache Hits at 4 TilesRead and Write Cache Hits at 4 tiles
Click to Enlarge

With the 366 GB FAST Cache, nearly all reads and writes were hitting either the DRAM or FAST Cache. In these cases, the number of backing disks did not affect the score because disks were rarely being accessed. At this cache size, all frequently accessed data fit into the FAST Cache. However, with the 92 GB FAST Cache, the cache hit percentage decreased to 96.5% and 92.1% for the 11 HDD and 20 HDD configurations, respectively. This indicated that the entire working set could no longer fit into the 92 GB FAST Cache. The 11 HDD configuration began to show decreased performance relative to 20 HDDs, because although only 3.5% of total I/O operations were going to disk, the increase in disk latency was large enough to reduce throughput and affect VMmark score. Despite this, a FAST Cache of 92 GB was still sufficient to provide us with VMmark performance that met QoS requirements. The higher read hit percentages in the 11 HDD configuration reflected this reduced throughput. Lower throughput resulted in a smaller working set and an accordingly higher read hit percentage.

Overall, FAST Cache did an excellent job of identifying the working set. Although only 8% of the 1.09 TB dataset could fit in the 92 GB cache at any one time, at least 92% of I/O requests were hitting the cache.

Scaling FAST Cache gave us a sense of the working set size of the VMmark benchmark. As performance with the 92 GB FAST Cache demonstrated a knee at 3 tiles, this suggests the working set size at 3 tiles is less than 92 GB and the working set size at 4 tiles is slightly greater than 92 GB. Knowing the approximate working set size per tile would allow us to select the minimum FAST Cache size required if we wanted our entire working set to fit into the FAST Cache, even if we scaled the benchmark to an arbitrary number of tiles in a different cluster.

The results below show that I/O operations per second and I/O latency were affected by our environment characteristics.

I/O Latency at 4 Tiles

The variability in read latency is clearly affected by both FAST Cache size and number of backing HDDs. Latency is highest with only 11 HDDs and the smaller FAST Cache, and decreases as we add HDDs. Latency decreases even more with the larger FAST Cache size as nearly all reads hit the cache. Write latency, however, is relatively constant across configurations, which is as expected because in each configuration nearly all writes are being served by either the DRAM cache or FAST Cache.

It’s clear that we can replace a large number of HDDs with a much smaller number of EFDs and get similar or improved performance results. An array with 11 HDDs and FAST Cache outperformed an array with 20 HDDs without FAST Cache. FAST Cache handles the workloads’ performance requirements so that we need only to supply the HDDs necessary for their storage space, rather than performance capabilities. This allows us to reduce the number of HDDs and their associated power, space, cooling, and cost.

Tiered storage solutions like FAST Cache make excellent use of EFDs, even to the extent that 92% or more of our I/O operations are benefitting from Flash-level latencies while the EFD storage itself holds only 8% of our total data. The increased VMmark scores demonstrate the ability of FAST Cache to pinpoint the most active data remarkably well, and, even in a bursty environment, show incredible improvements in I/O latency and in the load that a cluster can support.  Our testing showed FAST Cache provides Flash-level storage access speeds to the data that needs it most, reduces storage bottlenecking and increases supported load, making FAST Cache a highly valuable addition to the datacenter.