
Tag Archives: vSphere 6.0

Running Transactional Workloads Using Docker Containers on vSphere 6.0

In a series of blogs, we showed that not only can Docker containers seamlessly run inside vSphere 6.0 VMs, but both micro-benchmarks and popular workloads in such configurations perform as well as, and in some cases better than, in native Docker configurations.

See the following blog posts for our past findings:

In this blog, we study a transactional database workload and present the results on how Docker containers perform in a VM when we scale out the number of database instances. To do this experiment, we use DVD Store 2.1, which is an OLTP benchmark that supports and stresses many different back-end databases including Microsoft SQL Server, Oracle Database, MySQL, and PostgreSQL. This benchmark is open source and the latest version 2.1 is available here. It is a 3-tier application with a Web server, an application server, and a backend database server. The benchmark simulates a DVD store, where customers log in, browse, and order DVD products. The tool is designed to utilize a number of advanced database features including transactions, stored procedures, triggers, and referential integrity. The main transactions are (1) add new customers, (2) log in customers, (3) browse DVDs, (4) enter purchase orders, and (5) re-order stock. The client driver is written in C# and is usually run from Windows; however, the client can be run on Linux using the Mono framework. The primary performance metric of the benchmark is orders per minute (OPM).

In our experiments, we used a PostgreSQL database with an Apache Web server, and the application logic was implemented in PHP. In all tests we ran 16 instances of DVD Store, where each instance comprises all 3 tiers. We found that, due to better scheduling, running Docker in a VM in a scale-out scenario can provide better throughput than running a Docker container on a native system.

Next, we present the configurations, benchmarks, detailed setup, and the performance results.

Deployment Scenarios

We compare four different scenarios, as illustrated below:

–   Native: Linux OS running directly on hardware (Ubuntu 14.04.1)

–   vSphere VM: vSphere 6.0 with VMFS5, in 8 VMs, each with the same guest OS as native

–   Native-Docker: Docker 1.5 running on a native OS (Ubuntu 14.04.1)

–   VM-Docker: Docker 1.5 running in each of 8 VMs on a vSphere host

In each configuration, all of the power management features were disabled in the BIOS.

Hardware/Software/Workload Configuration

Figure 1 below shows the hardware setup for the server host. We used Ubuntu 14.04.1 with Docker 1.5 for all our experiments. When running the Docker configurations, we used bridged networking and host volumes for storing the database.


Figure 1. Hardware/software configuration

Performance Results

We ran 16 instances of DVD Store, where each instance was running an Apache web server, PHP application logic, and a PostgreSQL database. In the Docker cases, we ran one instance of DVD Store per Docker container. In the non-Docker cases, we used the Virtual Hosts functionality of Apache to run many instances of a Web server listening on different ports. We also used the PostgreSQL command line to create different instances of the database server listening on different ports. In the VM-based experiments, we partitioned the host hardware between 8 VMs, where each VM ran 2 DVD Store instances. The 8 VMs exactly committed the CPUs, and under-committed the memory.

The four configurations for our experiments are listed below.


  • Native-16S: 16 instances of DVD Store running natively (16 separate instances of Apache 2 using virtual hosts and 16 separate instances of PostgreSQL database)
  • Native-Docker-16S: 16 Docker containers running on a native machine with each running one instance of DVD Store.
  • VM-8VMs-16S: Eight 4-vCPU VMs each running 2 DVD Store instances
  • VM-Docker-8VMs-16S: Eight 4-vCPU VMs each running 2 Docker containers,  where each Docker container is running one instance of DVD Store

We ran the DVD Store benchmark for the 4 configurations using 16 client drivers, where each driver process was running 4 threads. The results for these 4 configurations are shown in the figure below.


Figure 2. DVD Store performance for different configurations

In the chart above, the y-axis shows the aggregate DVD Store performance metric, orders per minute (OPM), for all 16 instances. We have normalized the OPM results with respect to the native configuration, where we saw about 126k orders per minute. From the chart, we see that the VM configurations achieve higher throughput than the corresponding native configurations. As in prior blogs, this is due to better NUMA-aware scheduling in vSphere. We can also see that running Docker containers, either natively or in VMs, adds little overhead (2-4%).
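As a rough illustration of how the normalized values in such a chart can be derived, the sketch below sums per-instance OPM and divides by the native baseline. Only the approximate native figure of 126k OPM comes from this post; the per-instance numbers and the function names are hypothetical.

```python
# Approximate aggregate OPM of the native configuration, as reported in the post.
NATIVE_OPM = 126_000


def aggregate_opm(per_instance_opm):
    """Sum the orders per minute reported by each DVD Store instance."""
    return sum(per_instance_opm)


def normalize(opm, baseline=NATIVE_OPM):
    """Express an aggregate OPM as a ratio to the native baseline."""
    return opm / baseline


# Hypothetical per-instance results for one 16-instance run (not measured data).
example_run = [8_000] * 16
total = aggregate_opm(example_run)
print(total, round(normalize(total), 3))
```

A value above 1.0 means the configuration delivered more aggregate throughput than native; a value below 1.0 means less.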

To find out why the native configurations were not doing better, we pinned half of the Docker containers to one NUMA node and half to the other. As expected, the aggregate DVD Store OPM improved and was slightly better than in the VM configuration. However, manually pinning processes to cores or sockets is usually not a recommended practice because it is error-prone and can, in general, lead to unexpected or suboptimal results.
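The post does not say how the pinning was carried out. As a minimal sketch of the placement decision only, the code below splits a list of containers evenly across NUMA nodes. The topology, the container names, and the function name are all assumptions, and handing the resulting CPU lists to the container runtime is deliberately left out.

```python
def split_across_nodes(containers, node_cpus):
    """Assign containers to NUMA nodes in equal contiguous slices.

    containers: list of container names
    node_cpus:  dict mapping NUMA node id -> list of CPU ids on that node
    Returns a dict mapping container name -> (node id, CPU ids of that node).
    """
    nodes = sorted(node_cpus)
    per_node = -(-len(containers) // len(nodes))  # ceiling division
    placement = {}
    for index, name in enumerate(containers):
        node = nodes[index // per_node]
        placement[name] = (node, node_cpus[node])
    return placement


# Hypothetical two-node topology with 16 logical CPUs per node.
topology = {0: list(range(0, 16)), 1: list(range(16, 32))}
# Hypothetical container names, one per DVD Store instance.
names = [f"dvdstore-{i:02d}" for i in range(16)]
placement = split_across_nodes(names, topology)
```

With 16 containers and two nodes, the first eight land on node 0 and the last eight on node 1, so each container's threads and memory can stay on a single node.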


In this blog, we showed that running a PostgreSQL transactional database in a Docker container in a vSphere VM adds very little performance cost compared to running directly in the VM. We also found that running Docker containers in a set of 8 VMs achieves slightly better throughput than running the same Docker containers natively with an out-of-the-box configuration. This is further proof that VMs and Docker containers are truly “better together.”



Virtualized Storage Performance: RAID Groups versus Storage pools

RAID, a redundant array of independent disks, has traditionally been the foundation of enterprise storage. Grouping multiple disks into one logical unit can vastly increase the availability and performance of storage by protecting against disk failure, allowing greater I/O parallelism, and pooling capacity. Storage pools similarly increase the capacity and performance of storage, but are easier to configure and manage than RAID groups.

RAID groups have traditionally been regarded as offering better and more predictable performance than storage pools. Although both technologies were developed for magnetic hard disk drives (HDDs), solid-state drives (SSDs), which use flash memory, have become prevalent. Virtualized environments are also common and tend to create highly randomized I/O given the fact that multiple workloads are run simultaneously.

We set out to see how the performance of RAID group and storage pool provisioning methods compare in today’s virtualized environments.

First, let’s take a closer look at each storage provisioning type.

RAID Groups

A RAID group unifies a number of disks into one logical unit and distributes data across multiple drives. RAID groups can be configured with a particular protection level depending on the performance, capacity, and redundancy needs of the environment. LUNs are then allocated from the RAID group. RAID groups typically contain only identical drives, and the maximum number of disks in a RAID group varies by system model but is generally below fifty. Because drives typically have well defined performance characteristics, the overall RAID group performance can be calculated as the performance of all drives in the group minus the RAID overhead. To provide consistent performance, workloads with different I/O profiles (e.g., sequential vs. random I/O) or different performance needs should be physically isolated in different RAID groups so they do not share disks.
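The "performance of all drives minus the RAID overhead" rule can be sketched with the commonly used write-penalty approximation, in which each small random host write costs several back-end I/Os (four for RAID 5). This is a back-of-the-envelope model, not the array vendor's sizing method; the per-drive IOPS figure is hypothetical, and caching and controller limits are ignored.

```python
# Commonly cited back-end I/Os per small random host write, by RAID level.
WRITE_PENALTY = {"raid0": 1, "raid10": 2, "raid5": 4, "raid6": 6}


def raid_group_iops(drives, iops_per_drive, read_fraction, level="raid5"):
    """Estimate host-visible IOPS for a RAID group under a random workload."""
    raw = drives * iops_per_drive
    write_fraction = 1.0 - read_fraction
    # Each host read costs one back-end I/O; each host write costs `penalty`.
    backend_ios_per_host_io = read_fraction + write_fraction * WRITE_PENALTY[level]
    return raw / backend_ios_per_host_io


# Hypothetical example: 15 drives at 1,000 IOPS each, 70% reads, RAID 5.
estimate = raid_group_iops(15, 1_000, 0.7)
print(round(estimate))
```

Under these assumed numbers the group delivers roughly half of its raw aggregate IOPS, which shows how strongly the write share of a workload shapes RAID 5 performance.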

Storage Pools

Storage pools, or simply ‘pools’, are very similar to RAID groups in some ways. Implementation varies by vendor, but generally pools are made up of one or more private RAID groups, which are not visible to the user, or they are composed of user-configured RAID groups which are added manually to the pool. LUNs are then allocated from the pool. Storage pools can contain up to hundreds of drives, often all the drives in an array. As business needs grow, storage pools can be easily scaled up by adding drives or RAID groups and expanding LUN capacity. Storage pools can contain multiple types and sizes of drives and can spread workloads over more drives for a greater degree of parallelism.

Storage pools are usually required for array features like automated storage tiering, where faster SSDs can serve as a data cache among a larger group of HDDs, as well as other array-level data services like compression, deduplication, and thin provisioning. Because of their larger maximum size, storage pools, unlike RAID groups, can take advantage of vSphere 6 maximum LUN sizes of 64TB.

We used two benchmarks to compare the performance of RAID groups and storage pools: VMmark, which is a virtualization platform benchmark, and I/O Analyzer with Iometer, which is a storage microbenchmark.  VMmark is a multi-host virtualization benchmark that uses diverse application workloads as well as common platform level workloads to model the demands of the datacenter. VMs running a complete set of the application workloads are grouped into units of load called tiles. For more details, see the VMmark 2.5 overview. Iometer places high levels of load on the disk, but does not stress any other system resources. Together, these benchmarks give us both a ‘real-world’ and a more focused perspective on storage performance.

VMmark Testing

Array Configuration

Testing was conducted on an EMC VNX5800 block storage SAN with Fibre Channel. This was one of the many storage solutions which offered both RAID group and storage pool technologies. Disks were 200GB single-level cell (SLC) SSDs. Storage configuration followed array best practices, including balancing LUNs across Storage Processors and ensuring that RAID groups and LUNs did not span the array bus. One way to optimize SSD performance is to leave up to 50% of the SSD capacity unutilized, also known as overprovisioning. To follow this best practice, 50% of the RAID group or storage pool was not allocated to any LUN. Since overprovisioning SSDs can be an expensive proposition, we also tested the same configuration with 100% of the storage pool or RAID group allocated.

RAID Group Configuration

Four RAID 5 groups were used, each composed of 15 SSDs. RAID 5 was selected for its suitability for general purpose workloads. RAID 5 provides tolerance against a single disk failure. For best performance and capacity, RAID 5 groups should be sized to multiples of five or nine drives, so this group maintains a multiple of the preferred five-drive count. One LUN was created in each of the four RAID groups. The LUN was sized to either 50% of the RAID group (Best Practices) or 100% (Fully Allocated). For testing, the capacity of each LUN was fully utilized by VMmark virtual machines and randomized data.
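To make the sizing concrete, the sketch below computes the nominal capacity of one 15-drive RAID 5 group of 200GB SSDs and the LUN size at each allocation level. It assumes one drive's worth of capacity per group goes to parity and ignores formatting and array overhead, so real usable capacity will be lower; the function names are illustrative.

```python
def raid5_usable_gb(drives, drive_gb):
    """Nominal usable capacity of one RAID 5 group (one drive's worth is parity)."""
    if drives < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (drives - 1) * drive_gb


def lun_size_gb(group_usable_gb, allocated_fraction):
    """Size of a LUN that takes a given fraction of the group's capacity."""
    return group_usable_gb * allocated_fraction


group_gb = raid5_usable_gb(15, 200)            # one of the four RAID groups
best_practices_gb = lun_size_gb(group_gb, 0.5)  # 50% allocated
fully_allocated_gb = lun_size_gb(group_gb, 1.0)  # 100% allocated
print(group_gb, best_practices_gb, fully_allocated_gb)
```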

Figures: RAID group configuration and storage pool configuration for the VMmark storage comparison

Storage Pool Configuration

A single RAID 5 Storage Pool containing all 60 SSDs was used. Four thick LUNs were allocated from the pool, meaning that all of the storage space was reserved on the volume. LUNs were equivalent in size and consumed a total of either 50% (Best Practices) or 100% (Fully Allocated) of the pool capacity.

Storage Layout

Most of the VMmark storage load was created by two types of virtual machines: database (DVD Store) and mail server (Microsoft Exchange). These virtual machines were isolated on two different LUNs. The remaining virtual machines were spread across the remaining two LUNs. That is, in the RAID group case, storage-heavy workloads were physically isolated in different RAID groups, but in the storage pool case, all workloads shared the same pool.

Systems Under Test: Two Dell PowerEdge R720 servers
Configuration Per Server:  
     Virtualization Platform: VMware vSphere 6.0. VMs used virtual hardware version 11 and current VMware Tools.
     CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled.
     Memory: 256GB ECC DDR3 @ 1866MHz
     Host Bus Adapter: QLogic ISP2532 DualPort 8Gb Fibre Channel to PCI Express
     Network Controller: One Intel 82599EB dual-port 10 Gigabit PCIe Adapter, one Intel I350 Dual-Port Gigabit PCIe Adapter

Each configuration was tested at three different load points: 1 tile (the lowest load level), 7 tiles (an approximate mid-point), and 13 tiles, which was the maximum number of tiles that still met Quality of Service (QoS) requirements. All datapoints represent the mean of two tests of each configuration.

VMmark Results

RAID Group vs. Storage Pool Performance comparison using VMmark benchmark

Across all load levels tested, the VMmark performance score, which is a function of application throughput, was similar regardless of storage provisioning type. Neither the storage type used nor the capacity allocated affected throughput.

VMmark 2.5 performance scores are based on application and infrastructure workload throughput, while application latency reflects Quality of Service. For the Mail Server, Olio, and DVD Store 2 workloads, latency is defined as the application’s response time. We wanted to see how storage configuration affected application latency as opposed to the VMmark score. All latencies are normalized to the lowest 1-tile results.

Storage configuration did not affect VMmark application latencies.

Application Latency in VMmark Storage Comparison RAID Group vs Storage Pool

Lastly, we measured read and write I/O latencies: esxtop Average Guest MilliSec/Write and Average Guest MilliSec/Read. This is the round trip I/O latency as seen by the Guest operating system.

VMmark Storage Latency Storage Comparison RAID Group vs Storage Pool

No differences emerged in I/O latencies.

I/O Analyzer with Iometer Testing

In the second set of experiments, we wanted to see if we would find similar results while testing storage using a synthetic microbenchmark. I/O Analyzer is a tool which uses Iometer to drive load on a Linux-based virtual machine then collates the performance results. The benefit of using a microbenchmark like Iometer is that it places heavy load on just the storage subsystem, ensuring that no other subsystem is the bottleneck.


Testing used a VNX5800 array and RAID 5 level as in the prior configuration, but all storage configurations spanned 9 SSDs, also a preferred drive count. In contrast to the prior test, the storage pool or RAID group spanned an identical number of disks, so that the number of disks per LUN was the same in both configurations. Testing used nine disks per LUN to achieve greater load on each disk.

The LUN was sized to either 50% or 100% of the storage group. The LUN capacity was fully occupied with the I/O Analyzer worker VM and randomized data.  The I/O Analyzer Controller VM, which initiates the benchmark, was located on a separate array and host.

Storage Configuration Iometer with Storage Pool and RAID Group

Testing used one I/O Analyzer worker VM. One Iometer worker thread drove storage load. The size of the VM’s virtual disk determines the size of the active dataset, so a 100GB thick-provisioned virtual disk on VMFS-5 was chosen to maximize I/O to the disk and minimize caching. We tested at a medium load level using a plausible datacenter I/O profile, understanding, however, that any static I/O profile will be a broad generalization of real-life workloads.

Iometer Configuration

  • 1 vCPU, 2GB memory
  • 70% read, 30% write
  • 100% random I/O to model the “I/O blender effect” in a virtualized environment
  • 4KB block size
  • I/O aligned to sector boundaries
  • 64 outstanding I/O
  • 60 minute warm up period, 60 minute measurement period
Systems Under Test: One Dell PowerEdge R720 server
Configuration Per Server:  
     Virtualization Platform: VMware vSphere 6.0. Worker VM used the I/O Analyzer default virtual hardware version 7.
     CPUs: Two 12-core Intel® Xeon® E5-2697 v2 @ 2.7 GHz, Turbo Boost Enabled, up to 3.5 GHz, Hyper-Threading enabled.
     Memory: 256GB ECC DDR3 @ 1866MHz
     Host Bus Adapter: QLogic ISP2532 DualPort 8Gb Fibre Channel to PCI Express
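With a fixed number of outstanding I/Os, IOPS and mean latency are tied together by Little's Law (outstanding I/Os = IOPS × latency), and throughput follows from the block size. The sketch below applies this to the Iometer settings above; it assumes the queue stays full in steady state, and the IOPS value is hypothetical rather than a measured result.

```python
BLOCK_SIZE_BYTES = 4 * 1024  # 4KB block size from the Iometer configuration
OUTSTANDING_IOS = 64         # outstanding I/Os from the Iometer configuration


def mean_latency_ms(iops, outstanding=OUTSTANDING_IOS):
    """Mean I/O latency implied by Little's Law for a constantly full queue."""
    return outstanding / iops * 1000.0


def throughput_mib_s(iops, block_size=BLOCK_SIZE_BYTES):
    """Data rate in MiB/s for a given IOPS at a fixed block size."""
    return iops * block_size / (1024 * 1024)


hypothetical_iops = 32_000  # illustrative only, not taken from the results
print(mean_latency_ms(hypothetical_iops), throughput_mib_s(hypothetical_iops))
```

This is why latency and throughput charts from a fixed-queue-depth test tend to mirror each other: at 64 outstanding I/Os, a higher IOPS figure necessarily means a lower mean latency.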

Iometer results

Iometer Latency Results Storage Comparison RAID Group vs Storage Pool

Iometer Throughput Results Storage Comparison RAID Group vs Storage Pool

In Iometer testing, the storage pool showed slightly better performance than the RAID group, and the amount of capacity allocated did not affect performance.

In both our multi-workload and synthetic microbenchmark scenarios, we did not observe any performance penalty of choosing storage pools over RAID groups on an all-SSD array, even when disparate workloads shared the same storage pool. We also did not find any performance benefit at the application or I/O level from leaving unallocated capacity, or overprovisioning, SSD RAID groups or storage pools. Given the ease of management and feature-based benefits of storage pools, including automated storage tiering, compression, deduplication, and thin provisioning, storage pools are an excellent choice in today’s datacenters.

SQL Server VM Performance on VMware vSphere 6

Last October, I blogged about SQL Server performance with vSphere 5.5 using a four-socket Intel Xeon processor E7 based host.  Now that vSphere 6 is available, I’ve run an updated set of tests using this new release, on an even more powerful host, with Xeon E7 v2 processors.  A variety of virtual CPU (vCPU) and virtual machine (VM) quantities were tested to show that vSphere can handle hundreds of thousands of online transaction processing (OLTP) database operations per minute.

DVD Store 2.1, an open-source OLTP database stress tool, was the workload used to stress the VMs.  The first experiment in the paper was a generational performance comparison between the old and new setups; as you can see, there is a dramatic increase in throughput, even though the size of each VM has doubled from 8 vCPUs per VM to 16:

Generational performance improvement from old study to new study

There are also tests using CPU affinity to show the performance differences between physical cores and logical processors (Hyper-Threads), the benefit of “right-sizing” virtual machines, and measuring the impact of the advanced Latency Sensitivity setting. 

For more details and the test results, please download the whitepaper: Performance Characterization of Microsoft SQL Server on VMware vSphere 6.

VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines

vSphere 6 introduces the ability to run virtual machines (VMs) with up to 128 virtual CPUs (vCPUs) and 4TB of RAM. This doubles the number of vCPUs supported in the previous version and quadruples the amount of RAM. This new capability provides the potential for customers to run larger workloads than ever before in a virtual machine.

A series of tests were run with a virtual machine hosting Oracle 12c database instances. The DVD Store 2.1 open-source transactional workload was used to measure the performance of a large “Monster” VM on vSphere 6. The Oracle 12c database VM was scaled from 15 vCPUs all the way up to 120 vCPUs, and the maximum achieved throughput was measured. The full results and test details have been published in a white paper – VMware vSphere 6 and Oracle 12c Scalability Study: Scaling Monster Virtual Machines.

A four-socket Intel Xeon E7-4890 v2 processor based server with 1TB of memory was used to host the virtual machine for the tests.  Each Xeon E7-4890 v2 processor has 15 cores / 30 threads with Hyper Threading enabled for a total of 60 cores / 120 threads for the system. The diagram below shows the basic test configuration.



In all tests Hyper-Threading was enabled on the server, but in configurations where 60 vCPUs or fewer are assigned to the VM, Hyper-Threads are not used by the VM. This is a result of the default scheduling policy, where the preference is for vCPUs to be scheduled on one thread per core before using the second thread of any core. This first set of results, shown below, is focused on the tests that scale up to 60 vCPUs. These tests show the scaling for the virtual machine without the use of Hyper-Threads.


While vSphere 6 supports up to 128 vCPUs per VM, these tests were limited to 120 vCPUs due to the number of threads available on the server. The largest VM configuration used both hardware execution threads (Hyper-Threads) on all the processor cores in order to reach 120 vCPUs. In this case, there is one vCPU per execution thread.

Hyper-Threading doubles the number of execution threads, but it does not double performance. In order to measure the scale-up performance of the 120-vCPU VM, a 60-vCPU VM was configured with CPU affinity so that it was limited to only two of the server’s four sockets. In this configuration the 60-vCPU VM has one vCPU per execution thread, which is the same as the 120-vCPU VM.  Configuring a 60-vCPU VM in this way makes it easy to see the scale up performance at 120 vCPUs on this server with hyper-threads enabled.

The results of the scale-up testing using the 60-vCPU VM configured with CPU affinity to only 2 sockets and the 120-vCPU VM using all four sockets showed approximately linear scaling, as shown in the graph below.
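The thread counts behind this comparison, and one way to quantify "approximately linear", can be sketched as follows. The socket, core, and thread counts come from the server description above; the throughput values and the function names are hypothetical and only illustrate the calculation.

```python
CORES_PER_SOCKET = 15  # Xeon E7-4890 v2
THREADS_PER_CORE = 2   # Hyper-Threading enabled


def hardware_threads(sockets):
    """Execution threads available on the given number of sockets."""
    return sockets * CORES_PER_SOCKET * THREADS_PER_CORE


def scaling_efficiency(small_tput, large_tput, small_vcpus, large_vcpus):
    """Throughput gain divided by vCPU gain; 1.0 means perfectly linear."""
    return (large_tput / small_tput) / (large_vcpus / small_vcpus)


# The 60-vCPU VM pinned to two sockets and the 120-vCPU VM on four sockets
# both run one vCPU per execution thread.
assert hardware_threads(2) == 60
assert hardware_threads(4) == 120

# Hypothetical throughput values (arbitrary units), not measured data.
print(scaling_efficiency(100.0, 190.0, 60, 120))
```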


For full test details and more test results, please see the white paper that was recently published.

The new larger “Monster” VM support in vSphere 6 allows for virtual machines that can support larger workloads than ever before with excellent performance. These tests show that large virtual machines running on vSphere 6 can scale up as needed to meet extreme performance demands.


Improvements in Network I/O Control for vSphere 6

Network I/O Control (NetIOC) in VMware vSphere 6 has been enhanced to support a number of exciting new features such as bandwidth reservations. A new paper published by the Performance Engineering team shows the performance of these new features. The paper also explores the performance impact of the new NetIOC algorithm. Later tests show that NetIOC offers a powerful way to achieve network resource isolation at minimal cost, in terms of latency and CPU utilization.

You can read the paper here.


Virtualized Hadoop Performance with vSphere 6

A recently published whitepaper shows that not only can vSphere 6 keep up with newer high-performance servers, it thrives on their capabilities.

Two years ago, Hadoop benchmarks were run with vSphere 5.1 on a cluster of 32 dual-socket, quad-core servers. Very good performance was demonstrated, with the optimal virtualized configuration shown to be actually 2% faster than native for TeraSort (see the previous whitepaper).

These benchmarks were recently run on a cluster of the same size, but with ten-core processors, more disks and memory, dual 10GbE networking, and vSphere 6. The maximum dataset size was almost quadrupled to 30TB, to ensure that it is much bigger than the total memory in the cluster (hence qualifying the test as Big Data, by one definition).

The results, summarized in the chart below, show that the optimal virtualized configuration now delivers 12% better performance than native for TeraSort. The primary reason for this excellent performance is the ability of vSphere to map physical hardware resources to virtual hardware that is optimized for scale-out applications. The observed trend, as well as theory based on processor characteristics, indicates that the importance of being able to do this mapping correctly increases as processors become more powerful. The sub-optimal performance of one of the tests is due to the combination of very small VMs and how Hadoop does replication during data creation. On the other hand, small VMs are very advantageous for read-dominated applications, which are typically more common. Taken together with other best practices discussed in the paper, this information can be used to configure Hadoop clusters for the highest levels of performance. Despite all the hardware and software changes over the past two years, the optimal configuration was still found to be four VMs per dual-socket host.

Please take a look at the whitepaper for more details on how these benchmarks were run and for analyses on why certain virtual configurations perform so well.

Scaling Out Redis Performance with Docker on vSphere 6.0

by Davide Bergamasco

In an earlier VROOM! post we discussed, among other things, the performance of the Redis in-memory key-value store in a Docker/vSphere environment. In that post we focused on a single instance of a Redis server subject to a more or less artificial workload with the goal of assessing the absolute performance of said instance under various deployment scenarios.

In this post we are taking a different point of view, which is maximizing the throughput of multiple Redis instances running on a “large” server under a more realistic workload. Why are we interested in this perspective?  Conceptually, Redis is an extremely simple application, being just a thin layer of code implementing a large hash table on top of system calls.  From the implementation standpoint, a single-threaded event loop services requests from the clients in a polling fashion. The problem with this design is that it is not suitable for “scaling up”; that is, improving performance by using multiple cores. Modern servers have many processing cores (up to 80) and possibly terabytes of memory.  However, Redis can only access that memory at the speed of a single core.

This problem can be solved by “scaling out” Redis; that is, by partitioning the server memory across multiple Redis instances and running each of those on a different core.  This can be achieved by using a set of load balancers to fragment the key space and distribute the load among the various instances. The diagram shown in Figure 1 illustrates this concept.

Figure 1. Redis Scale Out Setup


Host H3 runs the various Redis server instances (red boxes), while Host H2 runs two sets of load balancers:

  • The green boxes are the Redis load balancers, which partition the key space using a consistent hashing algorithm.  We leveraged the Twemproxy OSS project to implement the Redis load balancers.
  • The yellow boxes are TCP load balancers, which distribute the load across the Redis load balancers in a round robin fashion. We used the HAProxy OSS project to implement the TCP load balancers.
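To illustrate the idea of partitioning a key space with consistent hashing, here is a minimal hash ring with virtual nodes. It is a generic sketch, not Twemproxy's implementation: the hash function, the number of virtual nodes, and the server names are all assumptions chosen for brevity.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, servers, replicas=100):
        # Each server is placed on the ring at `replicas` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{server}#{replica}"), server)
            for server in servers
            for replica in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def lookup(self, key):
        """Return the server owning `key`: the first ring point after its hash."""
        index = bisect.bisect_right(self._points, self._hash(key))
        return self._ring[index % len(self._ring)][1]


# Hypothetical names for the eight Redis instances.
servers = [f"redis-{i}" for i in range(8)]
ring = HashRing(servers)
print(ring.lookup("user:42"))
```

The useful property is that removing one server only remaps the keys that server owned; keys owned by the other seven instances stay where they were.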

Finally, Host H1 runs the load generators (dark blue boxes); that is, the standard benchmark redis-benchmark.

Deployment Scenarios

We assessed the performance of this design across a set of deployment scenarios analogous to what we considered in the previous post. These are listed below and illustrated in Figure 2:

  • Native: Redis instances are run as 8 separate processes on the Linux OS running directly on Host H3 hardware.
  • VM: Redis instances are run inside eight 2-vCPU VMs running on a pre-release build of vSphere 6.0.0 running on Host H3 hardware; the guest OS is the same as in the Native scenario.
  • Native-Docker: Redis instances are run inside 8 Docker containers running on the Native OS.
  • VM-Docker: Redis instances are run inside Docker containers each running inside the same VMs as the VM scenario, with one container per VM.

Figure 2. Different deployment scenarios

Hardware/Software/Workload Configuration

The following are the details about the hardware, software, and workload used in the various experiments discussed in the next section:


Hardware:

  • HP ProLiant DL380e Gen8
  • CPU: 2 x Intel® Xeon® CPU E5-2470 0 @ 2.30GHz (16 cores, 32 hyper-threads total)
  • Memory: 96GB
  • Hardware configuration: Hyper-Threading ON, turbo-boost OFF, power policy: Static High (no power management)
  • Network: 10GbE
  • Storage: 8 x 500GB 15,000 RPM 6Gb SAS disks, HP H220 host bus adapter

Linux OS:

  • CentOS 7
  • Kernel 3.18.1 (CentOS 7 comes with 3.10.0, but we wanted to use the latest kernel available at the time of this writing)
  • Docker 1.2

Hypervisor:

  • VMware vSphere 6.0.0 (pre-release build)

Virtual machines:

  • 8 x 2-vCPU, 11GB (VM scenario)
  • Virtual NIC: vmxnet3
  • Virtual HBA: LSI-SAS

Application:

  • Redis 2.8.13
  • AOF persistency with “everysec” flush policy (every operation that mutates a key is logged into an Append Only File in order to enable data recovery after a crash; the buffer cache is flushed every second, so with this durability policy at most one second worth of data can be lost)

Workload:

  • Keyspace: 250 million keys, value size 1 byte (this size has been chosen to prevent network or storage from becoming bottlenecks from the bandwidth perspective)
  • 8 redis-benchmark instances each simulating 100 clients with a pipeline depth of 30 requests
  • Operations mix: 75% GET, 25% SET
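As a hedged sketch of how settings like these map onto command lines, the code below builds argument lists for a `redis-server` instance with AOF persistence and the "everysec" policy, and for a `redis-benchmark` client with 100 connections, a pipeline depth of 30, and 1-byte values. The ports, host name, and AOF file names are hypothetical, and the 250-million-key keyspace and the 75%/25% GET/SET mix are not expressed here because the post does not say how they were configured.

```python
def redis_server_command(port):
    """Arguments for one Redis instance with AOF and the 'everysec' flush policy."""
    return [
        "redis-server",
        "--port", str(port),
        "--appendonly", "yes",
        "--appendfsync", "everysec",
        # A per-instance file name (hypothetical) keeps instances from sharing an AOF.
        "--appendfilename", f"appendonly-{port}.aof",
    ]


def redis_benchmark_command(host, port, clients=100, pipeline=30, value_bytes=1):
    """Arguments for one load generator: 100 clients, pipeline depth 30, 1-byte values."""
    return [
        "redis-benchmark",
        "-h", host, "-p", str(port),
        "-c", str(clients),
        "-P", str(pipeline),
        "-d", str(value_bytes),
    ]


# Hypothetical ports for the eight server instances.
server_commands = [redis_server_command(6379 + i) for i in range(8)]
# Hypothetical address of a TCP load balancer on Host H2.
client_command = redis_benchmark_command("h2.example", 7000)
print(" ".join(server_commands[0]))
print(" ".join(client_command))
```

The lists only describe the commands; they can be passed to a process launcher such as `subprocess.run` on a machine where Redis is installed.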


We ran two sets of experiments for every scenario listed in the “Deployment Scenarios” section. The first set was meant to establish a baseline by having a single redis-benchmark instance generating requests directly against a single redis-server instance. The second set aimed at assessing the overall performance of the Redis scale-out system we presented earlier. The results of these two sets of experiments are shown in Figure 3, where each bar represents the throughput in operations per second averaged over five trials, and error bars indicate the range of the measured values.

Figure 3. Results of the performance experiments (Y-axis represents throughput in 1,000 operations per second)


Nothing really surprising can be noticed looking at the results of the baseline experiments (labeled “1 Server – 1 Client” in Figure 3).  The Native scenario is obviously the fastest in terms of operations per second, followed by the Docker, VM, and the Docker-VM scenarios. This is expected as both virtualization and containerization add some overhead on top of the bare-metal performance.

Looking at the scale-out experiments (labeled “Scale-Out” in Figure 3), we see a surprisingly different picture. The VM scenario is now the fastest, followed by Docker-VM, while the Native and Docker scenarios come in as a somewhat distant third and fourth.  This unexpected result can be explained by looking at the Host H3 CPU activity during an experiment run.  In the Native and Docker scenarios, notice that the CPU load is spread over the 16 cores.  This means that even though only 8 threads are active (the 8 redis-server instances), the Linux scheduler is continuously migrating them.  This might result in a large number of cross-NUMA node memory accesses, which are substantially more expensive than same-NUMA node accesses. Also, irqbalance is spreading the network card interrupts across all the 16 cores, additionally contributing to the above phenomenon.

In the VM and Docker-VM scenarios, this does not occur because the ESXi scheduler goes to great lengths to keep both the memory and vCPUs of a VM on one NUMA node.  Also, with the PVSCSI virtual device, the virtual interrupts are always routed to the same vCPU(s) that initiated an I/O, and this minimizes interrupt migrations.

We tried to eliminate the cross-NUMA node memory activity in the Native scenario by pinning all the redis-server processes to the cores of the same CPU; that is, to the same NUMA node. We also disabled irqbalance and manually pinned the interrupt vectors to the same set of cores. As expected, with this ad-hoc configuration, the Native scenario was the fastest, reaching 3.408 million operations per second. Without any pinning, the VM result is only 4% slower than the optimized Native performance. (Notice that introducing artificial affinity between processes/interrupt vectors and cores is not a recommended practice as it is error-prone and can, in general, lead to unexpected or suboptimal results.)
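The post does not describe the tooling used for this pinning. One possible approach on Linux, sketched below under that assumption, reads a NUMA node's CPU list from sysfs and applies it to a set of processes with `os.sched_setaffinity`. The helper names are illustrative, and interrupt affinity is not covered.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-7,16-23' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(pids, node):
    """Restrict each process to the CPUs of one NUMA node (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as handle:
        cpus = parse_cpulist(handle.read())
    for pid in pids:
        os.sched_setaffinity(pid, cpus)
    return cpus


print(sorted(parse_cpulist("0-3,8")))
```

Calling `pin_to_node(redis_pids, 0)` with a list of redis-server process ids would confine all of them to node 0, which is the kind of manual affinity the post cautions against for production use.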

Our initial experiments were conducted with the CentOS 7 stock kernel (3.10.0), which unfortunately is not particularly recent. We thought it was prudent to verify if the Linux scheduler had been improved to avoid the inter-NUMA node thread migrations in more recent kernel versions.  Hence, we re-ran all the experiments with the latest version (at the time of this writing, 3.18.1), but we didn’t notice any significant difference with respect to version 3.10.0.

We thought it would be interesting to look at the performance numbers in terms of speedup; that is, the ratio between the throughput of the scale-out system and the throughput of the baseline 1 Server – 1 Client setup. Figure 4 below shows the speedup for the four scenarios considered in this study.

Figure 4: Speedup (Y-axis represents speedup with respect to baseline)


The speedup essentially tells, in relative terms, how much the performance has improved by deploying 8 Redis instances on the same host instead of a single one. If the system scaled linearly, it would have achieved a maximum theoretical speedup of 8. In practice, this limit could not be achieved because of extra overheads introduced by the load balancers and possible resource contention across the Redis instances running on Host H3 (this host is running close to saturation, as the overall CPU utilization is consistently between 75% and 85% during the experiment’s execution). In any case, the scale-out system delivers a performance boost of at least 4x as compared to running a single Redis instance with exactly the same memory capacity. The VM and Docker-VM scenarios achieve a substantially larger speedup because of the cross-NUMA memory access issue afflicting the Native and Docker scenarios.
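The speedup calculation itself is simple, and a small sketch makes the relationship to the theoretical maximum explicit. The throughput numbers below are hypothetical and chosen only to show the arithmetic; they are not the measured values from Figure 3.

```python
INSTANCES = 8  # Redis instances in the scale-out system


def speedup(scale_out_ops, baseline_ops):
    """Ratio of scale-out throughput to the 1 Server - 1 Client baseline."""
    return scale_out_ops / baseline_ops


def scaling_efficiency(scale_out_ops, baseline_ops, instances=INSTANCES):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup(scale_out_ops, baseline_ops) / instances


# Hypothetical throughputs in operations per second (not measured data).
baseline = 500_000
scale_out = 2_800_000
print(speedup(scale_out, baseline), scaling_efficiency(scale_out, baseline))
```

An efficiency of 1.0 would correspond to the theoretical speedup of 8; load-balancer overhead and contention on the host push real results below that.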


The main results of this study are the following:

  1. VMs and Docker containers are truly better together. The Redis scale-out system, using out-of-the-box configuration settings, clearly achieves better performance in the Docker-VM scenario than in the Native or Docker scenarios. Even though its performance is not as high as in the VM scenario, the Docker-VM setup offers the same ease of use and deployment typical of the Docker scenario, at a substantially higher performance.
  2. Using VMs and Docker, we managed to scale out a Redis deployment and extracted a great deal of extra performance (up to 5.6x more) from a large server that would have otherwise been underutilized.