Docker Containers Performance in VMware vSphere

by Qasim Ali, Banit Agrawal, and Davide Bergamasco

“Containers without compromise.” This was one of the key messages at VMworld 2014 USA in San Francisco. It was presented in the opening keynote, and then the advantages of running Docker containers inside of virtual machines were discussed in detail in several breakout sessions. These include security/isolation guarantees and also the existing rich set of management functionalities. But some may say, “These benefits don’t come for free: what about the performance overhead of running containers in a VM?”

A recent report compared the performance of a Docker container to a KVM VM and showed very poor performance in some micro-benchmarks and real-world use cases: up to 60% degradation. These results were somewhat surprising to those of us accustomed to near-native performance of virtual machines, so we set out to do similar experiments with VMware vSphere. Below, we present our findings of running Docker containers in a vSphere VM and in a native configuration. Briefly,

We find that for most of these micro-benchmarks and Redis tests, vSphere delivered near-native performance with generally less than 5% overhead.
Running an application in a Docker container in a vSphere VM has very similar overhead of running containers on a native OS (directly on a physical server).

Next, we present the configuration and benchmark details as well as the performance results.

Deployment Scenarios

We compare four different scenarios as illustrated below:

Native: Linux OS running directly on hardware (Ubuntu, CentOS)
vSphere VM: Upcoming release of vSphere with the same guest OS as native
Native-Docker: Docker version 1.2 running on a native OS
VM-Docker: Docker version 1.2 running in guest VM on a vSphere host

In each configuration all the power management features are disabled in the BIOS and Ubuntu OS.

Figure 1. Different test scenarios

Benchmarks/Workloads

For this study, we used the micro-benchmarks listed below and also simulated a real-world use case.

Micro-benchmarks:

LINPACK: This benchmark solves a dense system of linear equations. For large problem sizes it has a large working set and does mostly floating point operations.
STREAM: This benchmark measures memory bandwidth across various configurations.
FIO: This benchmark is used for I/O benchmarking for block devices and file systems.
Netperf: This benchmark is used to measure network performance.

Real-world workload:

Redis: In this experiment, many clients perform continuous requests to the Redis server (key-value datastore).

For all of the tests, we run multiple iterations and report the average of multiple runs.

Performance Results

LINPACK

LINPACK solves a dense system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system of N equations, converts that time into a performance rate, and tests the results for accuracy. We used an optimized version of the LINPACK benchmark binary based on the Intel Math Kernel Library (MKL).

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 32 vCPU VM with 45K and 65K problem sizes

Figure 2. LINPACK performance for different test scenarios

We disabled HT for this run as recommended by the benchmark guidelines to get the best peak performance. For the 45K problem size, the benchmark consumed about 16GB memory. All memory was backed by transparent large pages. For VM results, large pages were used both in the guest (transparent large pages) and at the hypervisor level (default for vSphere hypervisor). There was 1-2% run-to-run variation for the 45K problem size. For 65K size, 33.8GB memory was consumed and there was less than 1% variation.

As shown in Figure 2, there is almost negligible virtualization overhead in the 45K problem size. For a bigger problem size, there is some inherent hardware virtualization overhead due to nested page table walk. This results in the 5% drop in performance observed in the VM case. There is no additional overhead of running the application in a Docker container in a VM compared to running the application directly in the VM.

STREAM

We used a NUMA-aware STREAM benchmark, which is the classical STREAM benchmark extended to take advantage of NUMA systems. This benchmark measures the memory bandwidth across four different operations: Copy, Scale, Add, and Triad.

Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, HT enabled
Software: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 64 vCPU VM (Hyper-Threading ON)

Figure 3. STREAM performance for different test scenarios

We used an array size of 2 billion, which used about 45GB of memory. We ran the benchmark with 64 threads both in the native and virtual cases. As shown in Figure 3, the VM added about 2-3% overhead across all four operations. The small 1-2% overhead of using a Docker container on a native platform is probably in the noise margin.

FIO

We used Flexible I/O (FIO) tool version 2.1.3 to compare the storage performance for the native and virtual configurations, with Docker containers running in both. We created a 10GB file in a 400GB local SSD drive and used direct I/O for all our tests so that there were no effects of buffer caching inside the OS. We used a 4k I/O size and tested three different I/O profiles: random 100% read, random 100% write, and a mixed case with random 70% read and 30% write. For the 100% random read and write tests, we selected 8 threads and an I/O depth of 16, whereas for the mixed test, we select an I/O depth of 32 and 8 threads. We use the taskset to set the CPU affinity on FIO threads in all configurations. All the details of the experimental setup are given below:

Hardware: 2 socket Intel Xeon E5-2660 2.2GHz with 392GB RAM, 16 total cores, Hyper-Threading enabled
Guest: 32-vCPU 14.04.1 Ubuntu 64-bit server with 256GB RAM, with a separate ext4 disk in the guest (on VMFS5 in vSphere run)
Benchmark: FIO, Direct I/O, 10GB file
I/O Profile: 4k I/O, Random Read/Write: depth 16, jobs 8, Mixed: depth 32, jobs 8

Figure 4. FIO benchmark performance for different test scenarios

The figure above shows the normalized maximum IOPS achieved for different configurations and different I/O profiles. For random read in a VM, we see that there is about 2% reduction in maximum achievable IOPS when compared to the native case. However, for the random write and mixed tests, we observed almost the same performance (within the noise margin) compared to the native configuration.

Netperf

Netperf is used to measure throughput and latency of networking operations. All the details of the experimental setup are given below:

Hardware (Server): 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
Hardware (Client): 2 socket Intel Xeon X5570 2.93GHz with 64GB RAM, 8 cores total, Hyper-Threading disabled
Networking hardware: Broadcom Corporation NetXtreme II BCM57810
Software on server and Client: Ubuntu 14.04.1 with Docker 1.2
VM configuration: 2 vCPU VM with 4GB RAM

The server machine for Native is configured to have only 2 CPUs online for fair comparison with a 2-vCPU VM. The client machine is also configured to have 2 CPUs online to reduce variability. We tested four configurations: directly on the physical hardware (Native), in a Docker container (Native-Docker), in a virtual machine (VM), and in a Docker container inside a VM (VM-Docker). For the two Docker deployment scenarios, we also studied the effect of using host networking as opposed to the Docker bridge mode (default operating mode), resulting in two additional configurations (Native-Docker-HostNet and VM-Docker-HostNet) making total six configurations.

We used TCP_STREAM and TCP_RR tests to measure the throughput and round-trip network latency between the server machine and the client machine using a direct 10Gbps Ethernet link between two NICs. We used standard network tuning like TCP window scaling and setting socket buffer sizes for the throughput tests.

Figure 5. Netperf Recieve performance for different test scenarios

Figure 6. Netperf transmit performance for different test scenarios

Figures 5 and 6 show the unidirectional throughput over a single TCP connection with standard 1500 byte MTU for both transmit and receive TCP_STREAM cases (We used multiple Streams in VM-Docker* transmit case to reduce the variability in runs due to Docker bridge overhead and get predictable results). Throughput numbers for all configurations are identical and equal to the maximum possible 9.40Gbps on a 10GbE NIC.

Figure 7. Netperf TCP_RR performance for different test scenarios (Lower is better)

For the latency tests, we used the latency sensitivity feature introduced in vSphere5.5 and applied the best practices for tuning latency in a VM as mentioned in this white paper. As shown in Figure 7, latency in a VM with VMXNET3 device is only 15 microseconds more than in the native case because of the hypervisor networking stack. If users wish to reduce the latency even further for extremely latency- sensitive workloads, pass-through mode or SR-IOV can be configured to allow the guest VM to bypass the hypervisor network stack. This configuration can achieve similar round-trip latency to native, as shown in Figure 8. The Native-Docker and VM-Docker configuration adds about 9-10 microseconds of overhead due to the Docker bridge NAT function. A Docker container (running natively or in a VM) when configured to use host networking achieves similar latencies compared to the latencies observed when not running the workload in a container (native or a VM).

Figure 8. Netperf TCP_RR performance for different test scenarios (VMs in pass-through mode)

Redis

We also wanted to take a look at how Docker in a virtualized environment performs with real world applications. We chose Redis because: (1) it is a very popular application in the Docker space (based on the number of pulls of the Redis image from the official Docker registry); and (2) it is very demanding on several subsystems at once (CPU, memory, network), which makes it very effective as a whole system benchmark.

Our test-bed comprised two hosts connected by a 10GbE network. One of the hosts ran the Redis server in different configurations as mentioned in the netperf section. The other host ran the standard Redis benchmark program, redis-benchmark, in a VM.

The details about the hardware and software used in the experiments are the following:

Hardware: HP ProLiant DL380e Gen8 2 socket Intel Xeon E5-2470 2.3GHz with 96GB RAM, 16 total cores, Hyper-Threading enabled
Guest OS: CentOS 7
VM: 16 vCPU, 93GB RAM
Application: Redis 2.8.13
Benchmark: redis-benchmark, 1000 clients, pipeline: 1 request, operations: SET 1 Byte
Software configuration: Redis thread pinned to CPU 0 and network interrupts pinned to CPU 1

Since Redis is a single-threaded application, we decided to pin it to one of the CPUs and pin the network interrupts to an adjacent CPU in order to maximize cache locality and avoid cross-NUMA node memory access. The workload we used consists of 1000 clients with a pipeline of 1 outstanding request setting a 1 byte value with a randomly generated key in a space of 100 billion keys. This workload is highly stressful to the system resources because: (1) every operation results in a memory allocation; (2) the payload size is as small as it gets, resulting in very large number of small network packets; (3) as a consequence of (2), the frequency of operations is extremely high, resulting in complete saturation of the CPU running Redis and a high load on the CPU handling the network interrupts.

We ran five experiments for each of the above-mentioned configurations, and we measured the average throughput (operations per second) achieved during each run. The results of these experiments are summarized in the following chart.

Figure 9. Redis performance for different test scenarios

The results are reported as a ratio with respect to native of the mean throughput over the 5 runs (error bars show the range of variability over those runs).

Redis running in a VM has slightly lower performance than on a native OS because of the network virtualization overhead introduced by the hypervisor. When Redis is run in a Docker container on native, the throughput is significantly lower than native because of the overhead introduced by the Docker bridge NAT function. In the VM-Docker case, the performance drop compared to the Native-Docker case is almost exactly the same small amount as in the VM-Native comparison, again because of the network virtualization overhead. However, when Docker runs using host networking instead of its own internal bridge, near-native performance is observed for both the Docker on native hardware and Docker in VM cases, reaching 98% and 96% of the maximum throughput respectively.

Based on the above results, we can conclude that virtualization introduces only a 2% to 4% performance penalty. This makes it possible to run applications like Redis in a Docker container inside a VM and retain all the virtualization advantages (security and performance isolation, management infrastructure, and more) while paying only a small price in terms of performance.

Summary

In this blog, we showed that in addition to the well-known security, isolation, and manageability advantages of virtualization, running an application in a Docker container in a vSphere VM adds very little performance overhead compared to running the application in a Docker container on a native OS. Furthermore, we found that a container in a VM delivers near native performance for Redis and most of the micro-benchmark tests we ran.

In this post, we focused on the performance of running a single instance of an application in a container, VM, or native OS. We are currently exploring scale-out applications and the performance implications of deploying them on various combinations of containers, VMs, and native operating systems. The results will be covered in the next installment of this series. Stay tuned!