Virtualization

Docker Containers Performance in VMware vSphere

by Qasim Ali, Banit Agrawal, and Davide Bergamasco

“Containers without compromise.” This was one of the key messages at VMworld 2014 USA in San Francisco. It was presented in the opening keynote, and then the advantages of running Docker containers inside of virtual machines were discussed in detail in several breakout sessions. These include security/isolation guarantees and also the existing rich set of management functionalities. But some may say, “These benefits don’t come for free: what about the performance overhead of running containers in a VM?”

A recent report compared the performance of a Docker container to a KVM VM and showed very poor performance in some micro-benchmarks and real-world use cases: up to 60% degradation. These results were somewhat surprising to those of us accustomed to near-native performance of virtual machines, so we set out to do similar experiments with VMware vSphere. Below, we present our findings of running Docker containers in a vSphere VM and  in a native configuration. Briefly,

  • We find that for most of these micro-benchmarks and Redis tests, vSphere delivered near-native performance with generally less than 5% overhead.
  • Running an application in a Docker container in a vSphere VM has very similar overhead of running containers on a native OS (directly on a physical server).

Next, we present the configuration and benchmark details as well as the performance results.

Deployment Scenarios

We compare four different scenarios as illustrated below:

  • Native: Linux OS running directly on hardware (Ubuntu, CentOS)
  • vSphere VM: Upcoming release of vSphere with the same guest OS as native
  • Native-Docker: Docker version 1.2 running on a native OS
  • VM-Docker: Docker version 1.2 running in guest VM on a vSphere host

In each configuration all the power management features are disabled in the BIOS and Ubuntu OS.

Figure 1. Different test scenarios

Benchmarks/Workloads

For this study, we used the micro-benchmarks listed below and also simulated a real-world use case.

Micro-benchmarks:

  • LINPACK: This benchmark solves a dense system of linear equations. For large problem sizes it has a large working set and does mostly floating point operations.
  • STREAM: This benchmark measures memory bandwidth across various configurations.
  • FIO: This benchmark is used for I/O benchmarking for block devices and file systems.
  • Netperf: This benchmark is used to measure network performance.

Real-world workload:

  • Redis: In this experiment, many clients perform continuous requests to the Redis server (key-value datastore).

For all of the tests, we run multiple iterations and report the average of multiple runs.

Performance Results

LINPACK

LINPACK solves a dense system of linear equations (Ax=b), measures the amount of time it takes to factor and solve the system of N equations, converts that time into a performance rate, and tests the results for accuracy. We used an optimized version of the LINPACK benchmark binary based on the Intel Math Kernel Library (MKL).

  • Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
  • Software: Ubuntu 14.04.1 with Docker 1.2
  • VM configuration: 32 vCPU VM with 45K and 65K problem sizes

Figure 2. LINPACK performance for different test scenarios

We disabled HT for this run as recommended by the benchmark guidelines to get the best peak performance. For the 45K problem size, the benchmark consumed about 16GB memory. All memory was backed by transparent large pages. For VM results, large pages were used both in the guest (transparent large pages) and at the hypervisor level (default for vSphere hypervisor). There was 1-2% run-to-run variation for the 45K problem size. For 65K size, 33.8GB memory was consumed and there was less than 1% variation.

As shown in Figure 2, there is almost negligible virtualization overhead in the 45K problem size. For a bigger problem size, there is some inherent hardware virtualization overhead due to nested page table walk. This results in the 5% drop in performance observed in the VM case. There is no additional overhead of running the application in a Docker container in a VM compared to running the application directly in the VM.

STREAM

We used a NUMA-aware  STREAM benchmark, which is the classical STREAM benchmark extended to take advantage of NUMA systems. This benchmark measures the memory bandwidth across four different operations: Copy, Scale, Add, and Triad.

  • Hardware: 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, HT enabled
  • Software: Ubuntu 14.04.1 with Docker 1.2
  • VM configuration: 64 vCPU VM (Hyper-Threading ON)

Figure 3. STREAM performance for different test scenarios

We used an array size of 2 billion, which used about 45GB of memory. We ran the benchmark with 64 threads both in the native and virtual cases. As shown in Figure 3, the VM added about 2-3% overhead across all four operations. The small 1-2% overhead of using a Docker container on a native platform is probably in the noise margin.

FIO

We used Flexible I/O (FIO) tool version 2.1.3 to compare the storage performance for the native and virtual configurations, with Docker containers running in both. We created a 10GB file in a 400GB local SSD drive and used direct I/O for all our tests so that there were no effects of buffer caching inside the OS. We used a 4k I/O size and tested three different I/O profiles: random 100% read, random 100% write, and a mixed case with random 70% read and 30% write. For the 100% random read and write tests, we selected 8 threads and an I/O depth of 16, whereas for the mixed test, we select an I/O depth of 32 and 8 threads. We use the taskset to set the CPU affinity on FIO threads in all configurations. All the details of the experimental setup are given below:

  • Hardware: 2 socket Intel Xeon E5-2660 2.2GHz with 392GB RAM, 16 total cores, Hyper-Threading enabled
  • Guest: 32-vCPU  14.04.1 Ubuntu 64-bit server with 256GB RAM, with a separate ext4 disk in the guest (on VMFS5 in vSphere run)
  • Benchmark:  FIO, Direct I/O, 10GB file
  • I/O Profile:  4k I/O, Random Read/Write: depth 16, jobs 8, Mixed: depth 32, jobs 8

Figure 4. FIO benchmark performance for different test scenarios

The figure above shows the normalized maximum IOPS achieved for different configurations and different I/O profiles. For random read in a VM, we see that there is about 2% reduction in maximum achievable IOPS when compared to the native case. However, for the random write and mixed tests, we observed almost the same performance (within the noise margin) compared to the native configuration.

Netperf

Netperf is used to measure throughput and latency of networking operations. All the details of the experimental setup are given below:

  • Hardware (Server): 4 socket Intel Xeon E5-4650 2.7GHz with 512GB RAM, 32 total cores, Hyper-Threading disabled
  • Hardware (Client): 2 socket Intel Xeon X5570 2.93GHz with 64GB RAM, 8 cores total, Hyper-Threading disabled
  • Networking hardware: Broadcom Corporation NetXtreme II BCM57810
  • Software on server and Client: Ubuntu 14.04.1 with Docker 1.2
  • VM configuration: 2 vCPU VM with 4GB RAM

The server machine for Native is configured to have only 2 CPUs online for fair comparison with a 2-vCPU VM. The client machine is also configured to have 2 CPUs online to reduce variability. We tested four configurations: directly on the physical hardware (Native), in a Docker container (Native-Docker), in a virtual machine (VM), and in a Docker container inside a VM (VM-Docker). For the two Docker deployment scenarios, we also studied the effect of using host networking as opposed to the Docker bridge mode (default operating mode), resulting in two additional configurations (Native-Docker-HostNet and VM-Docker-HostNet) making total six configurations.

We used TCP_STREAM and TCP_RR tests to measure the throughput and round-trip network latency between the server machine and the client machine using a direct 10Gbps Ethernet link between two NICs. We used standard network tuning like TCP window scaling and setting socket buffer sizes for the throughput tests.

Figure 5. Netperf Recieve performance for different test scenarios

Figure 6. Netperf transmit performance for different test scenarios

Figures 5 and 6 show the unidirectional throughput over a single TCP connection with standard 1500 byte MTU for both transmit and receive TCP_STREAM cases (We used multiple Streams in VM-Docker* transmit case to reduce the variability in runs due to Docker bridge overhead and get predictable results). Throughput numbers for all configurations are identical and equal to the maximum possible 9.40Gbps on a 10GbE NIC.

Figure 7. Netperf TCP_RR performance for different test scenarios (Lower is better)

For the latency tests, we used the latency sensitivity feature introduced in vSphere5.5 and applied the best practices for tuning latency in a VM as mentioned in this white paper. As shown in Figure 7, latency in a VM with VMXNET3 device is only 15 microseconds more than in the native case because of the hypervisor networking stack. If users wish to reduce the latency even further for extremely latency- sensitive workloads, pass-through mode or SR-IOV can be configured to allow the guest VM to bypass the hypervisor network stack. This configuration can achieve similar round-trip latency to native, as shown in Figure 8. The Native-Docker and VM-Docker configuration adds about 9-10 microseconds of overhead due to the Docker bridge NAT function. A Docker container (running natively or in a VM) when configured to use host networking achieves similar latencies compared to the latencies observed when not running the workload in a container (native or a VM).

Figure 8. Netperf TCP_RR performance for different test scenarios (VMs in pass-through mode)

Redis

We also wanted to take a look at how Docker in a virtualized environment performs with real world applications. We chose Redis because: (1) it is a very popular application in the Docker space (based on the number of pulls of the Redis image from the official Docker registry); and (2) it is very demanding on several subsystems at once (CPU, memory, network), which makes it very effective as a whole system benchmark.

Our test-bed comprised two hosts connected by a 10GbE network. One of the hosts ran the Redis server in different configurations as mentioned in the netperf section. The other host ran the standard Redis benchmark program, redis-benchmark, in a VM.

The details about the hardware and software used in the experiments are the following:

  • Hardware: HP ProLiant DL380e Gen8 2 socket Intel Xeon E5-2470 2.3GHz with 96GB RAM, 16 total cores, Hyper-Threading enabled
  • Guest OS: CentOS 7
  • VM: 16 vCPU, 93GB RAM
  • Application: Redis 2.8.13
  • Benchmark: redis-benchmark, 1000 clients, pipeline: 1 request, operations: SET 1 Byte
  • Software configuration: Redis thread pinned to CPU 0 and network interrupts pinned to CPU 1

Since Redis is a single-threaded application, we decided to pin it to one of the CPUs and pin the network interrupts to an adjacent CPU in order to maximize cache locality and avoid cross-NUMA node memory access.  The workload we used consists of 1000 clients with a pipeline of 1 outstanding request setting a 1 byte value with a randomly generated key in a space of 100 billion keys.  This workload is highly stressful to the system resources because: (1) every operation results in a memory allocation; (2) the payload size is as small as it gets, resulting in very large number of small network packets; (3) as a consequence of (2), the frequency of operations is extremely high, resulting in complete saturation of the CPU running Redis and a high load on the CPU handling the network interrupts.

We ran five experiments for each of the above-mentioned configurations, and we measured the average throughput (operations per second) achieved during each run.  The results of these experiments are summarized in the following chart.

Figure 9. Redis performance for different test scenarios

The results are reported as a ratio with respect to native of the mean throughput over the 5 runs (error bars show the range of variability over those runs).

Redis running in a VM has slightly lower performance than on a native OS because of the network virtualization overhead introduced by the hypervisor. When Redis is run in a Docker container on native, the throughput is significantly lower than native because of the overhead introduced by the Docker bridge NAT function. In the VM-Docker case, the performance drop compared to the Native-Docker case is almost exactly the same small amount as in the VM-Native comparison, again because of the network virtualization overhead.  However, when Docker runs using host networking instead of its own internal bridge, near-native performance is observed for both the Docker on native hardware and Docker in VM cases, reaching 98% and 96% of the maximum throughput respectively.

Based on the above results, we can conclude that virtualization introduces only a 2% to 4% performance penalty.  This makes it possible to run applications like Redis in a Docker container inside a VM and retain all the virtualization advantages (security and performance isolation, management infrastructure, and more) while paying only a small price in terms of performance.

Summary

In this blog, we showed that in addition to the well-known security, isolation, and manageability advantages of virtualization, running an application in a Docker container in a vSphere VM adds very little performance overhead compared to running the application in a Docker container on a native OS. Furthermore, we found that a container in a VM delivers near native performance for Redis and most of the micro-benchmark tests we ran.

In this post, we focused on the performance of running a single instance of an application in a container, VM, or native OS. We are currently exploring scale-out applications and the performance implications of deploying them on various combinations of containers, VMs, and native operating systems.  The results will be covered in the next installment of this series. Stay tuned!

Comments

35 comments have been added so far

  1. Nice article. What was the disk configuration for the fio test? single disk, raid 5/6 or raid 10? Local or SAN?

  2. Why is the exact version of the hypervisor not mentioned? And why don’t we get the commandlines of the benchmarks?
    Why do you make it extra difficult to rerun those benchmarks on different setups?

  3. Great article.

    I am little confused about VM and Docker. This article and other says Docker is as good as VM.

    If the performance of VM and Docker is the same then what is the advantage of using Docker? I have all the benefits under VM. So what I am missing here?

    Shair Khan

    1. While the performance may be similar, Docker takes advantage of containerization to utilize the host OS, rather than having to load a whole additional OS into memory to run a VM. As a result, it has lower resource usage overheads meaning you can a) run smaller (cheaper) servers, and/or b) run more dockerized applications on a single server, each of them sandboxed, which would otherwise require you to run a VM for each application.

  4. Yes, I’m also confused a little about Docker and VM, this container is quite like the Zones in Solaris, and such containers are supposed to replace VM in the future since it’s lightweight, although this post is discussing the performance running a Docker container in a VM like KVM and VMWare VM (the result shows VMWare VM is much better than KVM), but why we still need to running a container in a VM, we just need lots fo containers in a native OS.
    And then here comes a question, if we run lots of similar containers in a native OS, like web service container or mid-ware apps container, why do not we just configure the service/apps as clusters with lots of instances in a host like years ago before cloud computing. Nowadays most of the applications have cluster/HA ability, like weblogic, jboss, oracle, etc. Unless we will run different roles in a host, for instance, put web service container and database container in a same host, confusing that will happen in the future as PaaS ?
    Is Docker easier for deployment, there are still a lot of work to do with the configuration changes after deployed to another host.

    app clustered in a host —> load balancer + apps in VMs —> apps in Docker/Containers in a host —
    ^ |
    | |
    | ———————————————–will it change back ? ————————————————————|

    1. Hi Jeff,

      Docker actually is Zones on Solaris – Docker unifies api for containers on different operating systems and leverages container mechanisms already available. On Linux it uses LXC, on Solaris it uses Zones, and on BSD it uses jails.
      Regarding your question about whether it makes sense to run containers in VM or not – I think it still makes a lot of sense, for two reasons, one theoretical and one practical:
      – theoretical is that containers provide different level of isolation – you have separate resources and processes, but you share the same kernel, e.g. that means that each container on particular system has to be the same like a parent – if you run containers on linux machine, they all have to be linux OSes. Docker tries to solve that by unifying the API between different OSes, but you still have to provision those OSes, which leads to another reason –
      – practical one. It’s much easier to provision different OSes using VMs and tools like vSphere ecosystem than using a bare metal. There are things like MAAS, but it’s hard to compare them with the vSphere in terms of advancement and scalability. In my opinion today’s challenge is not to ensure HA, as this has been solved, but be able to manage big heterogenous environments without hassle, because today you have many more VMs and clusters than you had couple years ago for the same cost. Existing tools allow that for VMs. There is still a work to when it comes to containers, but VMs + containers should be the way to go for the next couple of years.

      1. Hi Karol, could you please answer the below?

        Operations provision machines via VM template (Linux and Windows) and Redhat satellite (Linux only). Deployments in our environments can be cumbersome. This usually involves using yum on our redhat servers to uninstall and reinstall new components. Modifying entries in configuration files, running scripts on (Linux AND Windows) servers all in a particular order and checking components as they come up. Once completed the environment must be tested by 3 different teams often involving a lot of manual testing using terminals.
        1. What can we do to improve the way these are handled?
        2. Are there any tool(s) you would recommend, and what are the attributes of the tool that make it suitable for our situation? (WINDOWS, LINUX, AWS)
        3. How would this fit into the way in which the operations team provide base infrastructure components.

  5. For the netperf in 10G ethernet:
    “Throughput numbers for all configurations are identical and equal to the maximum possible 9.40Gbps on a 10GbE NIC.”

    How did you saturated the 10G by netperf? What netperf (client) command used to triagger the 9.4Gbpstraffic?

    Thanks!

  6. Conclusion seems simple – use docker and save the costly vmware licenses, performance is near identicsl. Win-win.

  7. Could nybody answer the below questions? is Docker with VM the answer and solution for this? Thanks.
    Operations provision machines via VM template (Linux and Windows) and Redhat satellite (Linux only). Deployments in our environments can be cumbersome. This usually involves using yum on our redhat servers to uninstall and reinstall new components. Modifying entries in configuration files, running scripts on (Linux AND Windows) servers all in a particular order and checking components as they come up. Once completed the environment must be tested by 3 different teams often involving a lot of manual testing using terminals.
    1. What can we do to improve the way these are handled?
    2. Are there any tool(s) you would recommend, and what are the attributes of the tool that make it suitable for our situation? (WINDOWS, LINUX, AWS)
    3. How would this fit into the way in which the operations team provide base infrastructure components.

Leave a Reply

Your email address will not be published. Required fields are marked *