by Davide Bergamasco
In an earlier VROOM! post we discussed, among other things, the performance of the Redis in-memory key-value store in a Docker/vSphere environment. In that post we focused on a single instance of a Redis server subject to a more or less artificial workload with the goal of assessing the absolute performance of said instance under various deployment scenarios.
In this post we are taking a different point of view, which is maximizing the throughput of multiple Redis instances running on a “large” server under a more realistic workload. Why are we interested in this perspective? Conceptually, Redis is an extremely simple application, being just a thin layer of code implementing a large hash table on top of system calls. From the implementation standpoint, a single-threaded event loop services requests from the clients in a polling fashion. The problem with this design is that it is not suitable for “scaling up”; that is, improving performance by using multiple cores. Modern servers have many processing cores (up to 80) and possibly terabytes of memory. However, Redis can only access that memory at the speed of a single core.
This problem can be solved by “scaling out” Redis; that is, by partitioning the server memory across multiple Redis instances and running each of those on a different core. This can be achieved by using a set of load balancers to fragment the key space and distribute the load among the various instances. The diagram shown in Figure 1 illustrates this concept.
Host H3 runs the various Redis server instances (red boxes), while Host H2 runs two sets of load balancers:
- The green boxes are the Redis load balancers, which partition the key space using a consistent hashing algorithm. We leveraged the Twemproxy OSS project to implement the Redis load balancers.
- The yellow boxes are TCP load balancers, which distribute the load across the Redis load balancers in a round robin fashion. We used the HAProxy OSS project to implement the TCP load balancers.
Finally, Host H1 runs the load generators (dark blue boxes); that is, the standard benchmark redis-benchmark.
We assessed the performance of this design across a set of deployment scenarios analogous to what we considered in the previous post. These are listed below and illustrated in Figure 2:
- Native: Redis instances are run as 8 separate processes on the Linux OS running directly on Host H3 hardware.
- VM: Redis instances are run inside 8, 2-vCPU VMs running on a pre-release build of vSphere 6.0.0 running on Host H3 hardware; the guest OS is the same as the Native scenario.
- Native-Docker: Redis instances are run inside 8 Docker containers running on the Native OS.
- VM-Docker: Redis instances are run inside Docker containers each running inside the same VMs as the VM scenario, with one container per VM.
The following are the details about the hardware, software, and workload used in the various experiments discussed in the next section:
- HP ProLiant DL380e Gen8
- CPU: 2 x Intel® Xeon® CPU E5-2470 0 @ 2.30GHz (16 cores, 32 hyper-threads total)
- Memory: 96GB
- Hardware configuration: Hyper-Threading ON, turbo-boost OFF, power policy: Static High (no power management)
- Network: 10GbE
- Storage: 8 x 500GB 15,000 RPM 6Gb SAS disks, HP H220 host bus adapter
- CentOS 7
- Kernel 3.18.1 (CentOS 7 comes with 3.10.0, but we wanted to use the latest kernel available at the time of this writing)
- Docker 1.2
- VMware vSphere 6.0.0 (pre-release build)
- 8 x 2-vCPU, 11GB (VM scenario)
- Virtual NIC: vmxnet3
- Virtual HBA: LSI-SAS
- Redis 2.8.13
- AOF persistency with “everysec” flush policy (every operation that mutates a key is logged into an Append Only File in order to enable data recovery after a crash; the buffer cache is flushed every second, so with this durability policy at most one second worth of data can be lost)
- Keyspace: 250 million keys, value size 1 byte (this size has been chosen to prevent network or storage from becoming bottlenecks from the bandwidth perspective)
- 8 redis-benchmark instances each simulating 100 clients with a pipeline depth of 30 requests
- Operations mix: 75% GET, 25% SET
We ran two sets of experiments for every scenario listed in the “Deployment Scenarios” section. The first set was meant to establish a baseline by having a single redis-benchmark instance generating requests directly against a single redis-server instance. The second set aimed at assessing the overall performance of the Redis scale-out system we presented earlier. The results of these two set of experiments are shown in Figure 3, where each bar represents the throughput in operations per second averaged over five trials, and error bars indicate the range of the measured values.
Nothing really surprising can be noticed looking at the results of the baseline experiments (labeled “1 Server – 1 Client” in Figure 3). The Native scenario is obviously the fastest in terms of operations per second, followed by the Docker, VM, and the Docker-VM scenarios. This is expected as both virtualization and containerization add some overhead on top of the bare-metal performance.
Looking at the scale-out experiments (labeled “Scale-Out” in Figure 3), we see a surprisingly different picture. The VM scenario is now the fastest, followed by Docker-VM, while the Native and Docker scenarios come in as a somewhat distant third and fourth. This unexpected result can be explained by looking at the Host H3 CPU activity during an experiment run. In the Native and Docker scenarios, notice that the CPU load is spread over the 16 cores. This means that even though only 8 threads are active (the 8 redis-server instances), the Linux scheduler is continuously migrating them. This might result in a large number of cross-NUMA node memory accesses, which are substantially more expensive than same-NUMA node accesses. Also, irqbalance is spreading the network card interrupts across all the 16 cores, additionally contributing to the above phenomenon.
In the VM and Docker-VM scenarios, this does not occur because the ESXi scheduler goes to great lengths to keep both the memory and vCPUs of a VM on one NUMA node. Also, with the PVSCSI virtual device, the virtual interrupts are always routed to the same vCPU(s) that initiated an I/O, and this minimizes interrupt migrations.
We tried to eliminate the cross-NUMA node memory activity in the Native scenario by pinning all the redis-server processes to the cores of the same CPU; that is, to the same NUMA node. We also disabled irqbalance and manually pinned the interrupt vectors to the same set of cores. As expected, with this ad-hoc configuration, the Native scenario was the fastest, reaching 3.408 million operations per second. Without any pinning, the VM result is only 4% slower than the optimized Native performance. (Notice that introducing artificial affinity between processes/interrupt vectors and cores is not a recommended practice as it is error-prone and can, in general, lead to unexpected or suboptimal results.)
Our initial experiments were conducted with the CentOS 7 stock kernel (3.10.0), which unfortunately is not particularly recent. We thought it was prudent to verify if the Linux scheduler had been improved to avoid the inter-NUMA node thread migrations in more recent kernel versions. Hence, we re-ran all the experiments with the latest version (at the time of this writing, 3.18.1), but we didn’t notice any significant difference with respect to version 3.10.0.
We thought it would be interesting to look at the performance numbers in terms of speedup; that is, the ratio between the throughput of the scale-out system and the throughput of the baseline 1 Server – 1 Client setup. Figure 4 below shows the speedup for the four scenarios considered in this study.
The speedup essentially tells, in relative terms, how much the performance has improved by deploying 8 Redis instances on the same host instead of on a single one. If the system scaled linearly, it would have achieved a maximum theoretical speedup of 8. In practice, this limit could not be achieved because of extra overheads introduced by the load-balancers and possible resource contention across the Redis instances running on host H3 (this host is almost running at saturation as the overall CPU utilization is consistently between 75% and 85% during the experiment’s execution). In any case, the scale-out system delivers a performance boost of at least 4x as compared to running a single Redis instance with exactly the same memory capacity. The VM and Docker-VM scenarios achieve a substantially larger speedup because of the cross-NUMA memory access issue afflicting the Native and Docker scenarios.
The main results of this study are the following:
- VMs and Docker containers are truly better together. The Redis scale-out system, using out-of-the-box configuration settings, clearly achieves better performance in the Docker-VM scenario than in the Native or Docker scenarios. Even though its performance is not as high as in the VM scenario, the Docker-VM setup offers the same ease of use and deployment typical of the Docker scenario, at a substantially higher performance.
- Using VMs and Docker, we managed to scale out a Redis deployment and extracted a great deal of extra performance (up to 5.6x more) from a large server that would have otherwise been underutilized.