By Karthik Ganesan and Jared Rosoff
At VMworld US 2019, VMware announced vSphere 7 with Kubernetes (Project Pacific), an evolution of vSphere into a Kubernetes-native platform. vSphere 7 with Kubernetes (among other things) introduces a vSphere Supervisor Cluster, which lets you run Kubernetes pods natively on ESXi (called vSphere Native Pods) with the same level of isolation as virtual machines. At VMworld, we claimed that vSphere Native Pods running on vSphere, isolated by the hypervisor, can achieve up to 8% better performance than pods on a bare-metal Linux Kubernetes node. While it may sound a bit counter-intuitive that virtualized performance is better than bare metal, let’s take a deeper look to understand how this is possible.
Why are vSphere Native Pods faster?
This benefit comes primarily from ESXi doing a better job of scheduling the natively run pods on the right CPUs, which improves memory locality and dramatically reduces the number of remote memory accesses. The ESXi CPU scheduler knows that these pods are independent entities and takes great care to keep each pod's memory accesses within its local NUMA (non-uniform memory access) domain. This results in better performance for the workloads running inside these pods and higher overall CPU efficiency. The Linux process scheduler, on the other hand, may not provide the same level of isolation across NUMA domains.
Evaluating the performance of vSphere Native Pods
To compare the performance, we set up the testbeds shown in Figure 1 on identical hardware with 44 cores running at 2.2 GHz and 512 GB of memory:
- A vSphere 7 Supervisor Cluster with two ESXi Kubernetes nodes
- A two-node, bare-metal popular Enterprise Linux cluster with out-of-the-box settings
Typically, a hyperthreaded processor core has multiple logical cores (hyperthreads) that share hardware resources between them. In the baseline case (testbed 2), the user specifies each pod's CPU resource requests and limits in terms of logical cores. By contrast, on a vSphere Supervisor Cluster, a pod's CPU resources are specified in terms of physical cores. Given this difference, we turned off hyperthreading in both testbeds to simplify the comparison and to be able to use an identical pod specification. In each cluster, we use one of the two Kubernetes nodes as the system under test, with the Kubernetes control plane running on the other node.
- We deploy 10 Kubernetes pods, each with a resource limit of 8 CPUs and 42 GB of RAM and a single container running a standard Java transaction benchmark, as shown in Figure 2. Given the complexity and nature of the workload, we used large pods to enable easier run management and score aggregation across pods.
- The pods are affinitized using the pod specification to the target Kubernetes node in each testbed.
- We use the benchmark score (maximum throughput) aggregated across all 10 pods to evaluate the performance of the system under test. The benchmark throughput scores used are sensitive to transaction latencies at various service-level agreement targets of 10ms, 25ms, 50ms, 75ms, and 100ms.
- The benchmark used has little I/O or network activity, and all the experiments are restricted to a single Kubernetes node. Therefore, we do not discuss the I/O or network performance aspects in this article.
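The pod setup described in the list above can be sketched as a small Python script that generates the ten pod manifests. This is only an illustration under stated assumptions: the node name, pod names, and image are hypothetical placeholders, the affinity is expressed here with a hostname `nodeSelector` (the article does not say which mechanism was used), and "42 GB" is written as `42Gi`.

```python
import json


def make_pod_manifest(index, node_name, image):
    """Build one pod manifest as a plain dict following the Kubernetes v1 Pod schema."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"java-bench-{index}"},  # hypothetical name
        "spec": {
            # Assumption: pin the pod to the node under test with a
            # hostname nodeSelector; nodeAffinity would work as well.
            "nodeSelector": {"kubernetes.io/hostname": node_name},
            "containers": [
                {
                    "name": "benchmark",
                    "image": image,  # hypothetical image reference
                    "resources": {
                        # Limits taken from the article: 8 CPUs, 42 GB RAM.
                        "limits": {"cpu": "8", "memory": "42Gi"},
                    },
                }
            ],
        },
    }


# Ten identical pods, all targeting the same (hypothetical) node.
pods = [
    make_pod_manifest(i, "node-under-test", "example.invalid/java-benchmark:latest")
    for i in range(10)
]

print(json.dumps(pods[0], indent=2))
```

Because the specification is expressed in the same units on both testbeds once hyperthreading is off, the same generated manifests could in principle be applied to either cluster.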
Figure 3 shows the performance of the vSphere Supervisor Cluster (green bar) normalized to that of the popular Enterprise Linux–based, bare-metal node.
8% better performance on the vSphere Supervisor Cluster vs bare-metal Enterprise Linux
We observe that the vSphere Supervisor Cluster can achieve up to 8% better overall aggregate performance compared to the bare-metal case. We repeated the runs on each testbed many times to average out run-to-run variations.
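The aggregation and normalization behind a chart like this can be sketched as follows. The function names are hypothetical, and the scores are illustrative placeholders rather than measured data; the sketch assumes that per-pod throughput is summed within a run, averaged across repeated runs, and then divided by the bare-metal aggregate.

```python
from statistics import mean


def aggregate_score(runs):
    """Sum per-pod scores within each run, then average across runs.

    `runs` is a list of runs; each run is a list of per-pod throughput scores.
    """
    return mean(sum(pod_scores) for pod_scores in runs)


def normalized_score(test_runs, baseline_runs):
    """Express the test aggregate relative to the baseline aggregate (1.0 = equal)."""
    return aggregate_score(test_runs) / aggregate_score(baseline_runs)


# Placeholder numbers only: two runs of ten pods on each testbed.
baseline_runs = [[100.0] * 10, [100.0] * 10]
test_runs = [[108.0] * 10, [108.0] * 10]

ratio = normalized_score(test_runs, baseline_runs)
print(f"normalized score: {ratio:.2f}")
```

Averaging over repeated runs before normalizing is what smooths out run-to-run variation in the final ratio.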
Analysis and optimizations
Looking at the system statistics, we observe that the workload running on the bare-metal testbed incurs many more remote NUMA node memory accesses than the same workload on the vSphere Supervisor Cluster. Note that the net gain on the vSphere Supervisor Cluster reflects both the improvement from better CPU scheduling and the performance overhead of virtualization.
To get deeper insights, we configure a set of pods to run the same workload at a constant throughput and collect hardware performance counter data on both testbeds to see what proportion of last-level cache (L3) misses hit in local DRAM vs remote DRAM. Local DRAM accesses incur much lower latency than remote DRAM accesses.
As shown in Figure 4, in the case of bare-metal Enterprise Linux, only about half of the L3 cache misses hit in local DRAM, and the rest are served by remote memory. In the case of the vSphere Supervisor Cluster, however, 99.2% of the L3 misses hit in local DRAM for an identical pod and workload configuration, due to superior CPU scheduling in ESXi. The lower memory access latencies that result from avoiding remote memory accesses contribute to the improved performance on the vSphere Supervisor Cluster.
About half of L3 cache misses hit Local DRAM on Linux vs most in vSphere 7 with Kubernetes.
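To make the link between locality and latency concrete, here is a minimal sketch that turns two counter values into a local-hit fraction and then into an average L3-miss latency. The function names, counter values, and the 80 ns and 140 ns latencies are hypothetical placeholders chosen for illustration; only the 50% and 99.2% local fractions come from the measurements above.

```python
def local_dram_fraction(local_hits, remote_hits):
    """Fraction of L3 misses that were served from local DRAM."""
    total = local_hits + remote_hits
    if total == 0:
        raise ValueError("no L3 misses recorded")
    return local_hits / total


def average_miss_latency(local_fraction, local_ns, remote_ns):
    """Weighted average latency of an L3 miss for a given local-hit fraction."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, for illustration only.
LOCAL_NS = 80.0
REMOTE_NS = 140.0

# Hypothetical counter values that reproduce the fractions reported above.
linux_fraction = local_dram_fraction(local_hits=500, remote_hits=500)
vsphere_fraction = local_dram_fraction(local_hits=992, remote_hits=8)

for name, fraction in [("bare metal", linux_fraction), ("vSphere", vsphere_fraction)]:
    latency = average_miss_latency(fraction, LOCAL_NS, REMOTE_NS)
    print(f"{name}: {fraction:.1%} local, about {latency:.1f} ns per L3 miss")
```

Under this simple model, the gap between the two testbeds grows with the difference between remote and local latency, which is why the effect depends on the hardware and on how memory-intensive the workload is.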
To mitigate the effects of non-local NUMA accesses on Linux, we tried basic optimizations, such as toggling NUMA balancing switches and using taskset-based pinning of pods to CPUs, but none of these substantially helped performance. One optimization that did help was numactl-based pinning of containers, which forces local NUMA allocations and ensures that the processes execute on the CPUs belonging to those NUMA domains. However, such pinning is merely a one-time workaround at this point, used to help our investigation.
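As a rough sketch of what numactl-based pinning involves, the snippet below builds a `numactl` command line that binds both CPU execution and memory allocation to one NUMA node, and spreads pods across nodes round-robin. The helper names, the workload command, and the two-node assumption are hypothetical, and the article does not describe how the commands were actually applied to the containers.

```python
import shlex


def numactl_command(numa_node, workload_cmd):
    """Return an argv list that runs `workload_cmd` bound to a single NUMA node.

    --cpunodebind restricts execution to the CPUs of the node, and
    --membind restricts memory allocation to the same node.
    """
    return [
        "numactl",
        f"--cpunodebind={numa_node}",
        f"--membind={numa_node}",
        *workload_cmd,
    ]


def assign_nodes(num_pods, num_numa_nodes):
    """Round-robin mapping of pod index to NUMA node."""
    return {pod: pod % num_numa_nodes for pod in range(num_pods)}


# Assumption: ten pods on a host with two NUMA nodes; the Java command
# line is a hypothetical stand-in for the benchmark process.
workload = ["java", "-jar", "benchmark.jar"]
placement = assign_nodes(num_pods=10, num_numa_nodes=2)

for pod, node in placement.items():
    print(f"pod {pod}: {shlex.join(numactl_command(node, workload))}")
```

A static mapping like this has to be recomputed whenever the pod count or the host topology changes, which is part of why such pinning is hard to maintain at scale.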
The chart in Figure 5 shows the improvement that can be achieved using numactl-based pinning on the bare-metal Enterprise Linux testbed.
17% performance improvements from pod pinning on bare-metal Enterprise Linux
Note that pinning practices like this are not practical for container deployment at scale and can be error-prone across heterogeneous hardware. For this experiment, we chose a modest configuration for the memory usage of the workloads inside the pods, and the results are specific to these settings. If the workloads have higher or lower memory intensity, the impact of NUMA locality on their performance is expected to vary.
Conclusion and future work
In this test, we show that large pods running memory-bound workloads exhibit better NUMA behavior and superior performance when deployed as vSphere Native Pods on a vSphere 7-based Supervisor Cluster. We did not explore the impact on workloads that are not memory bound, nor did we test other pod sizes. As future work, we would like to test with a spectrum of pod sizes, including smaller pods, and try other workloads that have more I/O and network activity.
The results presented here are preliminary and based on products under active development. The results may not apply to your environment and do not necessarily represent best practices.