Performance Best Practices for Kubernetes with VMware Tanzu

VMware Tanzu provides an integrated solution for deploying developer and production-ready clusters for Kubernetes workloads. A new technical paper provides guidance to administrators and developers seeking to maximize workload performance and resource efficiency when configuring Kubernetes clusters and deploying workloads using VMware Tanzu. This includes guidance on choices such as:

Sizing VMware Tanzu Kubernetes Grid (TKG) clusters and selecting virtual machine classes for worker nodes
Configuring pod and container resource specifications
Choosing the right level of CPU and memory over-commitment with the least impact on workload performance

We developed and validated the best practices by performing extensive testing with the open-source performance benchmarks Weathervane and K-Bench. Here we highlight just two of the key results. See the paper for complete details and additional results.

Size worker nodes and pods based on actual resource needs

A key to maximizing workload performance is understanding the actual resource usage of each component of your workload, and then applying that knowledge when you set container resource requests and limits, configure worker nodes, and size clusters.

For example, understanding resource usage allows you to avoid over-sizing your TKG clusters. In one set of tests, performance with the Weathervane benchmark improved by 8% simply by right-sizing a cluster, which reduced overhead and achieved better co-location of workload pods.

Figure 1. Performance benefit of right-sizing CPU and memory resources (an 8% performance improvement)
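To make the right-sizing idea concrete, here is a minimal Python sketch of the arithmetic involved. It is an illustration only, not the method used in the paper: the `Resources` type, the function names, the 20% headroom policy, and all numeric values are assumptions introduced for this example. In practice you would take the observed peaks from your own monitoring data and the allocatable values from your worker nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """CPU in millicores and memory in MiB (hypothetical helper type)."""
    cpu_millicores: int
    memory_mib: int


def _scale_up(value: int, headroom_percent: int) -> int:
    # Integer ceiling of value * (1 + headroom), avoiding float rounding.
    return (value * (100 + headroom_percent) + 99) // 100


def suggest_request(observed_peak: Resources, headroom_percent: int = 20) -> Resources:
    """Suggest a container request: observed peak usage plus headroom.

    The 20% default is an assumed policy, not a recommendation from the paper.
    """
    return Resources(
        cpu_millicores=_scale_up(observed_peak.cpu_millicores, headroom_percent),
        memory_mib=_scale_up(observed_peak.memory_mib, headroom_percent),
    )


def pods_per_node(allocatable: Resources, request: Resources) -> int:
    """Number of pods that fit on one node, limited by the scarcer resource."""
    return min(
        allocatable.cpu_millicores // request.cpu_millicores,
        allocatable.memory_mib // request.memory_mib,
    )


def nodes_needed(replicas: int, per_node: int) -> int:
    """Worker nodes required to place all replicas (ceiling division)."""
    if per_node <= 0:
        raise ValueError("request does not fit on a node of this size")
    return (replicas + per_node - 1) // per_node
```

With an assumed observed peak of 250 millicores and 512 MiB, the sketch suggests a request of 300 millicores and 615 MiB. On a hypothetical node with 1900 millicores and 7000 MiB allocatable, six such pods fit, so 20 replicas need four worker nodes. Comparing that result across candidate virtual machine classes is one simple way to see whether a cluster is over-sized.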

Take advantage of over-commitment to achieve high resource utilization

When running TKG clusters on vSphere, you can improve the overall utilization of the cluster resources by over-committing the CPU and memory on the vSphere hosts. This allows clusters to use resources that would otherwise be left idle by clusters that have been allocated more resources than needed. However, it is important to monitor key performance metrics when over-committing to avoid contention and degradation of workload performance.

For example, in one set of tests we deployed an increasing number of TKG clusters, each with two best-effort-medium worker nodes and one best-effort-small control-plane node. Each cluster ran one instance of the Weathervane xsmall application. For each number of clusters, we ran Weathervane to determine the maximum load that the configuration could support. The results showed that peak performance was reached when the CPU cores were over-committed by 300%. At higher levels of over-commitment, performance dropped due to contention for the CPUs. This contention showed up as an increase in the %-ready, or readiness, metric, which you can monitor in the vCenter performance charts. Readiness gives a good indication of when the level of over-commitment is too high for the resource usage of the deployed VMs.

Figure 2. Performance impact of CPU over-commitment. The peak load was achieved with 20 TKG clusters, which represents a 300% over-commitment of the available CPU cores.
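The two quantities discussed above can be computed with a few lines of Python. This is a sketch under stated assumptions, not the calculation from the paper: the function names are hypothetical, and over-commitment is expressed here as allocated vCPUs as a percentage of physical cores, which may not match the paper's exact convention. The readiness conversion follows the commonly documented vCenter formula, in which the CPU ready summation (in milliseconds) is divided by the chart's sampling interval; 20 seconds is the interval of the real-time charts.

```python
from typing import Sequence


def cpu_overcommit_percent(vcpus_per_vm: Sequence[int], physical_cores: int) -> float:
    """Allocated vCPUs as a percentage of physical cores (assumed convention)."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return 100.0 * sum(vcpus_per_vm) / physical_cores


def cpu_ready_percent(ready_summation_ms: float, interval_seconds: float = 20.0) -> float:
    """Convert a CPU ready summation value (ms) into a percentage of the interval.

    interval_seconds defaults to 20, the real-time chart interval in vCenter.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return ready_summation_ms * 100.0 / (interval_seconds * 1000.0)


if __name__ == "__main__":
    # Hypothetical inputs: 20 clusters of 3 VMs each, assuming 2 vCPUs per VM,
    # on hosts with an assumed total of 40 physical cores.
    vms = [2] * (20 * 3)
    print(cpu_overcommit_percent(vms, physical_cores=40))
    # A VM that accumulated 1000 ms of ready time in one 20-second sample.
    print(cpu_ready_percent(1000.0))
```

Tracking both numbers together as you add clusters shows when extra over-commitment stops paying off: utilization keeps rising, but a climbing ready percentage signals that VMs are waiting for CPU time.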

Find full details in the paper Performance Best Practices for Kubernetes with VMware Tanzu.