
How VMware Tanzu Application Platform Has Improved with Frequent Longevity Testing

VMware Tanzu Application Platform is a modular, application-aware platform that provides a rich set of developer tooling and a pre-paved path to production, enabling developers to build and deploy software quickly and securely on any compliant public cloud or on-premises Kubernetes cluster.

This blog details how the VMware engineering team achieved enterprise quality for Tanzu Application Platform with longevity testing within a few months of the product’s launch. Longevity testing, also known as soak testing or endurance testing, is a test methodology in which a baseline load is kept running on a system over a prolonged operational period, with the duration depending on the complexity and size of the system under test, in order to evaluate how an enterprise software stack holds up under sustained usage. Longevity testing helped the VMware engineering team identify potential bottlenecks in the implementation of Tanzu Application Platform and ensured that it would achieve true enterprise quality.

Functions to test

The first step in any longevity testing is to identify the functions of the system that need to be exercised. For Tanzu Application Platform, the focus was on testing the supply chain for longevity. We opted to test with the Testing_Scanning supply chain, which covers most of the components of Tanzu Application Platform. We identified workload deployment, workload deletion, and workload updates as the main workflows (a sketch of these operations follows the list below), covering the following:

  • Pipeline runs
  • Source scan
  • Build
  • Image scan
  • Workload deployment – runtime
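
For illustration, these workflows map onto a handful of Tanzu CLI calls. The following is a minimal sketch that drives them from Python; the namespace, workload name, Git repository URL, and flags shown are illustrative, and the exact options accepted by the apps plugin can vary by version.

import subprocess

NAMESPACE = "longevity"          # illustrative developer namespace
WORKLOAD = "tanzu-java-web-app"  # one of the sample workloads used in the tests

def tanzu_apps(*args):
    # Run a "tanzu apps workload" subcommand and fail loudly on errors
    subprocess.run(["tanzu", "apps", "workload", *args,
                    "--namespace", NAMESPACE, "--yes"], check=True)

# Workload deployment: create the workload so the supply chain runs end to end
tanzu_apps("create", WORKLOAD,
           "--git-repo", "https://github.com/vmware-tanzu/application-accelerator-samples",
           "--sub-path", "tanzu-java-web-app",
           "--git-branch", "main",
           "--type", "web",
           "--label", "apps.tanzu.vmware.com/has-tests=true")

# Workload update: re-apply the workload with a change (here, an environment variable)
tanzu_apps("apply", WORKLOAD, "--env", "UPDATE_MARKER=1")

# Workload deletion
tanzu_apps("delete", WORKLOAD)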

Identifying the baseline configuration

We started longevity testing right from the first release (1.0.0), which offered only full-profile and lite-profile installations of Tanzu Application Platform. We focused only on the full profile, since the lite profile is aimed at local environments such as laptops. The Kubernetes distribution we focused on initially was Azure Kubernetes Service (AKS).

Starting with the 1.1.0 release, Tanzu Application Platform comes with multi-cluster environment support, with different profiles for shared iterate clusters, build clusters, production run clusters, and view clusters. Since then, we have enhanced longevity coverage to include multi-cluster environments as well. Because the full profile of Tanzu Application Platform includes all components, we were able to uncover all of the issues described below with the full profile itself. The following are the baseline configurations that we’ve been using.

We took tanzu-java-web-app and spring-petclinic as the initial workloads for longevity testing. Since then, we have also added workloads covering Services Toolkit. This blog focuses primarily on the results and improvements we saw with the full profile on an AKS cluster.

 

Cluster: Single
Cluster configuration – Full profile with Testing_Scanning supply chain
  • Public cloud – AKS cluster with 4 nodes (no autoscale)
  • Private cloud – TKGm on vSphere cluster (prod) with 4 worker nodes and 3 control plane nodes
  • Node configuration – 4 vCPUs and 16 GB RAM
Workload configuration
  • Long-running apps and churn – 4 instances of the tanzu-java-web-app and spring-petclinic workloads (1 instance of spring-petclinic using service bindings), with updates every alternate 2 hours for each app
  • Creation and deletion cycles – 1 instance each of tanzu-java-web-app and spring-petclinic going through a creation and deletion cycle every alternate hour
  • Max number of workloads at a point in time – 10 workloads, constantly accessed

Cluster: Iterate
Cluster configuration – Iterate profile with Testing supply chain
  • Public cloud – AKS cluster with 4 nodes (no autoscale)
  • Private cloud – TKGm on vSphere cluster (prod) with 4 worker nodes and 3 control plane nodes
  • Node configuration – 4 vCPUs and 16 GB RAM
Workload configuration
  • Long-running apps and churn – 5 instances of the tanzu-java-web-app workload, with a live update every 2 hours
  • Max number of workloads at a point in time – 5 workloads, constantly accessed

Cluster: Build + Production
Cluster configuration – Build profile with Testing_Scanning supply chain, plus Run profile
  • Build cluster
    • Public cloud – AKS cluster with 3 nodes (no autoscale)
    • Private cloud – TKGm on vSphere cluster (prod) with 3 worker nodes and 3 control plane nodes
    • Node configuration – 4 vCPUs and 16 GB RAM
  • Run cluster
    • Public cloud – AKS cluster with 2 nodes (no autoscale)
    • Private cloud – TKGm on vSphere cluster (prod) with 2 worker nodes and 3 control plane nodes
    • Node configuration – 4 vCPUs and 16 GB RAM
Workload configuration
  • Long-running apps and churn – 4 instances of the tanzu-java-web-app and spring-petclinic workloads (1 instance of spring-petclinic using service bindings), with updates every alternate 2 hours for each app
  • Creation and deletion cycles – 1 instance each of tanzu-java-web-app and spring-petclinic going through a creation and deletion cycle every alternate hour
  • Max number of workloads at a point in time – 10 workloads, constantly accessed

Cluster: View
Cluster configuration – View profile
  • Public cloud – AKS cluster with 2 nodes (no autoscale)
  • Private cloud – TKGm on vSphere cluster (prod) with 2 worker nodes and 3 control plane nodes
  • Node configuration – 4 vCPUs and 16 GB RAM
Workload configuration – N/A
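
The "constantly accessed" note in the baseline configuration is itself part of the load. A minimal sketch of such an access loop is shown below; the workload URLs are illustrative placeholders for the routes the platform exposes for each deployed workload.

import time
import urllib.request

# Illustrative URLs; in practice these come from the route/URL reported by
# "tanzu apps workload get <name>" for each deployed workload.
WORKLOAD_URLS = [
    "https://tanzu-java-web-app.longevity.example.com",
    "https://spring-petclinic.longevity.example.com",
]

while True:
    for url in WORKLOAD_URLS:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                print(url, resp.status)
        except Exception as exc:  # a failed request is a data point, not a reason to stop
            print(url, "ERROR", exc)
    time.sleep(60)  # keep polling for the duration of the run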

General parameters to check

The following are the major parameters that we monitored as effects of long-running workflows in a cluster with Tanzu Application Platform installed (a sketch of how to collect them follows this list):

  • Node-wise memory utilization
  • Node-wise CPU utilization
  • Component container memory utilization per namespace
  • Component container CPU utilization per namespace
  • Pod restarts
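
A minimal sketch of pulling these numbers with the Kubernetes Python client follows. It assumes a metrics.k8s.io provider (such as metrics-server) is available on the cluster; everything else uses only the standard client APIs.

from kubernetes import client, config

config.load_kube_config()
metrics = client.CustomObjectsApi()

# Node-wise CPU and memory utilization from the metrics API
nodes = metrics.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")
for node in nodes["items"]:
    print(node["metadata"]["name"], node["usage"]["cpu"], node["usage"]["memory"])

# Component container CPU and memory utilization per namespace
pods = metrics.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "pods")
for pod in pods["items"]:
    namespace = pod["metadata"]["namespace"]
    for container in pod["containers"]:
        print(namespace, pod["metadata"]["name"], container["name"],
              container["usage"]["cpu"], container["usage"]["memory"])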

Tooling to monitor

We used Tanzu Observability by Wavefront to capture the aforementioned resource utilization metrics.
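
The Wavefront Kubernetes collector typically ships node, pod, and container metrics automatically once a cluster is onboarded. For custom counters produced by our scripts, points can also be pushed to a Wavefront proxy in its plain-text line format. The sketch below is illustrative only; the proxy hostname and metric name are assumptions.

import socket
import time

# Wavefront line format: "<metric> <value> [<timestamp>] source=<source> [tag=value ...]"
point = (f"tap.longevity.workload.updates 1 {int(time.time())} "
         "source=aks-longevity cluster=single-full-profile\n")

# Hypothetical proxy address; Wavefront proxies accept plain-text points on port 2878
with socket.create_connection(("wavefront-proxy.example.internal", 2878)) as conn:
    conn.sendall(point.encode("utf-8"))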

Parameters specific to Tanzu Application Platform

We captured the number of config maps and api-resources generated by each component of Tanzu Application Platform. This helped identify cases where garbage collection was not happening and where phantom diffs were being generated by different components of Tanzu Application Platform; a sketch of the config map count is shown below.
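
A minimal sketch of the config map count, using the Kubernetes Python client, follows; a count that keeps growing in a component's namespace is the signature of garbage collection not keeping up. The api-resource counts were tracked in a similar way.

from collections import Counter
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Count ConfigMaps per namespace; a steadily growing count in one component's
# namespace points to garbage collection falling behind
counts = Counter(cm.metadata.namespace
                 for cm in core.list_config_map_for_all_namespaces().items)
for namespace, total in counts.most_common():
    print(namespace, total)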

Class of issues

  • Pod restarts (a detection sketch for restarts and OOM kills follows this list)
  • Nodes hitting OOM (out of memory)
  • Nodes going down
  • Memory spikes across different components
  • Components generating huge numbers of config maps and phantom diffs
  • Workload updates not going through
  • Garbage collection of resources not happening across components
  • Low resource limits allocated to components, leading them to crash
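
Several of these issue classes, in particular pod restarts and containers being OOM-killed, can be detected directly from pod status. A minimal sketch with the Kubernetes Python client follows; the restart threshold is an illustrative value.

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

RESTART_THRESHOLD = 100  # illustrative; tune to what counts as excessive for the run

# Flag containers that were OOM-killed or are restarting excessively
for pod in core.list_pod_for_all_namespaces().items:
    for status in pod.status.container_statuses or []:
        terminated = status.last_state.terminated if status.last_state else None
        if terminated and terminated.reason == "OOMKilled":
            print("OOMKilled:", pod.metadata.namespace, pod.metadata.name, status.name)
        if status.restart_count >= RESTART_THRESHOLD:
            print("High restarts:", pod.metadata.namespace, pod.metadata.name,
                  status.name, status.restart_count)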

Findings

Our testing surfaced many issues across the classes listed in the previous section. Here, we call out those that paved the way for significant improvements in Tanzu Application Platform.

  • Tanzu Application Platform 1.0.0 – With the first run, we observed that the cluster did not even hold up for two days. The nodes hit out of memory (OOM), and one node even went down. Workload updates stopped working after two days. A couple of major issues were identified, with Cartographer sync time and with Tekton pipeline garbage collection. The Cartographer sync changes were made in the same release and improved the uptime to four days; the nodes still hit OOM, but only on the fifth day. Multiple pods restarted more than 1,000 times.
  • Tanzu Application Platform 1.0.1 – With this release, a few more improvements were made around Tekton pipeline garbage collection, bumping up the resources allocated to Cartographer, and garbage collection of runnables. This further improved cluster performance, and we did not hit OOM conditions. Workload updates still stopped after 10 days. Fewer pods restarted 1,000 times. Peak memory utilization was 95 percent.
  • Tanzu Application Platform 1.0.2 – With api-resource counts now being collected, we observed that phantom diffs generated by different components accounted for much of the cluster resource utilization. A concerted effort by the component teams to fix this resulted in considerable improvement in cluster uptime. Only one pod restarted, a few hundred times. Workload updates continued for 12 days. Peak memory utilization was 85 percent. With 1.0.2, we also started testing on Tanzu Kubernetes Grid multi-cloud (TKGm) on vSphere. We found TKGm on vSphere performed better than AKS, with workload updates continuing for 20 days and peak memory utilization at 50 percent. The pod that restarted also did so fewer than 100 times, compared to AKS.
  • Tanzu Application Platform 1.1.0 – A few phantom diff fixes that were pending from the prior release landed in this release, which contributed further improvements. Only one pod restarted, a few hundred times. Workload updates continued for 14 days. Peak memory utilization was 80 percent. The latest kapp-controller changes also improved cluster performance. We also expanded the scope of testing to cover more Kubernetes distributions and found that different distributions behaved differently.
  • Tanzu Application Platform 1.2.0 – A few other issues around pod restarts and memory utilization spikes were fixed, and workload updates continued for more than 30 days.

The improvements are best summarized by the following charts.

The value of iterative improvement

This process of longevity testing, along with the creation of static longevity clusters that represent a true customer-like environment, has helped improve Tanzu Application Platform and move it toward achieving enterprise quality. We continue to do this at a regular cadence, finding new issues and fixing regressions. We are also adding support for more Kubernetes distributions in Tanzu Application Platform, along with introducing new features, workloads, and workflows with new releases.