
Application Resiliency for Cloud Native Microservices with VMware Tanzu Service Mesh

Modern microservices-based applications bring with them a new set of challenges when it comes to operating at scale across multiple clouds. The goal of most modernization projects is to increase the velocity at which business features are delivered, and that increased speed demands a highly flexible, microservices-based architecture. The result is that the architectural convenience created on day 1 by developers turns into a challenge for site reliability engineers (SREs) on day 2.

Developers expect the business features to work at scale and exhibit certain performance characteristics, but they may not know what that will ultimately cost or have the necessary compute capacity. SREs, on the other hand, have the needed compute capacity but may not know the best way to scale the microservices to meet the set performance objectives. This situation can very quickly escalate into release slowdowns and miscommunication between teams, which in turn create resiliency issues for highly distributed applications.

What is needed is a more automated approach in the form of a contractual agreement between developers, who define service-level indicators (SLIs) for their services, and SREs, who in turn use those SLIs to define service-level objectives (SLOs). At VMware, we think of such an agreement as an SLO policy.

In this post, we’re going to demonstrate how you can set up an SLO in Tanzu Service Mesh.     

The SLIs in SLOs

In Tanzu Service Mesh, an SLO is composed of multiple SLIs. Developers communicate with SRE teams to help identify baseline SLIs so that they can configure SLOs in the various production environments and help improve application resiliency through constant iteration. In the world of microservices, an application may comprise one or more application domains. These domains can be created to deconstruct a larger application; to represent different environments, such as development, staging, and production, or different platforms; or to maintain separation between the concerns of operators, application owners, and developers.

Tanzu Service Mesh implements the application domain construct through the Global Namespace (GNS). Developers use these namespaces to deploy microservices, while application operators and SREs define SLOs for those applications with agreed-upon performance objectives expressed through various SLIs.

The GNS in Tanzu Service Mesh binds the SLO agreement into the system so that all services within its scope can adhere to it. Once the system can execute according to the SLOs that developers, product owners, and SREs have agreed to, the GNS can be thought of as a key architectural pattern for layering cloud native applications.

Building a resilient application

Let’s walk through how you can build a distributed resilient application using Tanzu Service Mesh. 

The first step is to create a GNS in Tanzu Service Mesh. You can create a GNS that spans one or more Kubernetes clusters, which could then be deployed on-prem and in one or more clouds.   
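To make that first step concrete, here is a minimal sketch of what a GNS conceptually captures: a name, an internal service domain, and the cluster/namespace pairs it spans. The field names, cluster names, and domain below are illustrative assumptions for this sketch, not the Tanzu Service Mesh API schema.

```python
# Conceptual model of a Global Namespace (GNS); field names are illustrative
# assumptions, not the actual Tanzu Service Mesh API schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class NamespaceMapping:
    cluster: str    # a Kubernetes cluster onboarded to Tanzu Service Mesh
    namespace: str  # the namespace on that cluster whose services join the GNS


@dataclass
class GlobalNamespace:
    name: str                                # e.g., "checkout-gns" (hypothetical)
    domain: str                              # internal service domain, e.g., "checkout.internal"
    mappings: List[NamespaceMapping] = field(default_factory=list)


# A GNS spanning an on-prem cluster and a public-cloud cluster.
gns = GlobalNamespace(
    name="checkout-gns",
    domain="checkout.internal",
    mappings=[
        NamespaceMapping(cluster="onprem-cluster-1", namespace="checkout"),
        NamespaceMapping(cluster="aws-cluster-1", namespace="checkout"),
    ],
)
```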

Next, monitor the health of your application by defining an SLO. You can use metrics such as CPU, memory, and latency as your SLIs, set thresholds for each, and use them to define the SLO.
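As a rough illustration of how an SLO is composed of SLIs, the sketch below models an SLO as a target compliance percentage plus a set of SLI thresholds, and checks a single sample of measurements against them. The SLI names, threshold values, and class shapes are assumptions chosen for the example, not the product's data model.

```python
# Conceptual sketch: an SLO composed of SLIs, with a simple per-sample check.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SLI:
    name: str         # e.g., "p99_latency_ms", "cpu_usage_pct" (hypothetical names)
    threshold: float  # the measurement must stay at or below this value


@dataclass
class SLO:
    name: str
    target_pct: float  # portion of time the SLIs must be met, e.g., 99.9
    slis: List[SLI]


def slis_met(slo: SLO, measurements: Dict[str, float]) -> bool:
    """True if every SLI in the SLO is within its threshold for this sample."""
    return all(measurements.get(s.name, float("inf")) <= s.threshold for s in slo.slis)


checkout_slo = SLO(
    name="checkout-availability",
    target_pct=99.9,
    slis=[
        SLI(name="p99_latency_ms", threshold=250.0),
        SLI(name="cpu_usage_pct", threshold=80.0),
    ],
)

sample = {"p99_latency_ms": 180.0, "cpu_usage_pct": 72.0}
print(slis_met(checkout_slo, sample))  # True: this sample counts toward the 99.9% target
```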

As demand for the application grows, it should be able to react to the increased usage. With that in mind, be sure to set up an autoscaling policy.  

You can also set your SLO to influence the autoscaling behavior of your applications. 

Tanzu Service Mesh monitors the SLIs for each microservice and automatically scales the application based on them. Depending on your needs, you can set the autoscaling policy to be efficiency-based (i.e., it scales down when demand drops) or put it in performance mode, which means it will never scale down.
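The two behaviors can be pictured with a small decision function like the one below. It simply mirrors the description above (scale up when an SLI is breached; scale down only in efficiency mode when demand is low) and is a sketch under those assumptions, not the product's actual scaling algorithm; the mode names and replica bounds are illustrative.

```python
# Illustrative sketch of efficiency vs. performance autoscaling behavior.
def next_replica_count(current: int, sli_breached: bool, low_demand: bool,
                       mode: str, min_replicas: int = 2, max_replicas: int = 10) -> int:
    """Return the replica count after one evaluation cycle."""
    if sli_breached:
        # Scale up whenever an SLI threshold is violated, regardless of mode.
        return min(current + 1, max_replicas)
    if mode == "efficiency" and low_demand:
        # Efficiency mode reclaims capacity when demand drops.
        return max(current - 1, min_replicas)
    # Performance mode (or steady demand) holds the current count: never scale down.
    return current


print(next_replica_count(4, sli_breached=True,  low_demand=False, mode="performance"))  # 5
print(next_replica_count(5, sli_breached=False, low_demand=True,  mode="efficiency"))   # 4
print(next_replica_count(5, sli_breached=False, low_demand=True,  mode="performance"))  # 5
```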

You can also continue to deploy your applications on more clusters and namespaces in order to increase capacity or to support disaster recovery. Policies for the GNS in Tanzu Service Mesh will automatically be applied to these services as you scale out your infrastructure and add more namespaces. This video walks you through these new capabilities.

Stay tuned for our next post, which will cover a real-world scenario in which Tanzu Service Mesh SLOs and autoscaling policies are deployed and managed.