Service mesh is quickly becoming a fact of life for modern apps, and many companies are choosing this approach for their distributed microservices communications. While most examples of service mesh focus only on the east-west aspect of app service communications and security, Tanzu Service Mesh aims to bring the entire application transaction, both east-west and north-south communications, into the mesh.

In previous blogs and articles (here and here), we dug into the core construct of the system, called Global Namespace (GNS). GNS is the instantiation of application connectivity patterns and services. In the case we are describing here, one of these services consists of “northbound” access to the application in a resilient configuration through integration with a Global Server Load Balancing (GSLB) solution. In the current version of the service, we support the following integrations:

  1. VMware NSX-ALB (aka Avi Networks) – VMware’s own complete software load-balancing solution.
  2. AWS Route 53 – AWS DNS service providing GSLB services for resiliency. This is useful for customers who do not own NSX-ALB.

In this first blog, we’ll describe how the solution works with AWS Route 53 and how to configure it. In a later post, we’ll do the same with the NSX-ALB load balancer.

First, let’s touch on why you would want to even set up your application’s public access and availability using service mesh. There are a few reasons:

User access is not someone else’s problem

Service mesh solutions, as the name implies, focus on services communicating with other services, but the application transaction actually starts with a user who is accessing a published service. That service in turn accesses other services as needed, and there is usually persistent data being accessed as well.

Persistent Data is Being Accessed

(Persistent data is an integral part of the app, and we at VMware think that “the mesh” should address it as well, but that conversation is beyond the scope of this post.)

In terms of user access, let’s examine the northbound access through which a service is published, usually via a GSLB solution that provides health checks and failover capabilities for resiliency purposes, as seen in the following diagram:

GSLB Health Checks Provide Resiliency

When organizations treat application publishing and resiliency as an out-of-band process from the application itself, they make the publication of an app a disruptive and time-consuming process, both from a workflow point of view and technologically. That means that even if we are super agile in our dev and devOps practices, we’re nevertheless introducing a process that relies on tickets and handoffs — which ultimately means time wasted waiting for public access to be set.

On the other hand, by tying public access into the service mesh, we not only provision resilient access to our application faster and more accurately — remember no handoffs — but we can also add more value on top of it, with benefits like controlling failovers based on cascading problems within our application chain that may not be visible through traditional GSLBs but are visible to the service mesh.

This is why TSM is integrating GSLB capabilities straight into the mesh: to provide TSM users with a streamlined workflow for publishing applications through a resiliency service, while taking care of configuration, registration, and certificate assignment on their behalf.

In addition, there are direct availability advantages to this type of integration. For example, we can reduce downtime in cases where the published service is down but the cluster and gateways are still up. A GSLB has a detection window during which it keeps sending traffic to an endpoint even though the published service behind it is down, sometimes referred to as “blackholed traffic”. In AWS Route 53, for example, this window is 60 seconds by default, which means that if the published service goes down, traffic can potentially be blackholed for 60 seconds. With TSM, the ingress gateway running on the destination cluster detects the failure first and shifts traffic to the other cluster while the GSLB catches up.

TSM Gateway Failover

This would, of course, only work when the cluster and the gateways themselves are not down and only the published app services are down.

In the event the cluster or gateway is also down, it is considered a full deployment failure, and a site failover is initiated by the GSLB.
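
To make that detection window concrete, here is a minimal sketch of the two Route 53 knobs that govern how quickly the GSLB itself notices a dead endpoint. This is plain boto3 against Route 53, not anything TSM does internally, and the gateway hostname and health path are placeholders; the product of RequestInterval and FailureThreshold roughly bounds how long traffic keeps flowing to a failed site, on top of any client-side DNS caching.

```python
import uuid

import boto3

route53 = boto3.client("route53")

# Placeholder values -- substitute your own ingress gateway endpoint.
GATEWAY_FQDN = "gw-us-east.example.com"  # hypothetical gateway hostname
CHECK_PATH = "/healthz"                  # hypothetical health endpoint

# RequestInterval * FailureThreshold roughly bounds how long Route 53 keeps
# sending traffic to a dead endpoint: 10s * 3 failures is about 30 seconds of
# potentially "blackholed" traffic, plus whatever clients cache via DNS TTL.
resp = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),   # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": GATEWAY_FQDN,
        "Port": 443,
        "ResourcePath": CHECK_PATH,
        "RequestInterval": 10,           # seconds between probes (10 or 30)
        "FailureThreshold": 3,           # consecutive failures before "unhealthy"
    },
)
print("Created health check:", resp["HealthCheck"]["Id"])
```

The gateway-level failover described above sidesteps that window, because the ingress gateway reroutes traffic as soon as its own checks fail, without waiting for DNS to converge.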

Let’s now look at how to set this up with AWS Route 53.

Making it work

With all automation, there’s always a bit of prep work to set things up, and then every subsequent run becomes simple. Similarly, here we need to set up the integration and, once it is stood up, publishing an app becomes easy.

What do you need to make this work?

  • We need a TSM org on which we will initially set up the integration and configure a GNS.
  • We also need an AWS account and a user with privileges to configure Route 53. To configure a user in AWS, see this.

I suggest also having a designated DNS zone for the GNS deployments so that they are separate from your main zone.
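
On the privileges themselves, check the TSM documentation for the exact permission set the integration requires; the sketch below is only an illustration of the kind of IAM policy a Route 53 GSLB integration user typically needs (record and health-check management), with a hypothetical policy name.

```python
import json

import boto3

iam = boto3.client("iam")

# Illustrative only -- not the official TSM permission list. This simply
# grants the Route 53 record and health-check operations a GSLB-style
# integration would need.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "route53:ListHostedZones",
                "route53:GetHostedZone",
                "route53:ListResourceRecordSets",
                "route53:ChangeResourceRecordSets",
                "route53:CreateHealthCheck",
                "route53:UpdateHealthCheck",
                "route53:DeleteHealthCheck",
                "route53:GetHealthCheck",
                "route53:ListHealthChecks",
                "route53:GetChange",
            ],
            "Resource": "*",
        }
    ],
}

iam.create_policy(
    PolicyName="tsm-route53-integration",  # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
)
```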

The first step is to set up the integration. In TSM, go to Tanzu Admin > Integrations, and click “Edit” on AWS.

In the next screen, enter a name for this integration, and paste your AWS user access key ID and secret access key.

AWS Secret Key Access for Integration
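
Before pasting the keys in, it can be worth a quick pre-flight check that they actually work and can see your hosted zones. A minimal boto3 sketch with placeholder credentials (in practice, prefer a named profile or environment variables):

```python
import boto3

# Placeholders -- use the access key ID and secret you intend to give TSM.
route53 = boto3.client(
    "route53",
    aws_access_key_id="AKIA...",      # placeholder
    aws_secret_access_key="...",      # placeholder
)

# If this lists your zones (including the designated GNS zone), the
# credentials are good to paste into the integration screen.
for zone in route53.list_hosted_zones()["HostedZones"]:
    print(zone["Name"], zone["Id"])
```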

If you would like to configure your public services to be accessed with HTTPS, you will need to import a certificate. You can add a specific “vanity URL” certificate that you choose for each public service in the next steps of the GNS configuration, or you can import a wildcard certificate such as *.domain to be used by all public services that are part of that domain. To import a certificate, go to Tanzu Admin > Keys & Certificates.

Click “Add New Certificate”. On the “New Certificate” page, provide a name for the certificate and import the certificate file and private key.

Add New Certificate
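
As an optional sanity check before importing, you can confirm the certificate really covers the hostnames you plan to publish, for example that a wildcard matches your designated GNS zone. Here is a small sketch using Python’s cryptography package (any TLS tooling, such as openssl, works just as well); the file name is a placeholder.

```python
from cryptography import x509

# Placeholder file name -- point this at the certificate you plan to import.
with open("wildcard-apps-example-com.pem", "rb") as f:
    cert = x509.load_pem_x509_certificate(f.read())

# Print the subject, the DNS names the certificate covers, and its expiry so
# you can verify it matches the vanity URLs you will configure in the GNS.
san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
print("Subject:", cert.subject.rfc4514_string())
print("DNS SANs:", san.value.get_values_for_type(x509.DNSName))
print("Expires:", cert.not_valid_after)
```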

That’s it! Your system is ready to configure public services.

Now, to configure a public service in a GNS:

  • In the GNS creation page, after adding services, we reach the “Public services” step. Click “Configure public services”.
  • Next, select the service to be published — that is, the service you want to send traffic to. In the screenshot below, I chose the “shopping” service. This means that anywhere this service exists, Route 53 will configure an entry. Even if we add a new site and deploy the shopping service to it, as soon as we add that site to the GNS, it will detect the change and configure Route 53 automatically to send traffic to it, supporting automatic “cloud bursting”.
  • Once we select the service, the internal “service port” will be automatically configured. This is the port the ingress gateway on the local cluster uses to communicate with the internal service.
  • Next, we choose the protocol. If HTTPS is selected, then you also need to choose the certificate you configured in the previous section.
  • Lastly, we need to set the URL from which the service is accessed. Enter the prefix, and the system will show the zones that the integration user is allowed to configure (a quick way to check the zone is sketched after this list).
  • At this point, you can configure more public services or click Next.
Configure Public Services
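
Here is the zone check mentioned above: a small boto3 sketch, with a placeholder zone and prefix, to confirm the delegated zone is visible to the integration user and the planned hostname is not already taken.

```python
import boto3

route53 = boto3.client("route53")

# Placeholders: the delegated GNS zone and the vanity prefix chosen above.
ZONE_NAME = "apps.example.com."
PREFIX = "shopping"
FQDN = f"{PREFIX}.{ZONE_NAME}"

# Find the hosted zone the integration will write into...
zone = next(
    (z for z in route53.list_hosted_zones()["HostedZones"] if z["Name"] == ZONE_NAME),
    None,
)
if zone is None:
    raise SystemExit(f"Zone {ZONE_NAME} is not visible to this user")

# ...and check whether the planned record name already exists.
records = route53.list_resource_record_sets(
    HostedZoneId=zone["Id"],
    StartRecordName=FQDN,
    MaxItems="1",
)["ResourceRecordSets"]

if records and records[0]["Name"] == FQDN:
    print(f"{FQDN} already exists ({records[0]['Type']} record)")
else:
    print(f"{FQDN} is free for TSM to create")
```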

In the “Health checks” page, click “Configure Service Health Checks” and then “Default TSM Health Checks”. (We will add more control over health checks in coming releases.)

Configure Health Checks

That’s all there is to it.

At this point, TSM will go to Route 53 and configure round-robin load balancing (equal weights) with entries for all sites where the published service exists. The application will be accessible from the URL you configured. This eliminates weeks of work and handoffs between teams to set up public access, while allowing the mesh to expand dynamically and automatically, both east-west AND north-south.
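
For a concrete picture of what ends up in the hosted zone, here is a boto3 sketch of the kind of weighted, health-checked records this produces. It is not TSM’s actual implementation, and the zone ID, gateway IPs, and health check IDs are placeholders.

```python
import boto3

route53 = boto3.client("route53")

# Placeholders -- TSM creates and maintains these records for you.
HOSTED_ZONE_ID = "Z0123456789ABCDEFGHIJ"
FQDN = "shopping.apps.example.com."
sites = [
    {"id": "cluster-us-east", "ip": "198.51.100.10", "health_check": "hc-east-id"},
    {"id": "cluster-us-west", "ip": "203.0.113.20", "health_check": "hc-west-id"},
]

# One weighted record per site; equal weights give round-robin distribution,
# and the attached health check pulls a site out when it goes unhealthy.
changes = [
    {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": FQDN,
            "Type": "A",
            "SetIdentifier": site["id"],
            "Weight": 1,
            "TTL": 60,
            "HealthCheckId": site["health_check"],
            "ResourceRecords": [{"Value": site["ip"]}],
        },
    }
    for site in sites
]

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Comment": "per-site records for a GNS public service", "Changes": changes},
)
```

With equal weights, repeated lookups against the published name should rotate across the healthy sites, which is an easy way to confirm the setup from the outside.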

In the next phases of Service Mesh, we’ll continue to add more sophisticated load balancing policies and more intelligent health checks directly from the TSM GNS page.

I hope this helps you see how VMware Tanzu Service Mesh is making the process of publishing an app easier and faster. The next parts of the blog series will look at additional ways you can use Tanzu Service Mesh to make your applications more resilient.