This blog post has been co-authored by Rishi Sharda.
Businesses today run on a complex hybrid web of public and private infrastructure services, platforms, middleware, and container-based applications in order to deliver the agility customers demand. While this is great for customers, it poses a very real risk, not to mention a lot of work and headaches, for the people operating the business. To manage this operational complexity, gain deeper insight into problems when they occur, and catch issues before they become problems, we need an observability platform that keeps tabs on all of it. In this series that platform is VMware Aria Operations for Applications (Aria Apps for short, formerly Wavefront).
As a business you want to keep your eye on four major aspects of your infrastructure:
- Health of your public/private cloud infrastructure and services
- Health of your Kubernetes platform
- Health of your application
- Health of your business running through that application
Most CXOs, application owners, SREs, and infrastructure teams consider this an impossible goal to achieve from a single platform, one that takes months of time and money spread across a variety of services to get right. While that is true if you are still using traditional tools, it is less than a week’s work with a platform like Aria Operations for Applications by VMware.
While you may think this is a tall claim from a virtualization company … we aim to back it up through a series of blog posts in which we show you how to gain visibility into every layer of your infrastructure and present views like the one below to your CXOs:
Let us look at this in a bit more detail:
- Indicator layer: The top layer indicates the overall health of each layer in your infrastructure. Each indicator changes color based on customizable thresholds, so you can see the state of a layer at a glance.
- Business Insights: The section just below shows overall indicators for key business areas. In this example it shows payment information: the total payments made, the percentage that were successful, and the split between successful and failed payments.
- Key Aspect Details:
  - Application Services Health: An overview of the services that make up your application, showing how each one is responding to customer requests in real time.
  - Kubernetes Pods Health: The health of the Kubernetes pods that support those services, and how they are performing on key metrics such as CPU and memory.
  - AWS Services Health: A quick way to identify performance problems with the AWS services you use.
- Business trends and spends: This business KPI section gives key indicators such as the AWS spend this month and whether the business is doing better or worse than last week, month, or quarter.
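To make the Business Insights numbers concrete, here is a minimal sketch of how the payment indicators (total, success percentage, success/failure split) could be derived from raw payment outcomes. The function name and the `"success"`/`"failed"` status strings are assumptions made for illustration; this is not how Aria Operations for Applications computes them internally.

```python
def payment_summary(statuses):
    """Summarise payment outcomes given as "success" or "failed" strings.

    Hypothetical helper: any status other than "success" counts as failed.
    """
    total = len(statuses)
    succeeded = sum(1 for s in statuses if s == "success")
    failed = total - succeeded
    # Guard against division by zero when no payments were made.
    success_pct = 100.0 * succeeded / total if total else 0.0
    return {
        "total": total,
        "succeeded": succeeded,
        "failed": failed,
        "success_pct": success_pct,
    }


# Illustrative sample data, not taken from the dashboard.
summary = payment_summary(["success", "success", "failed", "success"])
print(summary)
```

In a real dashboard these values would come from metrics queries rather than an in-memory list, but the arithmetic behind the tiles is the same.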
The next part of the dashboard is about troubleshooting. For this particular dashboard we mapped out the customer journey through the application in a subscription scenario. Working with the engineering team, we came up with a list of URLs or APIs they were interested in monitoring, and then laid out the subscription journey in terms of those URLs.
The overall subscription flow has 9 steps in this scenario. We’ve highlighted 3 aspects of the flow, alongside the application map on the right that shows how the services communicate with each other.
The 3 aspects are the number of requests coming in for each API, how many of them ended in errors, and how long they took. These stats show how quickly each API responds and make it clear which APIs have a higher number of errors or respond very slowly. This way you can break down the journey to see where the bottlenecks are and where the drop-offs happen.
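The three aspects above (request count, errors, and delay per API) can be sketched as a simple aggregation. The API paths, the record layout, and the choice to count HTTP 5xx responses as errors are all assumptions for illustration; the real dashboard is driven by platform queries, not by this code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request records: (api_path, http_status, latency_ms).
requests = [
    ("/subscribe/start", 200, 120),
    ("/subscribe/start", 200, 140),
    ("/subscribe/plan", 200, 310),
    ("/subscribe/plan", 500, 900),
    ("/subscribe/payment", 500, 1500),
    ("/subscribe/payment", 200, 700),
    ("/subscribe/payment", 502, 1700),
]


def api_stats(records):
    """Return request count, error count, error % and mean latency per API."""
    grouped = defaultdict(list)
    for path, status, latency in records:
        grouped[path].append((status, latency))

    stats = {}
    for path, rows in grouped.items():
        # Assumption: only 5xx responses are treated as errors.
        errors = sum(1 for status, _ in rows if status >= 500)
        stats[path] = {
            "requests": len(rows),
            "errors": errors,
            "error_pct": 100.0 * errors / len(rows),
            "avg_latency_ms": mean(latency for _, latency in rows),
        }
    return stats


stats = api_stats(requests)
# The APIs most worth investigating first.
worst = max(stats, key=lambda p: stats[p]["error_pct"])
slowest = max(stats, key=lambda p: stats[p]["avg_latency_ms"])
print(worst, slowest)
```

Sorting the journey steps by error percentage or average latency is one way to surface the bottleneck step quickly.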
You can also draw some conclusions about the subscription process. For example, about 2,000 people started a particular subscription but only 94 completed it up to the final payment, so what went wrong for the other 1,900 or so? That is a high number, and the possible explanations range from a faulty procedure or slow response times to payment plan options that need to change or a payment gateway that should be replaced with a different or better one.
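The drop-off reasoning above is simple funnel arithmetic, sketched below. Only the first and last counts (about 2,000 started, 94 completed) come from the example in the text; the step names and the intermediate counts are hypothetical and are there only to show a step-by-step breakdown.

```python
def funnel_report(steps):
    """steps: ordered list of (step_name, users_reaching_step).

    Returns (step_name, users_lost, pct_lost_from_previous_step) per step.
    """
    report = []
    for (name, count), (_, prev) in zip(steps[1:], steps[:-1]):
        lost = prev - count
        pct_lost = 100.0 * lost / prev if prev else 0.0
        report.append((name, lost, pct_lost))
    return report


steps = [
    ("started subscription", 2000),     # approximate figure from the text
    ("chose a plan", 1200),             # hypothetical
    ("entered payment details", 400),   # hypothetical
    ("completed final payment", 94),    # figure from the text
]

report = funnel_report(steps)
total_lost = steps[0][1] - steps[-1][1]
overall_completion_pct = 100.0 * steps[-1][1] / steps[0][1]

for name, lost, pct_lost in report:
    print(f"{name}: lost {lost} users ({pct_lost:.1f}% of previous step)")
print(f"overall: {total_lost} dropped, {overall_completion_pct:.1f}% completed")
```

With the figures from the text, fewer than 5% of started subscriptions reach the final payment, which is why the step with the largest loss is the first place to look.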
These numbers give you insight into the customer’s journey and into what needs to change at your end, and by how much, so that customers can use your application and services effectively. That is a win for you and for the customer alike.
The idea is to give you an easy way to troubleshoot and get to the root cause very quickly by giving you a holistic map of your process, your application, and the interactions between the services in the application.
Interactions between services and components are color coded by error-rate thresholds: an error rate between 2% and 5% is marked amber, and anything above 5% is marked red.
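As a minimal sketch, the color coding described above maps an error percentage to a color like this. The text does not say what happens below 2% or exactly at the boundaries, so the "green" default and the inclusive handling of 2% and 5% are assumptions, as is the function name.

```python
def error_color(error_pct):
    """Map an error percentage to a status color.

    From the text: 2-5% is amber, above 5% is red.
    Assumptions: below 2% is "green"; exactly 2% and exactly 5% are amber.
    """
    if error_pct > 5.0:
        return "red"
    if error_pct >= 2.0:
        return "amber"
    return "green"


for pct in (0.5, 2.0, 4.9, 5.0, 7.3):
    print(pct, error_color(pct))
```

Since the thresholds on the dashboard are customizable, in practice the 2 and 5 would be parameters rather than constants.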
If you need to go into the detail of a trace, you can simply click through to the actual trace for the application. Aria Apps highlights the longest path in the whole call stack with a yellow or orange line running through it, and makes it easy to get to the related logs. The log messages show why the request took a long time; digging deeper, you find that an internal server error occurred. This lets us correlate logs with the actual request failure and quickly see the reason for it.
As you can see, all your observability pillars are covered in a single platform. You can share these findings with the dev team so they can get to the root cause of the issue much faster and more efficiently.
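To illustrate what "the longest path in the call stack" means, here is a simplified sketch that walks a trace from its root span and, at each level, follows the child span with the largest duration. The span data, service names, and the greedy approach are all assumptions for illustration; this is not the algorithm Aria Apps uses, and it ignores details such as concurrent child spans.

```python
# Hypothetical spans: span_id -> (parent_id, operation, duration_ms).
spans = {
    "a": (None, "gateway /subscribe/payment", 1500),
    "b": ("a", "subscription-service charge", 1400),
    "c": ("a", "auth-service verify", 60),
    "d": ("b", "payment-gateway authorize", 1300),
    "e": ("b", "db insert", 40),
}


def longest_path(spans):
    """Return span ids from the root down, always following the slowest child."""
    children = {}
    root = None
    for span_id, (parent, _, _) in spans.items():
        if parent is None:
            root = span_id
        else:
            children.setdefault(parent, []).append(span_id)

    path = []
    current = root
    while current is not None:
        path.append(current)
        kids = children.get(current, [])
        # Descend into the child that took the longest, if there is one.
        current = max(kids, key=lambda s: spans[s][2]) if kids else None
    return path


path = longest_path(spans)
print(" -> ".join(spans[s][1] for s in path))
```

In this made-up trace the path ends at the payment gateway call, which is the span whose logs you would open first.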
Essentially you build a business view and from there on you can drill down into any aspect just like in the subscription example above. So what are we waiting for … Let’s begin!
You can also learn more about other VMware Aria Operations for Applications features on TechZone!