By: Brian Danaher, Director of IT Architecture and Tools; Jois Raghavendra B P, Senior Manager, IT Tools; and Ravishankar Rao, Enterprise Monitoring Tools Architect
As VMware continued to invest in SaaS and build out its SaaS technology stacks, its IT team faced new challenges: identifying anomalous behavior in a timely manner and isolating faults faster. The key to addressing them was to collate availability and performance metrics for critical applications and services onto a centralized platform.
The interconnected layers of applications in the stacks, and their dependent services, required continuous monitoring to ensure issues were accurately identified and quickly addressed. Otherwise, the IT team faced a possible cascade of service failures.
Time for Better Tools to Manage Risks
The team knew it was time to take steps to ensure continuous monitoring and a rapid response to new issues, without requiring SMEs to review different monitoring dashboards 24/7. There had to be a better way to ensure high levels of system availability and performance, specifically in how they monitored and observed their application stacks. They needed tools to enhance monitoring and observability, and to provide better insights than their manual monitoring practices.
While there were multiple tools for anomaly identification, correlation, and hence faster fault isolation, remained a key challenge given the complexity and interdependency of systems and application stacks. This interdependency meant that an issue in a downstream system could have repercussions, and a cascading effect, on upstream systems. For example, a network blip can escalate into multiple application failures, whereas one node in a cluster exhibiting high CPU utilization need not be construed as an application outage.
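The single-node example above can be sketched as a cluster-level check. This is a minimal illustration and not the VMware team's implementation: the function name `cluster_state`, the 90% CPU limit, and the 50% outage fraction are all hypothetical values chosen for the example.

```python
def cluster_state(node_cpu, cpu_limit=90.0, outage_fraction=0.5):
    """Classify a cluster from per-node CPU utilization (percent).

    cpu_limit and outage_fraction are illustrative assumptions,
    not values taken from the article.
    """
    if not node_cpu:
        return "unknown"
    # Nodes whose CPU utilization is at or above the limit.
    hot_nodes = [name for name, cpu in node_cpu.items() if cpu >= cpu_limit]
    fraction_hot = len(hot_nodes) / len(node_cpu)
    if fraction_hot == 0:
        return "healthy"
    # A minority of hot nodes is a warning sign, not an outage.
    if fraction_hot < outage_fraction:
        return "degraded"
    return "critical"


if __name__ == "__main__":
    print(cluster_state({"node1": 97.0, "node2": 35.0, "node3": 40.0}))
```

Evaluating the cluster as a whole, rather than alerting on each node, is one way to keep a single busy node from being reported as an application outage.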
Defining Solution Requirements
The first question the VMware team had to ask themselves was “what are the key metrics required for optimal monitoring of each layer of their application stacks?” While there were many key transactions, not all needed to be displayed on a dashboard view of real-time system performance.
The VMware IT support teams needed a centralized platform with:
- An end-to-end view of key business transactions or workflows
- The ability to visualize the KPIs of key business transactions and dependent services
It was also critical to identify performance bottlenecks before they became major incidents. Finally, the ability to provide a quick visual indicator of the state of an entire business transaction would aid quicker fault identification.
Solution Selection and Implementation
After researching monitoring solutions, the IT team selected VMware Tanzu Observability by Wavefront, a proven solution that was flexible enough to work with existing tools through its available integrations.
VMware took a phased approach in implementing Tanzu Observability. Starting with the top-level view, they ingested relevant transaction KPIs into Tanzu Observability, then moved down the stack until the desired ‘end-to-end’ view was achieved.
Results and Best Practices
The team built a dashboard consisting of KPIs to enable an end-to-end transaction view. This dashboard provided visual cues from the entire application stack that the critical transaction cut across.
By collating predetermined metrics from across the stack, the team can view key metrics on an easy-to-understand Tanzu Observability ‘landing pane’. They can identify anomalies, normalize disparate metrics, visually assess whether service is impacted, and find the root cause, all without switching between tools.
Finally, Tanzu Observability reduced duplicate monitoring, which improved resource utilization and optimized costs.
There was also an unexpected result. A few weeks later, one of the key applications failed over to Disaster Recovery. While the support teams were observing the services fail over, they noticed unexpected behavior: an underlying service continued to submit requests to the primary site instead of the Disaster Recovery site. Thanks to the number of integrations available in Tanzu Observability, granular visibility into the request flow could be accurately captured and represented.
Customization is essential to meet specific requirements. VMware’s IT team recommends these best practices when deploying Tanzu Observability:
- Prefix and tag metrics to differentiate the data being ingested, e.g., tag all metrics from a specific service/layer with the same name.
- Restrict the number of key metrics for better correlation.
- Leverage functions, such as moving average, to provide a more current view of the metric state.
- Load-balance the Tanzu Observability proxies to keep data ingestion highly available and evenly distributed.
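Two of these practices, prefixing and tagging metrics and smoothing with a moving average, can be sketched in a few lines of Python. This is a hedged illustration only: the helper names, the `it.orders` prefix, and the `service`/`layer` tags are hypothetical, and the line layout assumes the Wavefront data format (`<metric> <value> <timestamp> source=<source> <tags>`), which should be checked against the product documentation.

```python
from collections import deque


def format_point(prefix, name, value, timestamp, source, tags):
    """Build one metric line with a shared prefix and point tags.

    Assumes the Wavefront data format; all names are illustrative.
    """
    # Sort tags so the output is stable and easy to compare.
    tag_text = " ".join(f'{key}="{val}"' for key, val in sorted(tags.items()))
    line = f"{prefix}.{name} {value} {timestamp} source={source} {tag_text}"
    return line.rstrip()


def moving_average(values, window):
    """Trailing moving average; early points average what is available."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)  # keeps only the last `window` samples
    averaged = []
    for value in values:
        recent.append(value)
        averaged.append(sum(recent) / len(recent))
    return averaged


if __name__ == "__main__":
    tags = {"service": "checkout", "layer": "app"}
    print(format_point("it.orders", "checkout.latency_ms", 120,
                       1600000000, "app01", tags))
    print(moving_average([10, 20, 30, 40], window=2))
```

In Tanzu Observability itself the smoothing would normally be done with a query function on the dashboard rather than in client code; the Python version here only shows the arithmetic.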
The VMware IT team continues to fine-tune their dashboard as they gain additional experience with Tanzu Observability.