Digital transformation has changed the way applications are deployed and consumed. The journey from end user to application has become increasingly complex, and improving it is a key objective for the Modern Network. End users are more distributed, and applications run on heterogeneous infrastructure, often delivered from on-prem data centers, IaaS, SaaS, and public cloud locations. On average, enterprises use hundreds of applications. The number of end-user and IoT devices has also increased exponentially. They range from infusion pumps in hospitals to Point of Sale systems in retail. These devices access applications from manufacturing floors, carpeted offices, and homes, or while users are on the move. As more devices and applications are enabled, the network increases in both complexity and value to the enterprise.

What has become increasingly clear is the need for advanced self-healing solutions that compensate for this complexity by helping IT teams shift to a proactive mode of operating a network.  Several tools exist that provide domain or service-specific insights, but it is left to the IT teams to make sense of the volumes of data generated by these fragmented solutions to detect issues and perform root cause analysis.  The dynamic nature of the network, device density, and the volume of data and transactions generated render existing solutions ineffective.  IT teams are in a perpetually reactive mode to manage network operations.  In addition, the velocity of new applications and application updates is accelerating the need for automated application delivery and self-service for application owners to provision application networking services.

Insights and Auto-Remediation: Foundations of a Self-healing Network

The end-user application experience depends on several factors. One important consideration is the number of infrastructure segments and connections traversed when accessing network services.  Today’s networks also have layers of encapsulation with both an underlay and overlay view, making network visibility extremely important yet complex.  Understanding performance issues visible from latency, jitter, or retransmits requires an end-to-end view.

Here’s a 4-step framework that provides comprehensive end-to-end data correlation and analysis across campus and branch networks, data centers, and cloud.

Comprehensive End-to-End Data Correlation and Analysis


  • Measure – As the average number of software apps deployed by the Enterprise increases, gathering metrics across the network and collecting them at high fidelity and data granularity becomes critical. Furthermore, modern distributed applications have resulted in a radical shift in data center traffic patterns. North-south traffic has been replaced largely by east-west traffic, which demands changes in security and monitoring measures.  This is the first step toward proactively monitoring and reviewing user and device experience.
  • Baseline – With adequate measuring and monitoring mechanisms in place, the next step is to continuously analyze network activity to baseline normal network behavior and user/device experience. The end-user Quality of Experience (QoE) is used to determine how well the network is serving the end user. This becomes the baseline behavior, which can then be used along with advanced algorithms to detect anomalous behavior in application performance and security. Any deviation from the normal, expected behavior is tagged as an anomaly for further inspection.
  • Recommend – With the baseline and anomaly detection in place, the next step is to create real-time correlation across the application and infrastructure stack. Advanced data models enable AI/ML analytics, including event correlation, anomaly detection, and root cause analysis (RCA) across all supported data sources and stacks.
  • Remediate – The solution should also, where appropriate, take corrective action when performance issues are observed. For example, if an application receives higher than expected traffic and the end user experiences slow response times above some AI/ML learned threshold, the system automatically scales load balancing capacity and triggers the autoscaling of the backend application servers to reduce app response times.
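The Baseline and Remediate steps above can be sketched in a few lines. This is a minimal illustration, not any product's algorithm: it assumes the baseline is simply the mean and standard deviation of past response times, uses a z-score threshold in place of an AI/ML-learned one, and returns a hypothetical action name ("scale_out") rather than calling a real orchestration API.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize normal behavior as (mean, standard deviation) of past samples."""
    return mean(samples), stdev(samples)

def is_anomaly(value, baseline, z_threshold=3.0):
    """Flag a value that deviates from the baseline by more than z_threshold sigmas."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold

def remediation_for(value, baseline, z_threshold=3.0):
    """Return a hypothetical action name; only slower-than-normal anomalies trigger scaling."""
    mu, _ = baseline
    if is_anomaly(value, baseline, z_threshold) and value > mu:
        return "scale_out"
    return "none"

# Assumed sample data: application response times in milliseconds.
history_ms = [100, 102, 98, 101, 99, 100, 103, 97]
baseline = build_baseline(history_ms)

print(is_anomaly(101, baseline))       # within normal range
print(remediation_for(180, baseline))  # far above baseline
```

In practice the baseline would be recomputed continuously over a sliding window, and the threshold would be learned per application rather than fixed.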

Enterprise IT also needs coherent and deep visibility into inter-dependent contributing factors.  As an example, IT can understand and characterize user experience by gaining visibility into the round-trip time required to access an application.  If the latency at every major hop in the network from the end user device to the backend application is known, network administrators can eliminate cumbersome network traces and after-the-fact log analysis to understand slow responses and failures.  These services need to be available on a per-app basis and with consistency across any data center or cloud environment.
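To make the per-hop latency idea concrete, here is a small sketch. It assumes a hypothetical input format: a list of (hop name, cumulative round-trip time in ms) pairs measured from the end-user device outward. The hop names and values are invented for illustration, and real measurements are noisier than this (cumulative RTTs are not always monotonic).

```python
def segment_latencies(cumulative_rtt_ms):
    """Convert cumulative RTTs [(hop, rtt_ms), ...] into latency added by each segment."""
    segments = []
    previous = 0.0
    for hop, rtt in cumulative_rtt_ms:
        segments.append((hop, rtt - previous))
        previous = rtt
    return segments

def slowest_segment(cumulative_rtt_ms):
    """Return the (hop, added_latency_ms) pair that contributes the most latency."""
    return max(segment_latencies(cumulative_rtt_ms), key=lambda s: s[1])

# Assumed example path from end-user device to backend application.
path = [
    ("wifi-ap", 2.0),
    ("branch-edge", 5.0),
    ("wan", 45.0),
    ("load-balancer", 47.0),
    ("app-server", 60.0),
]

print(segment_latencies(path))
print(slowest_segment(path))
```

With this breakdown available per application, an operator can see at a glance which segment dominates the round-trip time instead of reconstructing it from traces after the fact.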

Delivering end-user application experience

Users expect always-on access from anywhere to applications deployed on any cloud.  Any downtime can lead to productivity loss and business risk.  As IT strives to minimize the risk of failure, they need solutions that can proactively isolate failure segments and offer remediation steps.  Operations and SRE teams, on the other hand, need to identify and remove application bottlenecks while also identifying performance and scaling optimization opportunities.  They also need a way to compare performance across multiple locations.

The VMware Virtual Cloud Network (VCN) enables Enterprises to see the network as an end-to-end environment.  With a comprehensive portfolio of virtualized networking, security, and analytics solutions, VCN addresses the core set of capabilities needed for detection, avoidance, and end-to-end auto-remediation, providing the foundations for self-healing networks.  Let’s look at some examples.

VMware vRealize Network Insight abstracts away many of the complexities and challenges associated with traditional network monitoring tools. The network no longer needs to be built or managed with a box-by-box approach. The entire environment is seen as an end-to-end system. vRealize Network Insight starts by using machine learning techniques to discover application boundaries. This leads to a more comprehensive understanding of, and correlation among, the complex interactions between applications, network traffic, containers, virtual machines, devices, and end users.

Modern techniques like Formal Verification are also applied, leveraging mathematical modeling to detect issues early in complex systems. This capability helps the operator understand what is happening in the network and also predict what could happen to the network, taking a bold step towards building networks that essentially manage themselves. Furthermore, changes in the network compound over time, causing deviations that can lead to failures. vRealize Network Insight Assurance and Verification continually validates that the original business intent is still being followed. This dynamic check validates best practices and user-defined business intents and notifies the operator when remediation is required.
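As a rough illustration of what checking an intent against the current network state can look like, the sketch below runs a reachability check over a toy topology. It is not how vRealize Network Insight implements verification: the graph, node names, and the intent format (source, target, expected reachability) are all assumptions, and real verification tools model forwarding and policy state in far more detail.

```python
from collections import deque

def reachable(topology, source, target):
    """Breadth-first search: can traffic from source reach target in this topology?"""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in topology.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False

def violated_intents(topology, intents):
    """Return every (source, target, expected) intent the topology no longer satisfies."""
    return [i for i in intents if reachable(topology, i[0], i[1]) != i[2]]

# Hypothetical topology (directed adjacency lists) and business intents.
topology = {"web": ["app"], "app": ["db"], "guest": ["internet"]}
intents = [
    ("web", "db", True),     # web tier must be able to reach the database
    ("guest", "db", False),  # guest network must never reach the database
]

print(violated_intents(topology, intents))

# A later change accidentally connects the guest network to the app tier.
drifted = dict(topology, guest=["internet", "app"])
print(violated_intents(drifted, intents))
```

Running the same check after every change is what turns a one-time design review into the continual validation described above.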

Another example is the VMware Edge Network Intelligence solution, which uses machine learning and big data analytics to make sense of the volume of transactions generated when users access applications. The solution gathers data from a variety of sources to automatically discover the devices on the network, baseline their performance, and monitor for deviations from that baseline. When a deviation occurs, the solution performs root cause analysis to determine the fault segment and provides actionable insights to remediate the problem. For example, if users at a specific location are having issues with a SaaS application, VMware Edge Network Intelligence isolates the problem to a specific SaaS Point of Presence. The solution then directs the VMware SD-WAN Orchestrator to take corrective action by offering the choice of the next best location to source the application.

One final example is the multi-cloud application services that are part of VMware NSX Advanced Load Balancer. Administrators no longer need to overprovision load balancing capacity as Active/Standby pairs of hardware appliances to account for sporadic traffic peaks. NSX Advanced Load Balancer uses a software-defined architecture with central orchestration of a distributed data plane of load balancers that can be scaled out horizontally and on demand. The ability to auto-scale across any data center or cloud environment means that administrators can help maintain application SLAs and end-user experience without manual intervention. Traffic flows are automatically rebalanced across newly scaled load balancer instances while gracefully terminating open connections as necessary. NSX Advanced Load Balancer is fully integrated with data center and cloud environments and is completely programmable, allowing admins to set up triggers to auto-scale backend application instances to handle unexpected traffic increases. All system responses and application performance can be monitored easily with the analytics and visibility built into the solution.

Here are some additional resources to further explore these topics: