Building Self-healing Load Balancing Services

The cloud is all about redundancy and fault tolerance. In infrastructure development, a system’s ability to tolerate failures while still ensuring adequate quality of service, often generalized as resiliency, is typically specified as a requirement. However, no single component can guarantee 100% uptime. Even the most expensive hardware can fail, power outages can take entire datacenters offline, and accidental misconfiguration, shutdown, or disconnection of components can cause service disruptions.

Most of these challenges can be overcome by designing a fault-tolerant network with multiple availability zones, disaster recovery DCs, and high-availability infrastructure. However, that approach is expensive in both operational and capital expenditure: it requires deploying infrastructure components that sit idle 99.99% of the time, plus the staff to maintain that idle infrastructure, and it still doesn’t guarantee that services are not disrupted. To build a truly fault-tolerant network that is also cost-effective, the network must scale up and down based on demand and fix itself if something breaks.

With the distributed cloud architecture of VMware NSX Advanced Load Balancer (NSX ALB, formerly Avi Networks), individual components can fail without affecting the availability of the services delivered, because the system automatically repairs itself. For example, one customer, an eCommerce provider, accidentally turned off a set of Avi Service Engines, taking them offline. Because of NSX ALB’s automated failure recovery, the customer didn’t realize that the Service Engines had failed until they reviewed logs days later for a different reason.

So why didn’t they notice that a few Avi Service Engines had been disconnected by accident? The short answer: when the Avi Controller detects that a Service Engine is down, it moves all virtual services to an available Service Engine with sufficient capacity or spins up a new Service Engine. The same applies when the Avi Controller detects that a Service Engine is experiencing resource exhaustion: to accommodate increased demand, the Controller can automatically spin up additional Service Engines to ensure a consistent end-user experience. In contrast, with legacy load balancers an accidental disconnect or shutdown would result in total loss of application access or severe service degradation, and there is no automated scaling up and down based on demand.

Figure 1: NSX ALB self-healing architecture

Automated Failure Recovery utilizes three detection methods.

Controller-to-Service Engine Failure Detection:

In all deployment modes, the Avi Controller sends heartbeat messages to all Service Engines in all groups under its control. If there is no response from a specific Service Engine for six consecutive heartbeat messages, the Controller concludes that the Service Engine is down, and moves all virtual services to an available Service Engine with sufficient capacity or spins up a new Service Engine.
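The heartbeat rule described above can be illustrated with a minimal Python sketch. It is not the Avi Controller’s implementation: the `ServiceEngine` class, the slot-based `capacity` model, and the helper names `record_heartbeat` and `fail_over` are all hypothetical. Only the threshold of six consecutive missed heartbeats and the "move to an available Service Engine or spin up a new one" behavior come from the text.

```python
from dataclasses import dataclass, field

MAX_MISSED_HEARTBEATS = 6  # from the text: six consecutive unanswered heartbeats


@dataclass
class ServiceEngine:
    """Hypothetical model of a Service Engine as seen by the Controller."""
    name: str
    capacity: int                                   # assumed: max virtual services per SE
    virtual_services: list = field(default_factory=list)
    missed: int = 0                                 # consecutive unanswered heartbeats
    up: bool = True

    def free_slots(self) -> int:
        return self.capacity - len(self.virtual_services)


def record_heartbeat(se: ServiceEngine, responded: bool) -> bool:
    """Update the miss counter; return True only when the SE is newly declared down."""
    if responded:
        se.missed = 0          # any response resets the consecutive-miss count
        return False
    se.missed += 1
    if se.up and se.missed >= MAX_MISSED_HEARTBEATS:
        se.up = False
        return True
    return False


def fail_over(failed: ServiceEngine, group: list) -> list:
    """Move every virtual service off the failed SE; return any newly created SEs."""
    created = []
    for vs in list(failed.virtual_services):
        # Prefer an existing healthy SE that still has room.
        target = next(
            (se for se in group
             if se is not failed and se.up and se.free_slots() > 0),
            None,
        )
        if target is None:
            # No spare capacity anywhere: model "spins up a new Service Engine".
            target = ServiceEngine(name=f"se-new-{len(created) + 1}",
                                   capacity=failed.capacity)
            group.append(target)
            created.append(target)
        target.virtual_services.append(vs)
        failed.virtual_services.remove(vs)
    return created
```

In this sketch a caller would invoke `record_heartbeat` once per heartbeat interval for each Service Engine and call `fail_over` as soon as it returns `True`. Counting consecutive misses, rather than total misses, keeps a single dropped message from triggering a failover.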

Service Engine-to-Service Engine Failure Detection Method:

In the Controller-to-Service Engine failure detection method described above, the Controller detects a Service Engine failure by sending periodic heartbeat messages over the management interface. However, this method does not detect datapath failures on the Service Engines’ data interfaces. To ensure holistic failure detection, Service Engines also send periodic heartbeat messages over the data interfaces. If the Controller then concludes that a Service Engine is down, it moves all virtual services to an available Service Engine with sufficient capacity or spins up a new Service Engine.

BGP-Router-to-Service Engine Failure Detection Method:

With BGP configured, Service Engine-to-Service Engine failure detection is augmented by Bidirectional Forwarding Detection (BFD), which detects SE failures and prompts the router to stop using the route to the failed SE for flow load balancing. BGP protocol timers provide an additional detection mechanism.

Increasing Capacity and Rebalancing

Automated failure recovery alone isn’t enough for a truly resilient network. Addressing performance, by automatically increasing capacity or redistributing resources when needed, is just as crucial. While performing application delivery tasks, Service Engines can experience resource exhaustion. This may be due to the deployment of a new application, high CPU or memory utilization, or traffic patterns. By monitoring application and network telemetry from the Service Engines in real time, the Avi Controller can automatically migrate a virtual service to an under-utilized Service Engine, or scale out the virtual service across multiple Service Engines in multi-region and/or multi-availability-zone deployments to increase capacity. Scaling out allows multiple active Service Engines to concurrently share the workload of a single virtual service.
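The choice between migrating a virtual service and scaling it out can be pictured with a small Python sketch. The Controller’s actual decision logic is not described in this post, so everything here is an assumption made for illustration: the 80% `HIGH_WATERMARK`, the `SeLoad` class, the `vs_share` estimate of how much load the virtual service contributes, and the rule that scaling out splits that load in half.

```python
from dataclasses import dataclass

HIGH_WATERMARK = 0.80  # assumed utilization threshold, not a documented default


@dataclass
class SeLoad:
    """Hypothetical telemetry snapshot for one Service Engine (values in 0.0-1.0)."""
    name: str
    cpu: float
    memory: float

    def utilization(self) -> float:
        # Treat the scarcer resource as the limiting factor.
        return max(self.cpu, self.memory)


def plan_rebalance(hosting: SeLoad, others: list, vs_share: float) -> tuple:
    """Return (action, target_name) with action in none/migrate/scale_out/new_se."""
    if hosting.utilization() < HIGH_WATERMARK:
        return ("none", None)              # hosting SE is healthy, nothing to do
    if not others:
        return ("new_se", None)            # nowhere to move the load
    best = min(others, key=lambda se: se.utilization())
    if best.utilization() + vs_share < HIGH_WATERMARK:
        return ("migrate", best.name)      # least-loaded SE can take the whole service
    if best.utilization() + vs_share / 2 < HIGH_WATERMARK:
        return ("scale_out", best.name)    # share the service across two SEs
    return ("new_se", None)                # no existing SE has enough headroom
```

For example, with a hosting Service Engine at 90% CPU and a peer at 30%, a virtual service estimated at 40% of an engine’s capacity would be migrated, while one estimated at 60% would be scaled out so that both engines carry part of it.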

In addition, NSX ALB learns application access patterns and can perform intelligent, predictive autoscaling based on the learned traffic patterns and application usage, making services highly available before demand causes any service exhaustion or disruption.

Figure 2: NSX ALB Auto rebalance and capacity adjustment

Automated Load Distribution

With NSX ALB, enterprises can move workloads across multiple clouds effortlessly. NSX ALB enables enterprises to use AWS, GCP, or Azure as a natural extension of their data centers by automatically overflowing to the cloud during traffic peaks. It can automatically create app resources in public clouds to absorb traffic surges and then scale them back down. For operational automation, NSX ALB is natively integrated with the surrounding ecosystem, so that firewalls, IP addressing, and DNS can be configured automatically with Infoblox, Amazon Route 53, etc. It also allows additional customization through REST APIs.

Figure 3: NSX ALB Ecosystem Integration

NSX ALB PULSE – Cloud Services for Support and Security

While the system self-heals, failures don’t have to go unnoticed. VMware NSX ALB PULSE services provide an automated, central facility for a globally distributed set of Avi Controller clusters, and can automatically submit a support case so that a post-event analysis can determine the reason for the failure.

Any Avi Controller within any cluster can connect to the NSX ALB PULSE services to retrieve the latest threat intelligence and automatically contact VMware support if a system failure is detected.

These new services provided by NSX ALB PULSE reduce the complexity, management overhead, and cost of your network operations. They reduce risk with automated compliance and remediation services, increase business agility, and accelerate the digital transformation of your business.

Automated Support Case Management

In most enterprises, monitoring systems are in place to raise an alert if a failure or problem occurs within the network. The challenge with this legacy approach is that someone has to log on to the failed system, download support bundles, log files, and traces (possibly from multiple systems that are only accessible via jump boxes), contact customer support, create a case, upload the support files, and then wait for the support engineer to figure out what went wrong. This process is time-consuming and error-prone.

NSX ALB PULSE services include automated support case management to eliminate these challenges, giving operators the option of having a support case created automatically whenever a failure occurs.

Upon fault notification, the administrator simply creates a support case directly from any Controller within the cluster. The selected Controller collects all ecosystem metadata (software versions, connected Service Engines, etc.), creates a tech support bundle with all logs and traces, automatically creates a tech support case, fills in all information from the collected metadata, and uploads the tech support bundle to the case. This reduces the time needed to create a support case with all relevant information from hours to minutes.

Alternatively, with NSX ALB PULSE automated case creation, if a failure occurs, any Controller within the cluster collects all ecosystem metadata, creates a tech support bundle, and automatically creates a tech support case, without any involvement from an administrator. Upon case creation, VMware tech support starts working the case immediately and notifies the customer, possibly with a solution.
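The two case-creation paths described above, administrator-initiated and fully automated, share the same steps. The following Python sketch models only that ordering. It does not call or represent any real NSX ALB PULSE API: `SupportCase`, `collect_metadata`, `build_bundle`, `open_case`, the dictionary layout of `cluster`, and the bundle file name are all hypothetical placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class SupportCase:
    """Hypothetical in-memory stand-in for a support case."""
    summary: str
    metadata: dict
    attachments: list = field(default_factory=list)


def collect_metadata(cluster: dict) -> dict:
    # Step 1: gather ecosystem metadata (software version, connected SEs).
    return {
        "software_version": cluster["version"],
        "service_engines": sorted(cluster["service_engines"]),
    }


def build_bundle(cluster: dict) -> str:
    # Step 2: create the tech support bundle; only a placeholder name is returned.
    return f"techsupport-{cluster['name']}.tar.gz"


def open_case(cluster: dict, fault: str, auto_create: bool,
              admin_requested: bool = False):
    """Create a case if automation is enabled or an administrator asked for one."""
    if not (auto_create or admin_requested):
        return None                         # fault is only reported, no case opened
    # Step 3: create the case pre-filled from the collected metadata.
    case = SupportCase(summary=fault, metadata=collect_metadata(cluster))
    # Step 4: attach ("upload") the bundle to the newly created case.
    case.attachments.append(build_bundle(cluster))
    return case
```

The only difference between the two paths in this sketch is the trigger: `admin_requested=True` stands for the administrator starting the process after a fault notification, while `auto_create=True` stands for the Controller doing so on its own.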

NSX ALB delivers a highly reliable distributed cloud architecture: if individual components fail, the availability of the services delivered by the system is not affected. Modern infrastructure should support these reliability and self-healing capabilities by default, so it is no surprise that enterprises are ending their reliance on purpose-built appliances that are too important to fail. Join me on a webinar tomorrow, where I will share more details and a live demo on this topic. The webinar will also be available for on-demand viewing. Check out https://info.avinetworks.com/webinars/resilient-load-balancing

