
VMware Edge Network Intelligence: Use AIOps to Help SD-WAN Infrastructures Self-Heal

11/7/2023: VMware Edge Network Intelligence is now VMware Edge Intelligence! Announced at VMware Explore Barcelona, Intelligent Assist for VMware Software-Defined Edge will bring VMware Edge Intelligence together with generative AI to provide intelligent remediation and security for both OT and IT environments.
Read our blog post to learn more.


Updated 10/18/2022

Part 3: Creating a Self-Driving Network for Superior Performance, Security, Availability and Efficiency

Note: This is the third in a four-part series about how VMware Edge Network Intelligence™ enables better insights for IT into client device experience and client behavior.

Make network management simpler and more effective

The first post in this series offered an overview on how AIOps provides companies with actionable information, including performance baselines, gleaned from the massive data generated by a global SD-WAN network. The second post details how additional data sources created along the client journey from device to application provide the context that helps network engineers quickly fix network problems at branches before they hamper business operations.

This third post shows you how VMware SD-WAN best-of-breed self-healing functionality becomes even more effective when powered by VMware Edge Network Intelligence data analytics. This combination plays a key role in helping network IT teams manage their responsibilities more effectively and efficiently.

Innovative self-healing functionality

The VMware SD-WAN platform typically performs self-healing by reacting to changes in the network based on local information and guided by business policy and intent. For example, the VMware SD-WAN Edge constantly measures the health of all the tunnels it creates. Whether to other sites, edges, or even gateways in the cloud, it analyzes tunnel health, helping the platform react when it detects any issues. To maintain overall network quality, the system then switches to another WAN link, adds redundancy, or leverages multiple paths in a clever way.
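The local decision described above can be pictured with a minimal Python sketch. It is illustrative only and is not the VMware SD-WAN implementation: the `TunnelHealth` fields, the threshold values, and the `choose_action` function are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TunnelHealth:
    """One health sample for a tunnel (hypothetical structure)."""
    link: str
    loss_pct: float     # packet loss in percent
    latency_ms: float   # measured latency in milliseconds
    jitter_ms: float    # measured jitter in milliseconds


# Assumed thresholds for illustration; real policies depend on business intent.
MAX_LOSS_PCT = 1.0
MAX_LATENCY_MS = 150.0
MAX_JITTER_MS = 30.0


def is_healthy(t: TunnelHealth) -> bool:
    """A tunnel is healthy when every metric is within its threshold."""
    return (t.loss_pct <= MAX_LOSS_PCT
            and t.latency_ms <= MAX_LATENCY_MS
            and t.jitter_ms <= MAX_JITTER_MS)


def choose_action(current: TunnelHealth, alternatives: List[TunnelHealth]) -> str:
    """Decide locally: stay, switch to a healthy link, or add redundancy."""
    if is_healthy(current):
        return f"stay on {current.link}"
    healthy = [t for t in alternatives if is_healthy(t)]
    if healthy:
        # Prefer the healthy link with the least loss, then the lowest latency.
        best = min(healthy, key=lambda t: (t.loss_pct, t.latency_ms))
        return f"switch to {best.link}"
    # No healthy alternative: fall back to redundancy across the links.
    return "duplicate packets across all links"
```

Note that every input here comes from a single edge and a short measurement window, which is exactly the limitation the rest of this post addresses.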

In short, the SD-WAN network today already self-heals based on local information provided over short periods of time. A single edge measures its own tunnels, determines that the tunnel health is degrading over time, and reacts to the problem. This functionality happens on an application or packet basis based on the overall tunnel health. What AIOps brings to the table from a self-healing context is its ability to incorporate global information over much longer periods of time.

AIOps provides the global information to make self-healing more effective

The last blog post discussed how using additional data sources fed into AIOps makes managing SD-WAN easier. These sources include data from the network itself or from other vantage points along the client-to-application journey, such as wireless devices, switches, network services, and the SD-WAN edges, gateways, and hubs. Additionally, anonymized insights from other VMware SD-WAN enterprise networks are fed into VMware Edge Network Intelligence. All this information helps the SD-WAN platform detect patterns and locate the source of any network problem.

Adding self-healing to the equation means that AIOps essentially tells the SD-WAN network and even other controllable networks to take action to fix these global issues. Using a larger data set sampled over a longer period of time provides an additional layer of meaningful analysis. In addition to measuring and reacting to tunnel health, AIOps provides the ability to separate out every single application, analyzing each app’s performance individually. And that’s only the start.

For example, if tunnel health is fine but application performance over that tunnel remains poor, perhaps it’s an issue with certain applications requiring a special kind of remediation. A short-term fix won’t solve the issue over time. Fortunately, this approach allows the system to analyze trends, determine any emerging patterns, and actually program the ability to mitigate the problem on an application basis using the global information provided by AIOps. Analyzing the efficacy of fixes performed on other VMware SD-WAN networks is part of this process.
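As a rough sketch of the "healthy tunnel, unhealthy application" case, the Python snippet below flags applications that fall well under their own baseline while the tunnel itself looks fine. The score scale (0 to 1), the `margin` parameter, and the function name are assumptions made for illustration and do not come from the product.

```python
from typing import Dict, List


def apps_needing_attention(tunnel_healthy: bool,
                           app_scores: Dict[str, float],
                           app_baselines: Dict[str, float],
                           margin: float = 0.1) -> List[str]:
    """Return apps scoring more than `margin` below their own baseline.

    Scores are hypothetical quality values between 0 and 1.
    """
    if not tunnel_healthy:
        # Tunnel-level self-healing already covers this case.
        return []
    return sorted(
        app for app, score in app_scores.items()
        if score < app_baselines[app] * (1 - margin)
    )
```

For instance, with a healthy tunnel, a Zoom score of 0.6 against a baseline of 0.9 would be flagged, while a Microsoft 365 score of 0.9 against a baseline of 0.92 would not.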

Self-healing scenario examples enabled by self-reinforcement machine learning, centralized architecture and VMware SD-WAN Gateways

Fault isolation-based self-healing

Let’s dive in deeper and look at specific examples of actually performing self-remediation. The first example is fault isolation-based, which at a high level involves determining where in the network a problem exists when application performance suffers. Is the issue with the client-side LAN, the WAN, data center, cloud, or even with the application itself? This approach involves determining where the faulty segment is. For example, if the faulty segment is in the data center, the SD-WAN network can simply steer traffic to a different data center.

On the other hand, if the faulty segment is in the WAN, then the network takes mitigating actions within the WAN itself, such as prioritizing traffic a certain way, or even computing a different route. Similarly, if the problem is in the client LAN, perhaps it’s not a problem that can be self-healed by the SD-WAN network. In this case, the client or their network provider needs to take actions to determine and fix the issue.

In this particular example, VMware SD-WAN with VMware Edge Network Intelligence self-heals the network by first analyzing all this data, and then simply isolating where the fault is: client LAN, WAN, data center, or cloud. Based on the determination of location, the network tries to steer around the problem, takes some mitigating action, or alerts if the issue can’t be solved by the SD-WAN network.

That covers the fault isolation-based analysis and the subsequent self-healing based on SD-WAN data. By looking at the same application at different vantage points, such as the edge, gateway, or data center hub, fault isolation can be performed after feeding that data through an AIOps platform.
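One way to picture fault isolation is to compare each segment's observed delay for the same application against that segment's baseline and pick the segment that deviates most. The Python sketch below does this under stated assumptions: the segment names, the `tolerance` factor, and the action strings are hypothetical, and the action for the cloud segment is a guess because the post only describes steering around the problem in general terms.

```python
from typing import Dict, Optional

# Hypothetical mapping from faulty segment to the response described above.
SEGMENT_ACTIONS = {
    "client_lan": "alert: not fixable by the SD-WAN network",
    "wan": "reprioritize traffic or compute a different route",
    "data_center": "steer traffic to a different data center",
    "cloud": "steer around the affected cloud path",  # assumption
}


def isolate_fault(observed_ms: Dict[str, float],
                  baseline_ms: Dict[str, float],
                  tolerance: float = 1.5) -> Optional[str]:
    """Return the segment whose delay most exceeds its baseline, if any.

    Baselines are assumed to be positive. A segment counts as faulty only
    when observed / baseline is greater than `tolerance`.
    """
    worst_segment, worst_ratio = None, tolerance
    for segment, observed in observed_ms.items():
        ratio = observed / baseline_ms[segment]
        if ratio > worst_ratio:
            worst_segment, worst_ratio = segment, ratio
    return worst_segment


def recommend(segment: Optional[str]) -> str:
    """Translate the isolated segment into a follow-up action."""
    if segment is None:
        return "no action: all segments within baseline"
    return SEGMENT_ACTIONS[segment]
```

In this sketch, a data center segment measuring 45 ms against a 5 ms baseline would be isolated ahead of a WAN segment measuring 30 ms against 25 ms, and the recommendation would be to steer traffic to a different data center.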

Root cause-based self-healing

The next method of self-healing is root cause-based. In root cause-based analysis, the AIOps platform tries to determine a specific root cause to any network problem. For example, Zoom performance is bad on the network at a particular site because of a poor wireless signal caused by 5 GHz co-channel interference.

The ultimate goal is getting to a very specific root cause, where either the system tries to program the wireless LAN network itself, or decides that the issue can’t be mitigated in the WAN. In the latter case, the system can offer insight on the most likely root cause, which in practice can be the most time-consuming part of solving the issue.

An example of root cause-based analysis that allows for mitigation in the WAN is poor Zoom performance due to packets being dropped because of queuing inside the edge appliance or gateway. This could be fixed by prioritizing Zoom traffic higher than other applications sent over the same link.

Distributing traffic across multiple gateways, as opposed to sending all traffic to the same gateway, is another option. The system is trying to go beyond just saying that the WAN segment is faulty; it’s trying to get to an exact root cause explicitly describing the issue. Simply reversing the root cause the system detected allows for automated remediation.

In another example, inventory management software performance could be bad because of Facebook traffic on the same link. The system could solve the problem by capping the amount of traffic for Facebook on that link or rate-limiting Facebook usage. Either approach solves the problem automatically.
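The idea of reversing a detected root cause can be sketched as a simple lookup from diagnosis to remediation. The Python below is a hypothetical illustration: the `Diagnosis` structure and the root cause labels are invented for the example, and a real AIOps platform would derive them from data rather than receive them as strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Diagnosis:
    """Hypothetical output of root cause analysis."""
    app: str                       # application whose performance suffers
    root_cause: str                # label assigned by the analysis
    culprit: Optional[str] = None  # competing application, if any


def plan_remediation(d: Diagnosis) -> str:
    """Map a specific root cause to the action that reverses it."""
    if d.root_cause == "queue_drops":
        # Packets dropped by queuing: prioritize the affected app.
        return f"raise priority of {d.app}"
    if d.root_cause == "link_contention" and d.culprit:
        # Another app crowds the link: cap or rate-limit it.
        return f"rate-limit {d.culprit}"
    if d.root_cause == "wifi_cochannel_interference":
        # Not fixable in the WAN: report the most likely root cause.
        return f"notify: {d.app} degraded by wireless interference"
    return f"notify: no automated fix known for {d.root_cause}"
```

With this mapping, `plan_remediation(Diagnosis("inventory", "link_contention", "Facebook"))` yields a rate limit on Facebook, mirroring the inventory management example above.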

Reinforcement-based self-healing

The third approach to self-healing is called reinforcement-based. This methodology uses a type of machine learning technique for remediation called reinforcement learning. In this case, the system doesn’t determine an exact fault root cause for a network issue. It just knows that there’s a problem, but without knowing its exact nature.

At this point, the system begins testing the fixes it considers most likely to work, and it relies on a feedback mechanism to determine their effectiveness. That mechanism is the network performance baseline the system already maintains. The baseline is just as critical in the fault isolation and root cause methods, because it is what tells the system whether a remediation actually worked.

When applying the reinforcement-based model, the system uses the baseline as a reward function at each step of the process. This effort tries to drive the solution towards the optimal solution in terms of the right remediation approach.

For example, suppose Microsoft 365 performance is poor on a particular network. The system can “explore” different approaches (e.g., re-prioritizing traffic or steering it along a different path), and then “exploit” the approach that seems most promising (i.e., the one that improves performance against the baseline the most).

The system continues in this way until it finds something that actually works. The failed attempts aren’t wasted, because the machine learning model learns from them which remediation method works best for that scenario. Note that this can work either using deep domain knowledge or in a domain-agnostic fashion. Either way, the reinforcement-based approach depends on a tight feedback mechanism in order to ultimately determine the cause of the issue and the right fix.
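The explore and exploit loop can be illustrated with an epsilon-greedy bandit, one of the simplest reinforcement learning techniques. This is a sketch under assumptions, not the algorithm the product uses: the action names, the `measure_improvement` callback (standing in for "improvement against the baseline"), and the parameter values are all hypothetical.

```python
import random
from typing import Callable, Dict, List


def learn_remediation(actions: List[str],
                      measure_improvement: Callable[[str], float],
                      rounds: int = 50,
                      epsilon: float = 0.2,
                      seed: int = 0) -> str:
    """Epsilon-greedy search for the fix that most improves the baseline.

    `measure_improvement(action)` is assumed to apply the fix and return
    the change relative to the performance baseline (the reward).
    """
    if rounds < len(actions):
        raise ValueError("need at least one round per action")
    rng = random.Random(seed)
    totals: Dict[str, float] = {a: 0.0 for a in actions}
    counts: Dict[str, int] = {a: 0 for a in actions}

    def mean_reward(action: str) -> float:
        return totals[action] / counts[action]

    for step in range(rounds):
        if step < len(actions):
            action = actions[step]                  # try every fix once
        elif rng.random() < epsilon:
            action = rng.choice(actions)            # explore
        else:
            action = max(actions, key=mean_reward)  # exploit
        reward = measure_improvement(action)
        totals[action] += reward
        counts[action] += 1

    # The fix with the best average reward is the learned remediation.
    return max(actions, key=mean_reward)
```

The reward values recorded for the less effective fixes are what make the failed attempts useful: they are the evidence that steers later rounds toward the better option.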

Self-healing information security problems

In addition to proactively improving network performance, self-healing also applies to fixing potential security issues. For example, the system could automatically detect that point of sale devices, which need to be PCI compliant, are on the same network segment as iPhones and other user traffic. This is a violation of PCI rules. Therefore, these devices need to be “healed” by being automatically placed in their own micro-segment.
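A minimal sketch of the segmentation check might look like the following. The device tuples, the `"pos"` device kind, and the `pci-microsegment` name are hypothetical placeholders, and real PCI compliance involves far more than this single rule.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each device is (name, kind, segment); all values are illustrative.
Device = Tuple[str, str, str]


def find_pci_violations(devices: List[Device]) -> Dict[str, str]:
    """Return proposed moves for POS devices that share a segment.

    A segment is in violation when it holds point of sale devices
    together with any other kind of device.
    """
    by_segment = defaultdict(list)
    for name, kind, segment in devices:
        by_segment[segment].append((name, kind))

    moves: Dict[str, str] = {}
    for members in by_segment.values():
        kinds = {kind for _, kind in members}
        if "pos" in kinds and len(kinds) > 1:
            for name, kind in members:
                if kind == "pos":
                    # "Heal" by isolating the POS device.
                    moves[name] = "pci-microsegment"
    return moves
```

Given a register and an iPhone on the same VLAN, only the register is moved; a register that already sits alone on its segment is left untouched.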

Another example is detecting that certain devices are behaving abnormally by accessing higher-risk destinations. The system proactively handles the problem by automatically programming a rule that denies these devices access. While the VMware SD-WAN platform paired with VMware Edge Network Intelligence provides self-healing functionality from a performance standpoint, it also takes a similar approach from the perspective of security policies. This approach ties in with the overall VMware vision for providing an SD-WAN network with both superior performance and state-of-the-art SASE and security services at the edge.

AIOps makes network administrators more efficient

With all of these self-healing mechanisms available to the SD-WAN platform to automatically fix network issues, IT administrators might have different levels of comfort about this automated technique. At the minimum, they want to be informed when the system automatically took any action. This reporting is also important when determining ROI on the SD-WAN solution that includes an AIOps platform with self-healing functionality.

One possibility from an overarching perspective is for the system to notify ITOps about any recommended actions. This approach provides a prompt to the network engineer; they can judge whether they’re comfortable with the system taking a particular action before it happens.

Another option automates the self-healing action and annotates the baseline with information about that action. ITOps receives a notification about the automated action, as well as details about the subsequent network performance. The network engineer can then choose whether to allow similar automated actions in the future. This can spare the engineer calls in the middle of the night about problems the network has already healed on its own.
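The two operating modes can be contrasted in a short Python sketch. The `Mode` enum, the `handle_fix` function, and the log messages are hypothetical and only illustrate the order of events described above.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    RECOMMEND = "recommend"  # engineer approves before anything happens
    AUTOMATE = "automate"    # system acts first, then reports


def handle_fix(action: str, mode: Mode,
               approved_by_engineer: bool = False) -> List[str]:
    """Return the ordered list of events for a proposed self-healing action."""
    if mode is Mode.RECOMMEND and not approved_by_engineer:
        # Prompt the network engineer and wait for a decision.
        return [f"notify: recommended '{action}', awaiting approval"]
    return [
        f"apply: {action}",
        f"annotate baseline: {action}",  # so later performance can be attributed
        f"notify: applied '{action}'",
    ]
```

In both modes the engineer is informed; the difference is whether the notification arrives before or after the action is taken.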

Ultimately, investing in an SD-WAN platform becomes even wiser when combining it with state-of-the-art AIOps functionality.