Our recent blog on 5G Mobile Service Assurance highlighted the challenges of cross-domain, multi-layer operations and automation for 5G service assurance and described tools for managing it.
Let’s now take a closer look at closed-loop operations for 5G service assurance, which automate the process of identifying a fault and then remediating it.
The Challenges of Managing Complex 5G Networks
Operations teams face many new challenges when monitoring and managing large and complex cloud-native RAN and Core network environments.
- CSPs struggle to monitor a wide and diverse set of multi-vendor CNFs and mission-critical 5G services.
- Network Operations Center (NOC) teams are overwhelmed with NFVI, CaaS, CNF and Physical Network alarms and are unable to quickly analyze the symptoms and zero in on the most serious issues.
- Information is lacking on which layer of the infrastructure or network element is causing the faults and which services, customers and SLAs are being impacted.
- NOC RAN and Core network teams are dedicated to specific areas of the infrastructure, looking at element and network management system (EMS/NMS) dashboards and reports provided by each vendor; as a result of these silos, there is limited collaboration across teams.
- Thousands of hours are spent manually updating EMS/NMS remediation rules each time a new issue is identified in the network, the topology changes or a new service is deployed.
- A huge volume of Change Request and Service Request tickets is created, which delays resolution and hurts SLAs and response times.
To reduce mean time to resolution and remediation, operators need to be able to seamlessly detect the root cause of service failure or degradation by correlating fault and performance monitoring data collected from various layers of the telco cloud infrastructure. This is particularly complex for 5G service delivery, which involves thousands of unconnected virtual components and cloud applications that are not tied to specific resources.
To address these service assurance complexities, CSPs need a multi-layer automated assurance platform that gives them not just near-real-time visibility and deep, actionable insights, but also automated closed-loop operations and remediation capabilities. To meet SLA and service quality requirements, the platform must provide an end-to-end view across the infrastructure, orchestration and service layers.
How VMware Telco Cloud Operations Improves Mean Time to Resolution
The VMware Telco Cloud Platform, our recent, award-winning cloud-native 5G telco platform, meets the 5G service assurance needs of operators. It is integrated with VMware Telco Cloud Operations and VMware Telco Cloud Automation to monitor the multi-layer telco cloud infrastructure and provide automated root cause and impact analysis, as well as remediation. The result is lower cost and greater operational efficiency for CSPs.
Automated Root Cause Analysis
VMware Telco Cloud Operations’ root cause and impact analysis capabilities bring efficiency to network operations by automatically correlating symptoms from the many layers of the infrastructure stack (physical, virtual, Kubernetes, CNFs, VNFs and services) and pinpointing the problem’s root cause. It then notifies operations of the issue and its associated impacts.
The figure shows how Telco Cloud Operations detects the root cause of a physical infrastructure problem. The Telco Cloud Operations Root Cause Analysis Engine correlates the symptoms across the infrastructure layers. The Port Down fault impacts host connectivity, causing malfunction of 5G infrastructure, VM, Kubernetes and CNF layers, and eventually impairing the gNodeB service delivered on the infrastructure.
Figure: Telco Cloud Operations NOC Notification Log View
As shown in the figure above, the Notification Log View enables the NOC operations team to pinpoint root cause. “Port Down” is the root cause. Symptoms of the root cause are shown as: Host Unresponsive, Pod Failed, CNF Impaired, gNodeB Service Impaired.
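The correlation described above can be illustrated with a small sketch. This is not the Telco Cloud Operations Root Cause Analysis Engine; it is a simplified model that assumes each entity depends on exactly one lower-layer entity. The entity names (`switch-port-3`, `host-17`, and so on) and the `DEPENDS_ON` map are hypothetical; only the alarm names come from the example above.

```python
# Hypothetical dependency map: each entity points to the lower-layer
# entity it runs on. A real topology would be discovered, not hard-coded.
DEPENDS_ON = {
    "gnodeb-service": "cnf-cu",
    "cnf-cu": "pod-cu-0",
    "pod-cu-0": "host-17",
    "host-17": "switch-port-3",
}


def find_root_cause(entity, alarms):
    """Follow the dependency chain downward while the next entity is also
    alarmed, and return the lowest alarmed entity that was reached."""
    current = entity
    while True:
        lower = DEPENDS_ON.get(current)
        if lower is None or lower not in alarms:
            return current
        current = lower


def correlate(alarms):
    """Group active alarms as {root_cause_entity: [symptom_entities]}."""
    groups = {}
    for entity in alarms:
        root = find_root_cause(entity, alarms)
        groups.setdefault(root, [])
        if entity != root:
            groups[root].append(entity)
    return groups


# Active alarms, keyed by the entity that raised them.
alarms = {
    "switch-port-3": "Port Down",
    "host-17": "Host Unresponsive",
    "pod-cu-0": "Pod Failed",
    "cnf-cu": "CNF Impaired",
    "gnodeb-service": "gNodeB Service Impaired",
}

for root, symptoms in correlate(alarms).items():
    print(f"Root cause: {alarms[root]} on {root}")
    for entity in symptoms:
        print(f"  symptom: {alarms[entity]} on {entity}")
```

In this toy model, the five alarms collapse into a single group whose root is the Port Down alarm, which mirrors what the Notification Log View presents to the NOC team.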
Figure: Telco Cloud Operations NOC Topology Map View
At the same time, the NOC team can visualize the impact of the issue in a topology map as shown above. The topology map shows the Port Down issue and the health of various layers of the infrastructure, including services.
Providing real-time visibility and actionable insights into faults is not enough. Meeting SLA and service quality expectations in real time requires automated remediation across the infrastructure, orchestration and service layers.
VMware Telco Cloud Operations provides a remediation policy framework that automates the processes and procedures for common NOC faults that can be handled without human involvement. Administrators can define policies to allow automatic remediation actions when specific infrastructure faults occur that affect service. Different automated remediation actions can be taken based upon the duration of the problem. The framework also manages the remediation process by providing alerts and notifications, dispatch for repair, etc.
Site Router Remediation Example
The figures below illustrate an automated policy configured by an administrator to remediate a critical “Site Router Down” failure in a regional data center. The root cause analysis detects the issue in the physical infrastructure and, if the router stays down for more than a minute, Telco Cloud Operations automatically reboots it.
- As shown in the figure above, the Telco Cloud Operations remediation policy is a collection of levels (Level 0, Level 1, etc.) that is activated by a specific root cause notification (Router Down).
- Each level in the path is associated with a specific set of actions (e.g., Check Router Down, Reboot Router, Open Ticket).
- Each level can specify a time period and an action or set of actions to be executed.
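The level structure described in the bullets above could be modeled roughly as follows. This is a minimal sketch and not the product's policy format: the `Level` and `RemediationPolicy` classes, the `current_level` helper, and the timings for Level 0 and Level 2 are assumptions made for illustration. Only the Router Down trigger, the action names, and the one-minute wait before rebooting come from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    after_seconds: int  # how long the fault must persist before this level applies
    actions: tuple      # action names to execute at this level


@dataclass(frozen=True)
class RemediationPolicy:
    trigger: str   # root cause notification that activates the policy
    levels: tuple  # assumed to be sorted by after_seconds, ascending


def current_level(policy, notification, elapsed_seconds):
    """Return (index, level) for the highest level whose wait time has
    elapsed, or None if the notification does not activate this policy."""
    if notification != policy.trigger:
        return None
    reached = [
        (index, level)
        for index, level in enumerate(policy.levels)
        if elapsed_seconds >= level.after_seconds
    ]
    return reached[-1] if reached else None


# Timings other than the one-minute reboot threshold are assumed values.
policy = RemediationPolicy(
    trigger="Router Down",
    levels=(
        Level(after_seconds=0, actions=("Check Router Down",)),
        Level(after_seconds=60, actions=("Reboot Router",)),
        Level(after_seconds=300, actions=("Open Ticket",)),
    ),
)

for elapsed in (10, 90, 600):
    index, level = current_level(policy, "Router Down", elapsed)
    print(f"{elapsed:>4}s down -> Level {index}: {', '.join(level.actions)}")
```

Keeping the policy as plain data makes it easy to see how the response escalates with the duration of the fault: checks first, an automated reboot next, and a ticket only if the problem persists.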
Kubernetes CU Node Remediation Example
In this example, the remediation action depends on the underlying root cause.
For example, a Centralized Unit (CU) Node failure may be due to the root cause failure of an underlying connected router or VM/Hypervisor. The administrator can configure a remediation policy to identify the root cause of the CU node failure and the actions required to remediate the root cause.
- If the root cause of the CU node failure is Router Down, trigger an automated remediation that restarts the router, restoring connectivity and returning the CU node to a healthy state. This is shown in Level 3 in the figure above.
- If the root cause of the CU node failure is Hypervisor Down, trigger a Telco Cloud Automation workflow that restarts the hypervisor or host.
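The two cases above amount to choosing a remediation based on the root cause rather than on the symptom. A minimal sketch of that dispatch is shown below. The handler functions are hypothetical placeholders that only return a description of what they would do; they do not call any real router, Telco Cloud Operations or Telco Cloud Automation API, and the target names are invented.

```python
def restart_router(target):
    # Placeholder: a real deployment would call the router's management interface.
    return f"restart router {target}"


def restart_host_via_workflow(target):
    # Placeholder standing in for triggering a Telco Cloud Automation workflow.
    return f"run host-restart workflow for {target}"


# Root cause of the CU node failure -> remediation handler.
REMEDIATIONS = {
    "Router Down": restart_router,
    "Hypervisor Down": restart_host_via_workflow,
}


def remediate_cu_node_failure(root_cause, target):
    """Choose the remediation from the root cause, not from the CU symptom."""
    handler = REMEDIATIONS.get(root_cause)
    if handler is None:
        # Unknown root causes are left to operators in this sketch.
        return f"no automated remediation for '{root_cause}'; escalate to NOC"
    return handler(target)


print(remediate_cu_node_failure("Router Down", "site-router-1"))
print(remediate_cu_node_failure("Hypervisor Down", "host-17"))
print(remediate_cu_node_failure("Disk Full", "host-17"))
```

The same CU node symptom therefore leads to different actions, and any root cause without a configured remediation falls through to manual handling.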
VMware Telco Cloud Operations Automates 5G Assurance
VMware Telco Cloud Platform enables CSPs to automate the critical 5G service assurance tasks of root cause analysis and remediation, thus substantially reducing error-prone and time-consuming manual tasks.
It does this by automatically:
- Correlating symptoms across the infrastructure stack layers
- Identifying the problem’s root cause
- Notifying operations of the issue and its associated impacts
- Restoring service by restarting applications, initiating failover processes or rerouting services around failed components
- Notifying customers of service outages and providing status updates
- Dispatching maintenance staff with a description of the infrastructure failures impacting the service
- Alerting successive layers of management of unresolved problems to ensure that they receive the proper level of attention
- Establishing remediation policies and escalation based upon the duration of the notification
To learn more about VMware Telco Cloud Operations, you can visit our website.
If you are attending MWC Barcelona, please reach out to our sales representative to schedule a tour. Don’t forget to check out our VMware Telco Cloud Operations demo at booth #3M11.