This week at ACM SIGCOMM 2023 in New York City, a team of collaborators from VMware and the University of Illinois Urbana-Champaign will unveil a research paper on a new learning-based approach to cloud network troubleshooting. The techniques in the paper, presented on Tuesday, September 12, will allow enterprise teams to become more proactive about improving the performance of their distributed cloud applications and network infrastructure.
SIGCOMM is one of the top annual conferences in the field of computer networking research. The paper “Murphy: Performance Diagnosis of Distributed Cloud Applications” describes an automated system for diagnosing performance issues using telemetry from enterprise infrastructure environments. The system uses a machine learning algorithm based on a Markov Random Field (MRF) that can explain how entities affect each other. This mathematical modeling technique, used in computer science, operations research, and other fields, starts from known relationships between entities (for example, that an application is running on a certain VM or that two VMs have communicated) and learns from observations of the entities’ metrics over time. The result is a model of a complex cloud application and its network that captures how the components depend on one another, which helps the system diagnose issues more accurately.
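To make the MRF idea concrete, here is a minimal toy sketch in Python. It is not Murphy's algorithm: the entity names, the edges, and the `COUPLING` strength are all hypothetical, the potentials are set by hand rather than learned from metric time series, and each entity has only two states (healthy or anomalous). It shows only how known relationships between entities let an observation about one entity change the belief about its neighbors.

```python
from itertools import product

# Hypothetical entities and known relationships (illustrative, not from the paper).
ENTITIES = ["app", "vm1", "vm2", "switch"]
EDGES = [("app", "vm1"), ("vm1", "vm2"), ("vm2", "switch")]

# Unary potentials per state (index 0 = healthy, 1 = anomalous).
# Neutral here; a real system would derive these from observed metrics.
UNARY = {name: (1.0, 1.0) for name in ENTITIES}

# Assumed pairwise potential: related entities are 3x more likely to agree.
COUPLING = 3.0


def joint_weight(assignment):
    """Unnormalized MRF weight: product of unary and pairwise potentials."""
    weight = 1.0
    for name in ENTITIES:
        weight *= UNARY[name][assignment[name]]
    for a, b in EDGES:
        weight *= COUPLING if assignment[a] == assignment[b] else 1.0
    return weight


def anomaly_marginals(evidence):
    """P(entity is anomalous | evidence), by enumerating every joint state.

    Brute-force enumeration is exponential in the number of entities, so it
    only suits toy graphs; it is used here purely for clarity.
    """
    hidden = [n for n in ENTITIES if n not in evidence]
    totals = {n: 0.0 for n in hidden}
    norm = 0.0
    for states in product((0, 1), repeat=len(hidden)):
        assignment = {**evidence, **dict(zip(hidden, states))}
        w = joint_weight(assignment)
        norm += w
        for n in hidden:
            if assignment[n] == 1:
                totals[n] += w
    return {n: totals[n] / norm for n in hidden}


if __name__ == "__main__":
    # Observe that the app is anomalous, then rank the other entities.
    marginals = anomaly_marginals({"app": 1})
    for name, p in sorted(marginals.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {p:.4f}")
```

In this toy chain, the entity directly related to the anomalous application ends up with the highest anomaly probability, and the effect weakens with each additional hop. A production system would need learned potentials and a scalable inference method in place of enumeration.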
Co-authors on the paper are Vipul Harsh (University of Illinois Urbana-Champaign and VMware), Wenxuan Zhou (VMware), Sachin Ashok (University of Illinois Urbana-Champaign), Radhika Niranjan Mysore (VMware Research), Brighten Godfrey (University of Illinois Urbana-Champaign and VMware), and Sujata Banerjee (VMware Research).
Figure 1. From left to right, Sujata Banerjee, Sachin Ashok, Vipul Harsh, and Brighten Godfrey.
“With this technology, operators can automatically reason about anomalous behaviors of system components such as server overload or high response times for client requests, among others,” says Vipul Harsh. “Murphy leverages ideas from the theory of probabilistic graphical models to precisely capture the complex inter-dependencies among system components, which results in a more accurate analysis than alternative approaches.”
“Engineers running large enterprises have multiple issues they must sift through daily to understand distributed applications and infrastructure, which can be laborious,” Brighten Godfrey commented. “This new approach helps identify the key components likely to be underlying problems. In enterprise network monitoring platforms like VMware Aria Operations for Networks, there is a vast amount of time series metric data available about applications, virtual machines, switches, traffic flows, and more, which learning-based approaches can make easier to understand.”
VMware Aria Operations for Networks incorporated elements of this approach in the Network Insights (Beta) feature announced at VMware Explore 2023 (Enterprise Cloud tab, VMware Aria Operations for Networks). Network Insights (Beta) can help with root cause analysis of network traffic, anomalies, and application dependencies.
VMware Group Product Line Manager Ray Belleville mentioned, “Our initial findings with this approach are proving very useful in determining longer chains of causality, which were not possible previously without AI/ML. We’re excited to bring this future innovation to our VMware Aria Operations for Networks customer base.”
The team looks forward to sharing their latest findings with the SIGCOMM community and receiving feedback from other researchers and early adopters of the tool.
Resources
Paper: Murphy: Performance Diagnosis of Distributed Cloud Applications
VMware Aria Operations for Networks webpage
Free 30-day SaaS Trial for VMware Aria Operations for Networks
If you are interested in VMware Research
VMware Puts the Power of Generative AI Within Reach of Any Enterprise