The Service Mesh Interface (SMI) is a popular and valuable effort to create a common, portable set of APIs that allow applications to address multiple service mesh solutions. Although Network Service Mesh (NSM) differs from other service mesh offerings in the networking level it addresses, the NSM team decided that it would be helpful if NSM were also compatible with SMI, so we set about integrating the two.
I want to share some of what we learned from that effort, including a few issues in NSM that we hadn’t paid much attention to before we began the SMI integration.
In case the concept is new to you, a service mesh is an abstraction layer placed on top of microservices that provides benefits like discoverability, resilience, configurability, observability, and security. There are now quite a few different solutions available, including Istio, Linkerd, Consul, Kuma, and more. Each builds its abstraction layer and API in its own way, which makes it hard to switch between them if you need to change the solution you’re using for any reason. SMI addresses that challenge by creating a common, portable set of service mesh APIs that let developers create service mesh-compatible applications without binding them to any specific implementation.
SMI has been around for less than a year and so far covers some of the most common service mesh use cases, such as traffic management, measurement, and access control, across most service mesh offerings. NSM, however, is somewhat different from its peers. Most notably, it covers the lower network layers (Layers 2 and 3), which are not really native to Kubernetes, and thus allows real networking to be created on top of containers. So, while integration would help NSM work more easily with some of the upper-layer service mesh solutions and make it easier to adopt, effecting the integration wasn’t entirely straightforward.
The first thing we focused on was adopting SMI’s metrics specification, which we figured we could apply to observe traffic. Doing that was revelatory; it brought home to us that having networking without good traffic observability is like running containers without logs: when anything goes wrong, you have no easy way to understand what’s happening or what problem you are dealing with.
That led us to look closer at the SMI specification and how it presents metrics. We then compared it with what we were doing. That exposed areas where we could improve our own code.
Our VPP and kernel forwarding planes, for example, reported metrics about the traffic they were carrying, but those numbers weren’t surfaced anywhere or observable in a useful way. You could find them in the logs, but that meant searching through and manually analyzing thousands of lines of log data before you could locate a specific problem or behavior.
That inspired us to make our metrics far more observable and also helped to solve problems we hadn’t noticed before, including fixing an issue with how vpp-agent was sending metrics. These changes had several additional benefits:
- Our metrics were now exposed beyond the cross connect monitor. SMI aligns its metrics with Prometheus, so we started storing our data in Prometheus too, making it compatible with SMI’s API.
- Our cross connect data became more descriptive and informative when analyzing any specific connection.
With the Prometheus integration, we effectively achieved our goal of integrating NSM with SMI. But we had another problem. We could observe cross connect data, but it wasn’t very useful from a user perspective because the only information we recorded for identifying a cross connect source or destination was the network namespace, which you can’t use to observe traffic between specific pods. In other words, the data wasn’t very searchable or human-readable.
That led us to make one other change: allowing pod names to be exported in the cross connects, which enables a user to find metrics spanning specific pods and namespaces. That further increased NSM’s usability because you can easily identify a client or an endpoint with a specific pod within your cluster.
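To make the difference concrete, here is a minimal sketch of what a cross connect record might look like once pod names are attached. It is an illustration only, not NSM’s actual data model: the `Endpoint` and `CrossConnect` classes, their field names, the `between_pods` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """One side of a cross connect (all field names are hypothetical)."""
    netns: str      # network namespace identifier: the only field we had before
    pod: str        # pod name: the newly exported, human-readable field
    namespace: str  # Kubernetes namespace the pod lives in


@dataclass(frozen=True)
class CrossConnect:
    src: Endpoint
    dst: Endpoint
    rx_bytes: int
    tx_bytes: int


def between_pods(records, src_pod, dst_pod):
    """Return the cross connects whose source and destination pods match."""
    return [r for r in records
            if r.src.pod == src_pod and r.dst.pod == dst_pod]


# Made-up sample data for illustration.
records = [
    CrossConnect(Endpoint("4026532501", "client-a", "default"),
                 Endpoint("4026532640", "endpoint-1", "default"), 1200, 800),
    CrossConnect(Endpoint("4026532777", "client-b", "default"),
                 Endpoint("4026532640", "endpoint-1", "default"), 300, 150),
]

matches = between_pods(records, "client-a", "endpoint-1")
print(matches[0].rx_bytes)  # 1200
```

With only the `netns` field available, the same lookup would require first mapping each namespace identifier back to a pod by hand, which is exactly the step that exporting pod names removes.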
To be more specific, the traffic values sent by the forwarding plane as rx_bytes, tx_bytes, rx_packets, tx_packets, rx_error_packets, and tx_error_packets were now visible in Prometheus for every client-endpoint connection.
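As a rough illustration of how those six values could appear to Prometheus, the sketch below renders one connection’s counters in the Prometheus text exposition format. The counter names come from the list above; the label names (`src_pod`, `dst_pod`), the `render` and `escape_label` helpers, and the sample numbers are assumptions made for this example, and the names NSM actually exports may differ. A real exporter would normally rely on a Prometheus client library rather than formatting lines by hand.

```python
# The six traffic counters reported by the forwarding plane.
COUNTERS = ("rx_bytes", "tx_bytes", "rx_packets", "tx_packets",
            "rx_error_packets", "tx_error_packets")


def escape_label(value):
    """Escape backslash, double quote, and newline in a label value."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def render(sample, labels):
    """Render one connection's counters as text exposition lines."""
    label_text = ",".join(
        f'{key}="{escape_label(val)}"' for key, val in sorted(labels.items())
    )
    return [f"{name}{{{label_text}}} {sample[name]}" for name in COUNTERS]


# Made-up counter values for a single client-endpoint connection.
sample = {"rx_bytes": 1200, "tx_bytes": 800, "rx_packets": 10,
          "tx_packets": 8, "rx_error_packets": 0, "tx_error_packets": 0}
labels = {"src_pod": "client-a", "dst_pod": "endpoint-1"}  # hypothetical labels

lines = render(sample, labels)
print("\n".join(lines))
```

Because each line carries the source and destination pod as labels, a query can select the traffic for one client-endpoint pair directly instead of going through network namespace identifiers.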
As a result of these efforts, NSM will soon offer much better traffic observability and therefore be more useful in real deployments. It will be easier for DevOps users to find and understand any issues that they might be having. And of course, it will be easier for anyone already using SMI to also deploy NSM.
So what started out as an integration effort turned out to be much more beneficial for our project than we expected. We now have the benefit of the integration itself, of course, but everything that was required of us to get there would have been worth doing for its own sake as well.