Summary:
The Monitoring Indicator Protocol is an observability-as-code project that allows developers to define and expose performance, scaling, and service level indicators for monitoring and alerting. It encourages the practices of Site Reliability Engineering (SRE) as defined in Google's freely available SRE book. Service teams within the Cloud Foundry and Kubernetes communities should consider using this tool to help platform engineers operate systems more effectively.
Running Services You Don’t Own Is Hard
As systems grow in complexity, how do you maintain your peace of mind? How do you remain confident that you’ll be alerted when things go wrong? The high-level answer is ‘observability.’ You want systems to be able to tell you when they are unhealthy, and allow you to quickly diagnose problems when they arise.
If you are operating Pivotal Cloud Foundry, you probably have the platform well monitored already. If you’re a PCF Healthwatch user, you get a lot of what you need for very little effort. Not using PCF Healthwatch yet? Check out the documentation to help determine how to monitor and alert on your platform effectively.
But as your PCF foundation grows, and you bring in more services, how do you extend that peace of mind to these new capabilities? These services emit their own metrics, which need to be observed effectively as well. And if you’re keen to adopt Site Reliability Engineering principles, you’ll need to be able to distinguish between important service level indicators (SLIs) and metrics which are more useful for diagnostic purposes. You’ll also need to determine the best alerting thresholds for those SLIs. Inaccurate thresholds can both wake you up in the middle of the night for no reason, and allow you to sleep through production meltdowns.
Even if you have the knowledge, it can be laborious to set up dashboards and alerts for each new service installed. Additionally, it’s possible for metrics to be removed, renamed, or otherwise changed as the platform and its services evolve. Any of those changes can jeopardize your carefully constructed dashboards and alerts. If this sounds complicated, you’re right. But it gets worse. (We’ll explain how it gets better in a moment.)
Let’s consider monitoring and alerting. Assume you can set up all the monitors and alerts you need in a satisfactory fashion. What do you actually do when an alert is triggered? Given that you may not have ready access to the team which created the code, how do you start diagnosing the problem? Is there an easy fix?
More to the point: how are the Cloud Foundry and Kubernetes communities evolving and improving to help you best monitor an individual service?
For our part at Pivotal, we saw a need for a more direct line of communication from service authors (the folks who build and maintain components within these open source projects) to platform engineers, the people who run distributed systems at scale for the world's largest organizations. Let's explore this situation a bit more.
Maintaining an Observability Plan Without Treating It as Code Is Hard
As systems grow and change, so should their monitoring and alerting configurations. If those changes are made manually, it can be very hard to track what changed and when. After all, even a stable and sound observability plan is very difficult to recreate if you lose the server that hosts your hand-crafted configuration!
We’ve seen a broad shift in the industry towards maintaining configuration and infrastructure setup as code. This practice encourages the use of source control management tools to maintain a clear history and reproducible state. It seems like a logical next step to treat our monitoring and alerting configuration as code.
Once service authors are able to provide this sort of configuration alongside their services, there is another logical step. Consumers of services should be able to make additions or modifications to these plans using files. This way, users can maintain customizations in source control.
Existing Tools in This Space
If this is a problem worth solving, what tools do we have? We decided to investigate the existing landscape, to see whether there were open source tools we could use rather than creating something new. The most significant existing effort we found is the Grafana Mixins tool.
Grafana Mixins are a powerful way to express a set of graphs and alerts in Grafana and Prometheus Alertmanager. The tool uses Jsonnet to compose the provided observability plans with customizations to those plans. However, the consumer must still retrieve and configure these plans separately from installing any relevant services. Our goal is to provide a more complete “out-of-the-box” experience.
So a few of us at Pivotal decided to contribute our own solution.
Enter Monitoring Indicator Protocol
Monitoring Indicator Protocol is an effort to alleviate the issues above and give the community an easy, out-of-the-box path to full-scale observability. At its core, Monitoring Indicator Protocol conveys a simple idea: services should be distributed with their observability definitions (also known as an indicator document). Here’s a sample document, trimmed down for simplicity:
---
apiVersion: v0
product:
  name: uaa
  version: 68.0+b1
metadata:
  deployment: cf
indicators:
- name: uaa:throughput
  promql: rate(requests_global_completed_count[1m])
- name: uaa:latency:95p
  promql: quantile_over_time(0.95, latency_uaa[1m])
  thresholds:
  - level: warning
    gt: 175
The key conceptual component in this example is the indicators list. This is where the service provider will define the SLIs for their service, as well as the relevant alerting thresholds, if any. The documentation section of each indicator definition allows service authors to qualify their service metrics. This way, authors can highlight the why behind each metric, and give consumers an idea of how to address issues which may arise. A more in-depth description of the contents of the indicator document can be found on the wiki.
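For example, the latency indicator from the sample above might carry documentation along these lines. The field names here are illustrative; the wiki describes the exact schema:
indicators:
- name: uaa:latency:95p
  promql: quantile_over_time(0.95, latency_uaa[1m])
  thresholds:
  - level: warning
    gt: 175
  # Illustrative documentation fields; consult the wiki for the real schema.
  documentation:
    title: UAA Request Latency (95th Percentile)
    description: The 95th percentile of UAA request latency over the last minute.
    recommended_response: >
      If latency stays above the warning threshold, check UAA resource usage
      and consider scaling the UAA instance group.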
We understand that everyone has unique needs, so platform engineers will need to customize the indicator definitions provided by service authors. For example, a requests-per-second critical threshold of 7500 might be a good default for some service. But in a beefier environment, a single node may be able to handle 10,000 requests per second, and alerting and scaling up at 7500 would waste infrastructure on an underutilized system. To accommodate this, we have defined a path for platform engineers to patch any property of an indicator document as needed. The simplest path is to maintain a git repository of these patches and tell the indicator registry how to access that repository. Details can be found on the wiki.
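As a rough sketch of the idea, here is a patch that raises the warning threshold on the sample UAA latency indicator. The patch layout shown here (match, operations) is illustrative; the wiki documents the exact schema:
# Illustrative patch sketch; field names and operation syntax may differ
# from the actual schema documented on the wiki.
apiVersion: v0/patch
match:
  product:
    name: uaa
    version: 68.0+b1
operations:
- type: replace
  path: /indicators/name=uaa:latency:95p/thresholds
  value:
  - level: warning
    gt: 250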
Additionally, we created a tool that allows service authors to validate their indicator definitions, as well as generate an HTML page, a Grafana dashboard, or a Prometheus alert configuration from those definitions. This means that service authors can save time by keeping observability-related information in one file: dashboards, alerts, and documentation. Details on the commands used to accomplish this are found on the wiki as well.
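To give a feel for the output, the Prometheus alert configuration generated from the warning threshold in the sample document might look roughly like this. The real generator may organize groups, labels, and annotations differently:
# Rough sketch of a generated Prometheus alerting rule file; the actual
# generator output may differ in group names, labels, and annotations.
groups:
- name: uaa
  rules:
  - alert: uaa:latency:95p
    expr: quantile_over_time(0.95, latency_uaa[1m]) > 175
    labels:
      level: warning
      product: uaa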
We hope to see service authors embrace Monitoring Indicator Protocol so that whenever a new service is introduced, there is little to no setup necessary to begin monitoring its health. Bundling service versions with their respective indicators would keep upgrades and rollbacks consistent. Need to switch to an older version? No big deal: the old versions of the relevant indicators come back alongside the downgrade. Any customizations can be rolled back using version control as well.
How It Works
Each BOSH VM runs a process called the Monitoring Indicator Protocol Registration Agent. The agent watches the /var/vcap/jobs directory. Any folder found in that directory is inspected for a document called indicators.yml nested inside a config folder. If an indicator document is located, the agent parses the document and sends the results to the Indicator Registry. This process is repeated every 2 minutes.
In the example below, the registration agent running on the RabbitMQ Server VM finds a document at /var/vcap/jobs/rabbit/config/indicators.yml and sends the data to an indicator registry. The registry runs on its own BOSH-managed VM, and is the source of truth for indicator definitions. State is maintained in memory, and indicator documents are evicted if not refreshed within 2 hours.
The indicator registry exposes a /indicator-documents endpoint, which returns a JSON representation of the indicator documents. In the example diagram below, PCF Healthwatch consumes the results of this endpoint and creates a dashboard for each document. Each indicator is rendered as a chart on the dashboard. The chart data is populated by the PromQL query provided for each indicator. PCF Healthwatch uses the thresholds section of each indicator to set up alerting rules based on the results of the query.
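The payload is essentially the registered documents echoed back. A single entry might look roughly like the following, shown in YAML form for readability; the endpoint itself returns the equivalent JSON, and the exact envelope fields are defined by the registry, so treat this as illustrative:
# Illustrative shape only; the registry serves this data as JSON.
- apiVersion: v0
  product:
    name: uaa
    version: 68.0+b1
  metadata:
    deployment: cf
  indicators:
  - name: uaa:latency:95p
    promql: quantile_over_time(0.95, latency_uaa[1m])
    thresholds:
    - level: warning
      gt: 175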
Monitoring Indicator Protocol in the Wild
PCF Healthwatch is particularly well suited to adopt Monitoring Indicator Protocol. As an aggregator, PCF Healthwatch needs to monitor and display the health of many different PCF services. As you can imagine, this involves lots of coordination across service teams.
A PCF Healthwatch dashboard, powered by Monitoring Indicator Protocol.
If we decouple health metric definitions from PCF Healthwatch, you can see how the module could become a broader tool capable of monitoring many of the different products that run atop Pivotal Cloud Foundry.
In PCF Healthwatch 1.5 and above, the Monitoring Indicator Protocol is supported out of the box. Any indicator definitions included with services are automatically registered in a new dashboard. PCF Healthwatch, PCF Metrics, and RabbitMQ for PCF are the first services that export indicator definitions. We expect other components and tiles to follow suit.
What about software that you run on Pivotal Container Service (PKS) and Kubernetes? Wouldn’t it be nice if they could also define indicator documents in a similar fashion, and automagically configure alerts and a dashboard in PCF Healthwatch? To address this, we’ve built a custom resource for k8s along with controllers for Prometheus and Grafana to keep dashboard and alert configuration up-to-date. This is expected to ship in an upcoming version of PKS.
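As a sketch of that idea, an indicator document expressed as a Kubernetes custom resource might look something like the following. The API group, kind, and spec fields are assumptions for illustration; the project's CRD branch defines the real schema:
# Hypothetical example: the apiVersion, kind, and spec fields below are
# illustrative, not the project's actual CRD schema.
apiVersion: indicatorprotocol.io/v1alpha1
kind: IndicatorDocument
metadata:
  name: my-app-indicators
spec:
  product:
    name: my-app
    version: 1.2.3
  indicators:
  - name: my_app_error_rate
    promql: rate(http_server_errors_total[5m])
    thresholds:
    - level: critical
      gt: 10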
So what can you do with this information right now? Check out our repository and wiki. Feeling ambitious? Try out the protocol in Kubernetes using our CRD branch. The best way to reach out or comment is to engage with the team on GitHub.