
Getting Started with Application Monitoring with Prometheus on VMware Enterprise PKS

As monolithic apps are refactored into microservices and orchestrated with Kubernetes, requirements for monitoring those apps are changing. To start, instrumentation to capture application data needs to be at a container level, at scale, across thousands of endpoints. Because Kubernetes workloads are ephemeral by default and can start or stop at any time, application monitoring must also be dynamic and aware of Kubernetes labels and namespaces. A consistent set of rules or alerts must be applied to all pods, new and old. 

Observability should always be top of mind when you’re developing new apps or refactoring existing ones. Maintaining a common layer of baseline metrics that applies to all apps and infrastructure, while still incorporating custom metrics, is extremely desirable. Adding a new metric in response to user feedback should never trigger a major re-plumbing of your monitoring stack.

The open-source community is converging on Prometheus as the preferred solution to address challenges associated with Kubernetes monitoring. The ability to address evolving requirements of Kubernetes while including a rich set of language-specific client libraries gives Prometheus an advantage.

After a quick overview of the Prometheus architecture, this blog uses a self-service registration app written in Node.js to demonstrate how to leverage the Prometheus client library to instrument application latency metrics. You will then see how to visualize the newly defined metrics by using Grafana. All of this runs on top of VMware Enterprise PKS.

This blog assumes you already have a working deployment of Prometheus using the Prometheus Operator. Getting a Prometheus instance running on VMware Enterprise PKS is straightforward; see the step-by-step installation instructions.  

Architecture for Monitoring Kubernetes with Prometheus

Here’s a diagram that shows the architecture for monitoring Kubernetes with Prometheus and displaying the events in Grafana:

Prometheus consists of multiple components:

  • The Prometheus server scrapes and stores time-series data
  • Client libraries instrument application code
  • The Push Gateway accepts metrics pushed from jobs that cannot be scraped, such as short-lived batch jobs
  • Exporters expose metrics from third-party systems
  • The Alertmanager handles alerts and routes notifications

In addition, Grafana, which is a popular open platform for analytics, provides the data visualization layer.

Prometheus offers a built-in Kubernetes integration. It's capable of discovering Kubernetes resources like nodes, services, and pods and capturing metrics from them. Not all the components are required to instrument application latency metrics. This blog focuses on the bare minimum: the Node.js Client Library, the Prometheus Server, and Grafana.  

Instrumenting Default and Custom Metrics

It’s straightforward to get started capturing application metrics with Prometheus. The prom-client npm package is designed to collect metrics from Node.js applications and expose them to Prometheus. You can begin capturing default metrics such as memory usage and heap size by calling collectDefaultMetrics.

Here’s how to enable prom-client in your Node.js app’s source code to expose default application metrics:

const Prometheus = require('prom-client');
const collectDefaultMetrics = Prometheus.collectDefaultMetrics;
collectDefaultMetrics();

Beyond default metrics, a custom histogram works well as a way to create a baseline of application latency. For those unfamiliar with it, a histogram provides a statistical breakdown of the possible results of an event using a set of buckets. Each time an event occurs, we observe the result and assign it to a bucket. For a Node.js web app, that means recording the response time of every request and counting it in the corresponding bucket.
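To make this concrete, here is a minimal sketch of wiring such a histogram by hand with prom-client; the metric name, label names, and buckets are assumptions chosen to mirror the output shown later, and app is assumed to be an Express application:

const Prometheus = require('prom-client');

// Hypothetical latency histogram; the buckets mirror express-prom-bundle's defaults.
const httpRequestDuration = new Prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'duration histogram of http responses',
  labelNames: ['method', 'path', 'status_code'],
  buckets: [0.003, 0.03, 0.1, 0.3, 1.5, 10]
});

// Start a timer when the request arrives; observe the elapsed time once the response finishes.
app.use((req, res, next) => {
  const stopTimer = httpRequestDuration.startTimer();
  res.on('finish', () => {
    stopTimer({ method: req.method, path: req.path, status_code: res.statusCode });
  });
  next();
});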

However, due to the asynchronous nature of Node.js, it can be tricky deciding where to place instrumentation logic to start or stop the application response timers required by a histogram. Luckily, when you use the Express framework for Node.js web apps, express-prom-bundle dramatically simplifies this process. Internally, express-prom-bundle uses prom-client.

You can install this library for your use with this command:

npm install express-prom-bundle --save

After adding the following three lines to server.js, all routes or paths registered with the framework will be measured using dedicated histograms:

const promBundle = require("express-prom-bundle");

const metricsMiddleware = promBundle({includeMethod: true, includePath: true});

app.use(metricsMiddleware);

A quick note on the promBundle settings: includeMethod annotates each histogram with the HTTP method, such as GET or PUT. The includePath setting creates a dedicated histogram for each path. A downside to enabling includePath is a potential metrics explosion, especially for large apps that serve a large number of distinct URLs. Metrics can also get noisy because the setting doesn’t distinguish between internal requests (loading local images, for example) and external ones (database calls to RDS). It is recommended to apply filter rules and leverage path templates to remove noise and normalize the monitoring output, as sketched below.
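For example, express-prom-bundle accepts a normalizePath option that rewrites observed paths before they are recorded; the regex and replacement below are hypothetical and should be adapted to your own routes:

const metricsMiddleware = promBundle({
  includeMethod: true,
  includePath: true,
  // Collapse per-user URLs such as /user/42 into a single /user/#id histogram
  normalizePath: [
    ['^/user/[0-9]+$', '/user/#id']
  ]
});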

Setting Up Metrics Collection and Retrieval

Prometheus uses the HTTP pull model, which means that applications need to expose a GET /metrics endpoint that can be periodically fetched by the Prometheus instance. If you are using prom-client without express-prom-bundle, you need to define the following:

app.get('/metrics', (req, res) => {
  res.set('Content-Type', Prometheus.register.contentType);
  res.end(Prometheus.register.metrics());
});

// Listen on port 1112
app.listen('1112', () => {
  console.log('Server started on port 1112');
});

With express-prom-bundle, auto registration of the /metrics endpoint is on by default.
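Putting the pieces together, a minimal, self-contained server.js might look like the following sketch; the /user route is a hypothetical stand-in for your app’s own routes:

const express = require('express');
const promBundle = require('express-prom-bundle');

const app = express();

// The middleware measures every registered route and serves GET /metrics automatically.
const metricsMiddleware = promBundle({ includeMethod: true, includePath: true });
app.use(metricsMiddleware);

// Hypothetical application route
app.get('/user', (req, res) => res.send('user profile'));

app.listen(1112, () => console.log('Server started on port 1112'));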

Viewing the Metrics Output

Once metrics are enabled, you can view the sample metrics at http://<app-external-ip>:1112/metrics. Below is an abridged output capture. You will notice a dedicated histogram for each page; included with each histogram are the path, method, response code, and how long requests took to respond.

# HELP http_request_duration_seconds duration histogram of http responses labeled with: status_code, method, path
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.3",path="/user",method="GET",status_code="304"} 0
http_request_duration_seconds_bucket{le="1.5",path="/user",method="GET",status_code="304"} 1
http_request_duration_seconds_sum{status_code="304",method="GET",path="/user"} 0.733056076
http_request_duration_seconds_count{status_code="304",method="GET",path="/user"} 1

http_request_duration_seconds_bucket{le="1.5",path="/get-user-cluster",method="POST",status_code="302"} 0
http_request_duration_seconds_bucket{le="10",path="/get-user-cluster",method="POST",status_code="302"} 1
http_request_duration_seconds_bucket{le="+Inf",path="/get-user-cluster",method="POST",status_code="302"} 1
http_request_duration_seconds_sum{status_code="302",method="POST",path="/get-user-cluster"} 5.316734588
http_request_duration_seconds_count{status_code="302",method="POST",path="/get-user-cluster"} 1

Monitoring with Prometheus

We will use the Prometheus Operator CRD, ServiceMonitor, to discover metric endpoints. An endpoint consists of a namespace, a service selector, and a port. In our example, sample-app is the name of the namespace, sample-svc is the label that selects the service, and sample-svc is also the name of the port to scrape for metrics. Here is the service definition:

  1. apiVersion: v1
  2. kind: Service
  3. metadata:
  4.   name: sample-svc
  5.   namespace: sample-app
  6.   labels:
  7.     app: sample-svc
  8. spec:
  9.   ports:
  10.   # the port that this service should serve on
  11.   - name: sample-svc
  12.     port: 1112
  13.     protocol: TCP
  14.     targetPort: 1112
  15.   selector:
  16.     app: sample-app
  17.   type: LoadBalancer

When used with the Prometheus Operator, the Prometheus server will add endpoints that match the ServiceMonitor criteria for monitoring. We need a ServiceMonitor definition that matches the service definition outlined in the previous step. Specific to our example, create a ServiceMonitor YAML definition similar to the one below and name it app-servicemonitor.yaml. Lines 14 – 20 of the ServiceMonitor config should match the corresponding service definition in lines 4 – 7 and 11 – 14.

  1. apiVersion: monitoring.coreos.com/v1
  2. kind: ServiceMonitor
  3. metadata:
  4.   labels:
  5.     serviceapp: sample-app-servicemonitor
  6.     app: sample-svc
  7.     release: prometheus
  8.   name: sample-app-servicemonitor
  9.   namespace: monitoring
  10. spec:
  11.   namespaceSelector:
  12.     matchNames:
  13.     - monitoring
  14.     - sample-app
  15.   selector:
  16.     matchLabels:
  17.       app: sample-svc
  18.   endpoints:
  19.   - port: sample-svc
  20.     interval: 10s
  21.   targetLabels:
  22.   - sample-app
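Apply the definition to the cluster; this assumes you saved it as app-servicemonitor.yaml, as named above:

kubectl apply -f app-servicemonitor.yaml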

You can log into the Prometheus UI to confirm whether Prometheus successfully discovered application pods that are mapped to the service definition. By default, the Prometheus installation does not expose UI access from outside the cluster. Enable it with port-forwarding:

kubectl port-forward prometheus-prometheus-oper-prometheus -n monitoring 9090:9090

Once port-forwarding is enabled, access the Prometheus UI at http://127.0.0.1:9090.

Prometheus will scrape the newly discovered endpoint based on the interval defined in the ServiceMonitor definition. Captured data can be visualized in Grafana as a custom dashboard.  

Visualizing the Data in Grafana

Grafana is an open-source visualization tool that can be used on top of a variety of different time series data sources, including Prometheus. Grafana allows application owners to build comprehensive application dashboards, leveraging graph, table, heatmap, and free-text panel types to visualize key performance metrics. Grafana is bundled with the Prometheus Operator, which creates, configures, and manages Prometheus clusters on Kubernetes. The default Prometheus Operator installation includes Kubernetes cluster health monitoring.

To create a dashboard in Grafana, log in to the Grafana console. Click the + sign, select Dashboard, and then click the Graph option. An empty panel opens. Click Panel Title and choose Edit. The instrumented metrics are available in the query expression window:
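For example, a panel that charts the 95th-percentile latency of the hypothetical /user path from our histogram could use a PromQL query along these lines:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{path="/user"}[5m])) by (le))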

Repeat this procedure for all your paths. Here is a sample dashboard:

Wrapping It Up

As we develop new apps or refactor existing apps, observability should always be top of mind. With the ability to address an evolving set of requirements for Kubernetes, combined with a rich set of language-specific client libraries, Prometheus is making instrumentation easy and painless when used on top of VMware Enterprise PKS. Get started today at https://github.com/CNA-Tech/Apps-on-PKS/tree/master/prometheus.