When your cluster is properly configured and working as intended, Kubernetes can be a beautiful thing. Inevitably, however, your Kubernetes cluster will break for any number of reasons. Because Kubernetes is a complex system, debugging a broken cluster is a multi-step endeavor that usually starts with collecting machine states and other diagnostic data for analysis by an operations or customer reliability team.
As part of the Tanzu umbrella of open source projects, VMware created a new open source project – Crash Recovery and Diagnostics for Kubernetes (or Crash Diagnostics for short). This project is designed to help troubleshoot problem clusters by automating the collection of machine states and diagnostic data from unstable or inoperable clusters.
Read more from the Crash Diagnostics GitHub repository.
Yet Another Diagnostic Tool?
You may be wondering at this point: Does the Kubernetes community need yet another diagnostics tool? That would be a fair question, as there are several tools available that are usually deployed as pods to help diagnose a running cluster. One example is Sonobuoy, an open source project from VMware that runs Kubernetes conformance tests and other plugins.
Crash Diagnostics, however, runs outside the cluster and investigates problems where the cluster may be partially or completely non-operational.
Collecting Troubleshooting Information
A series of commands is declared in a diagnostics file, which specifies the resources to collect from cluster machines. Like a Dockerfile, the diagnostics file is a collection of line-by-line directives with commands that are executed on each specified cluster machine. The output of the commands is then added to a tar file and saved for further analysis.
For instance, when the following diagnostics file (saved as Diagnostics.file) is executed, it collects information from the two cluster machines specified with the FROM directive:
ENV remoteuser=adminop
FROM 192.168.176.100:22 192.168.176.102:22
AUTHCONFIG username:${remoteuser} private-key:${HOME}/.ssh/id_rsa
WORKDIR /tmp/crashout
# copy log files
COPY /var/log/kube-apiserver.log
COPY /var/log/kube-scheduler.log
COPY /var/log/kube-controller-manager.log
COPY /var/log/kubelet.log
COPY /var/log/kube-proxy.log
# Capture service status output
CAPTURE journalctl -l -u kubelet
CAPTURE journalctl -l -u kube-apiserver
# Collect docker-related logs
CAPTURE journalctl -l -u docker
CAPTURE /bin/sh -c "docker ps | grep apiserver"
OUTPUT ./crash-out.tar.gz
On a machine with SSH access to the cluster nodes, the previous diagnostics file can be executed as follows:
$ crash-diagnostics --file Diagnostics.file
When the diagnostics file above is executed, the following actions take place:
- ENV declares a named variable that can be referenced throughout the file.
- FROM declares the machines on which commands will be executed.
- AUTHCONFIG configures an SSH connection that will be used to connect to the node machines.
- WORKDIR specifies a temporary location where gathered files are staged.
- The COPY commands collect Kubernetes log files for the API server, scheduler, controller manager, kubelet, and kube-proxy.
- The CAPTURE commands execute the specified commands on each node and capture the results in files that are bundled into the tar file.
- Lastly, the OUTPUT directive specifies the name and location for the generated archive file.
The Diagnostics File
Currently, the diagnostics file supports a small but powerful set of directives, including:
- AUTHCONFIG – configures the user and key used for the SSH connection.
- CAPTURE – runs a command and captures the result in a file.
- COPY – specifies files to copy from each machine.
- ENV – declares environment variables.
- FROM – lists machine addresses from which to retrieve data.
- OUTPUT – specifies the output tar file to create.
- RUN – runs the specified command on each machine (see the snippet after this list).
- WORKDIR – specifies the staging directory from which the output file is created.
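For example, RUN can be paired with COPY to stage data on a node before collecting it. The following snippet is a minimal sketch, not taken from the project's documentation: the dmesg command and file paths are illustrative assumptions, and it presumes that RUN, unlike CAPTURE, executes its command without adding the output to the archive.
ENV remoteuser=adminop
FROM 192.168.176.100:22
AUTHCONFIG username:${remoteuser} private-key:${HOME}/.ssh/id_rsa
WORKDIR /tmp/crashout
# Assumed behavior: RUN executes on the node but does not archive its output
RUN /bin/sh -c "dmesg > /tmp/dmesg.txt"
# COPY then picks up the file that RUN produced
COPY /tmp/dmesg.txt
OUTPUT ./run-example.tar.gz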
These directives allow you to automate the collection of valuable machine states regardless of whether the cluster is stable or not.
As shown in the earlier example, the diagnostics file also supports variable expansion, which provides a familiar feel for those who routinely use shell scripts. This variable expansion is demonstrated in the following snippet:
AUTHCONFIG username:${remoteuser} private-key:${HOME}/.ssh/id_rsa
The value of ${remoteuser} is resolved as the variable named remoteuser, which was declared with ENV in the previous example. The diagnostics file can also access predeclared variables, such as ${USER}, ${HOME}, and ${PWD}.
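As a further illustration, declared and predeclared variables can be mixed. This snippet is hypothetical (the logdir variable and the paths are ours, not from the project docs); it uses the predeclared ${USER} and ${HOME} alongside a variable declared with ENV:
# logdir is declared here; USER and HOME are predeclared by the tool
ENV logdir=/var/log
FROM 192.168.176.100:22
AUTHCONFIG username:${USER} private-key:${HOME}/.ssh/id_rsa
WORKDIR /tmp/crashout
COPY ${logdir}/kubelet.log
OUTPUT ${HOME}/crash-out.tar.gz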
Project Roadmap
This project is only a few months old and has already found some interesting and critical uses among early adopters. There are, however, big plans to make this project a great tool for the community at large. Here are a few items that we are considering implementing with help from contributors:
- Troubleshooting recipes – a collection of diagnostics files that can help solve common Kubernetes cluster issues.
- Tighter integration with Kubernetes – implementations of Kubernetes-specific directives to help extract cluster information directly from a running API server if available.
- Pluggable backend – investigation of a pluggable internal backend that could use mechanisms other than SSH to reach remote machines.
- Preliminary diagnosis – analyze the collected data for known and common problem patterns.
Getting Involved
Although we are just getting started, we look forward to contributors joining the project and shaping its direction and the community around it. You can:
- Try out the latest release from GitHub.
- Share a diagnostics file and the problem it helped solve.
- Come chat with us in #crash-diagnostics on the Kubernetes Slack.
- Collaborate with us on GitHub by opening an issue or creating a pull request.