
Learning Paths: Kubernetes Debugging Workflow and Techniques

A fundamental skill needed by all practitioners deploying to Kubernetes is
debugging issues as they arise on the Kubernetes platform. Issues can range
from application deployment issues to Kubernetes system issues to network issues.
The question is: what is a good starting point for debugging?

What You Will Learn

In this section, you will learn:

  • The general workflow and Kubernetes commands to debug an issue.
  • How to access containers.
  • How to access the Kubernetes nodes.

Tools

To help solidify your learning, it is recommended that you learn
these commands, because they work on vanilla Kubernetes and the majority of Kubernetes distributions.

Note: A discussion about ephemeral debug containers appears later in this section. It is
recommended that you use Kubernetes version 1.23 or later, where the beta release of this feature
is enabled by default. If your Kubernetes version is older than 1.23, you must manually
enable this feature.
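For example, on such older versions, the EphemeralContainers feature gate must be enabled on the
relevant components. How these flags are passed depends on your installer or distribution, so
treat the following only as a sketch:

# A sketch only; the exact mechanism for setting flags depends on your distribution.
kube-apiserver --feature-gates=EphemeralContainers=true ...
kubelet --feature-gates=EphemeralContainers=true ...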

The General Kubernetes Debugging Workflow and Techniques

The general workflow is to get the events, learn about the pod state, and
then get the pod logs. If these steps do not provide sufficient information,
further steps are needed, such as checking the application
configuration and network connectivity from within a container.

  1. Always run:

    kubectl get events -n <namespace> --sort-by .lastTimestamp [-w]
    

    This provides a list of events that occur in a namespace. The most
    recent event is at the bottom of the list. Adding the optional -w flag
    will allow the output to be watched.

  2. Run:

    kubectl describe pod/<pod-name> -n <namespace>
    

    to get more details of a pod.

  3. Run:

    kubectl logs <pod-name> -n <namespace> [name-of-container, if multiple] [-f]
    

    This gets logs from a pod. Adding the optional -f flag enables
    tailing of the logs. Note that logs are only served by the pod API, but you
    can still select a pod to get logs from via a deployment or daemonset. For example:

    kubectl logs ds/<daemonset-name> -n <namespace> [name-of-container, if multiple] [-f]
    

    A useful command-line tool is stern, because it allows tailing of logs from
    multiple pods and containers (a stern example is included in the combined
    sketch after this workflow). Stern can also apply regular expressions to
    filter specific logs. Kail is another, similar log helper tool. For most
    debugging purposes, try to limit the logging scope to avoid cognitive
    overload that may distract you from finding the real problem.

    If you cannot access the cluster, check the log aggregation
    platform the cluster is connected to, if any, for example, Splunk or ELK if they
    are set up. From the log platform, try searching by the node name/IP and/or
    components such as kubelet and kube-proxy. If the cluster is not accessible
    because kubectl commands do not work, you must access the cluster nodes directly to
    find more details. See “Accessing Nodes” for more information.

    If the cluster is still accessible and the existing logs point to another
    system issue, it is possible to access the pod’s container to perform tests, such
    as validating the state of a running process and its configuration, or checking the
    container’s network connectivity. See “Accessing Containers”
    for more information.
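As a combined sketch of steps 1 to 3, assuming a hypothetical namespace demo and a failing pod web-0:

kubectl get events -n demo --sort-by .lastTimestamp   # 1. recent events, newest last
kubectl describe pod/web-0 -n demo                    # 2. pod status, conditions, restart counts
kubectl logs web-0 -n demo -f                         # 3. tail the pod's logs
stern "web-.*" -n demo                                # optional: tail logs across all matching pods with stern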

The above is a good general workflow to start debugging. The documentation
on Kubernetes Monitoring, Logging and Debugging
provides a good outline of other debugging approaches and techniques that could be
utilized.

Accessing Containers

The goal here is to access the container that an application is running in and
view log files and/or perform further debugging commands, such as validating
the running configuration.

The following are some techniques for getting into containers.

Use kubectl exec

Accessing a container can be achieved through the kubectl exec command.
However, this may not be permitted on a cluster due to RBAC
permissions. To exec into a container, run:

kubectl exec -it -n <namespace> <pod-name> [-c <container-name>] "--" sh -c "clear; (bash || ash || sh)"

Note: The above exec command is from Lens. Using
tools like Lens can make this a lot easier via a GUI.
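Once inside a container, the useful checks depend on the application. The following is only a
rough sketch, where the paths, ports, and service names are placeholders and the tools must
exist in the image:

ps aux                                    # is the expected process running?
cat /etc/<app>/<config-file>              # validate the running configuration
curl -v http://localhost:<port>/healthz   # probe a local endpoint, if curl is present
nc -vz <service-name> <port>              # check connectivity to another service, if netcat is present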

Sometimes there is no accessible shell in the image. Another method is to
start a pod from a container image that has all the network debugging tools. The
netshoot container image has a number of
tools that make it easy to debug network issues. You can instantiate a
netshoot pod (with a long-running command so the pod stays up) and exec into it
to perform tests as follows:

kubectl run netshoot --image=nicolaka/netshoot --command -- sleep 86400
kubectl exec -it netshoot "--" sh -c "clear; (bash || ash || sh)"

However, this is a new pod that does not have access to the
target pod’s filesystem. This approach is useful for debugging aspects such as
network connectivity. A deeper dive into these aspects as problem sources is explored in
the next learning path on the Heuristic Approach.

Using Ephemeral Debug Containers

As discussed in the previous section, kubectl exec may be insufficient
because the target container has no shell, or because the container has already
crashed and is inaccessible. One workaround is to deploy a different image
via kubectl run, but this creates a separate pod that does not share
the process namespace of the pod of interest, which means you
cannot access that pod’s process details or filesystem. To alleviate
this limitation, use ephemeral debug containers.

Introduced as an alpha in Kubernetes 1.16 and currently a beta in Kubernetes 1.23,
ephemeral containers are helpful for troubleshooting issues. As a beta, the
EphemeralContainers feature gate is enabled by default in Kubernetes 1.23.
On older versions, you must enable the feature gate first before it can be used.
The documentation for debug containers
shows usage examples demonstrating how to use debug containers. Those examples are
included in the subsections that follow, adapted to align with this article.
Ephemeral containers are similar to regular containers. The difference is that they are
limited in functionality. For example, ephemeral containers have no restart capability,
no ports, no scheduling guarantees, and no startup/liveness probes. To view a list of
limitations, see What is an Ephemeral Container?.

Compared to what was previously done with netshoot and kubectl exec, the key
feature here is that an ephemeral container can be attached to the pod that you
want to debug (the target pod) by sharing its process namespace.
This means that it is possible to see a target pod container’s processes
and filesystem. There are a number of variations of the kubectl debug
command, and their usage depends on what is required for debugging. One
additional note to consider is that debug containers continue to run
after you exit and need to be manually deleted, for example with kubectl delete pod <name-of-debug-pod>.

The Target Container is Running, But Does Not Have a Shell

A container with no shell means that it is not possible to kubectl exec into the
container using a shell command, such as sh or bash. For this scenario,
run the following command:

kubectl debug <name-of-existing-pod> -it --image=nicolaka/netshoot --target=<name-of-container-in-existing-pod>

This utilizes the netshoot image, with all the network debugging utilities,
as a separate ephemeral container running in the existing pod named
<name-of-existing-pod>. It also opens an interactive command prompt using
-it. --target lets you target the process namespace of the
other existing running container, whose name is usually the same as the pod’s.
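Once attached, the shared process namespace means the target container's processes are visible,
and its filesystem can be reached through /proc; for example:

ps aux                        # the target container's processes are listed here
ls /proc/<target-pid>/root/   # the target container's filesystem, using a PID taken from ps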

The Target Container Has Crashed or Completed and/or Does Not Have a Shell

This scenario introduces the problem where a container has crashed or completed
and it is not possible to kubectl exec into it because the container
no longer exists. You can address this by making a
copy of the target pod with an ephemeral container attached, to inspect its
process and filesystem. Use the following command:

kubectl debug <name-of-existing-pod> -it --image=nicolaka/netshoot --share-processes --copy-to=<new-name-of-existing-pod>

This creates a debug container using netshoot. The debug container shares the process
namespace of a new pod named <new-name-of-existing-pod>, which is a copy of
<name-of-existing-pod>.

The Target Container Has Crashed or Completed, and the Container Start Command Needs to be Changed

This scenario is different from the previous one in that the startup command
needs to be changed, either to ensure that the container remains running or so that
additional information is extracted. This is achieved through a copy of the target pod,
without affecting the original target pod. One situation where this is particularly useful
is when a target container has crashed or ends immediately on startup and the goal is to
interactively try out the process or review its filesystem. To do this, the startup
command requires a change that does not end the container, or a command
that helps to provide more information, for example, changing an application's debug
verbosity parameter, issuing a long sleep command, or running a shell.
You can achieve this with the following command:

kubectl debug <name-of-existing-pod> -it --copy-to=<new-name-of-existing-pod> --container=<name-of-container-in-pod>  -- sh

Take note of the command specifier -- sh. The example changes the start
command to call the shell on the copied container. This may need to be replaced
with an appropriate start command that enables debugging of the pod.
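If an interactive shell is not required, a long sleep is another way to keep the copied
container alive for later inspection. A sketch, assuming the container image provides a
sleep binary:

kubectl debug <name-of-existing-pod> --copy-to=<new-name-of-existing-pod> --container=<name-of-container-in-pod> -- sleep 86400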

The Target Container Needs to Utilize a Different Container Image

This scenario creates a copy of the target pod with a different container
image. This is useful in situations where a production container may not contain
all the utilities that allow debugging, or does not produce debug-level output.
It is run with the following command:

kubectl debug <name-of-existing-pod> --copy-to=<new-name-of-existing-pod> --set-image=*=nicolaka/netshoot

The difference is the parameter --set-image=*=nicolaka/netshoot,
which works the same way as kubectl set image: it replaces all
existing container images, specified by *, with the new image
nicolaka/netshoot.
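To replace only a specific container's image rather than all of them, the wildcard can be
swapped for a container name, for example:

kubectl debug <name-of-existing-pod> --copy-to=<new-name-of-existing-pod> --set-image=<name-of-container-in-pod>=nicolaka/netshoot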

Accessing Nodes

The goal of accessing the nodes is to ascertain why a pod running on the node is
failing, or why a particular node is failing. Two areas that you can check are the
log files of the pods, and the process status of Kubernetes static pods and
services. You can also run debugging commands, such as netcat
and curl, to debug network connectivity issues if these tools exist on
the host.

To determine where the log files are located, it is recommended to check the
documentation of the components to see which log files to access.
For most distributions of Kubernetes, log files can be found within
/var/log/. For the containerd container runtime, /var/log/pods holds pod
logs per namespace and container, and /var/log/containers contains symlinks to the
container logs in /var/log/pods.
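On a node, a quick look at these directories might resemble the following; the exact layout
can vary slightly between distributions:

ls /var/log/pods                              # one directory per <namespace>_<pod-name>_<pod-uid>
ls /var/log/containers                        # symlinked log files, one per container
tail -f /var/log/containers/<pod-name>*.log   # follow a specific pod's container logs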

Key Kubernetes components, such as the kubelet and containerd (for the containerd
container runtime), are initialized by systemd in most distributions.
To retrieve logs so that you can inspect the journals of these components,
run the following command:

journalctl -u <name-of-component>

Usually, <name-of-component> is just kubelet. If that doesn’t work, you can
find the systemd unit for the kubelet by grepping for kubelet inside
/etc/systemd/system.
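For example, the following locates kubelet-related unit files and services:

grep -r -l kubelet /etc/systemd/system/
systemctl list-units --type=service | grep -i kubelet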

If the Kubernetes components were not started by systemd, use lsof against
the kubelet process to see where its logs are being written:

pgrep -i kubelet | xargs -I {} lsof -p {}

In some Kubernetes distributions, the Kubernetes components run as Docker
containers. If they are not running under systemd, use docker ps or crictl ps
to check whether this is the case.

Some Kubernetes distributions also configure the Kubernetes components to send
their logs to syslog, which is usually written to /var/log/messages.
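A quick sketch of those checks, assuming docker, crictl, and a syslog file are present on the node:

docker ps | grep -E 'kube|etcd'       # components running as Docker containers
crictl ps | grep -E 'kube|etcd'       # components running under a CRI runtime such as containerd
grep -i kubelet /var/log/messages     # components logging to syslog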

Different Ways to Get on a Node

For readers unfamiliar with how to access nodes, this section describes three methods
of getting onto a node.

  1. Use SSH.

    SSH access is performed via SSH keys. To access the nodes via SSH keys,
    run the following command:

    ssh -i <certificate.pem> <username>@<ip/fqdn of node>
    

    If only password access is set up, you can SSH with the following command:

    ssh <username>@<ip/fqdn of node>
    

    and enter the password when prompted.

  2. Accessing the nodes via kubectl exec.

    If RBAC and pod security policies permit hostPath mounts, it is possible to
    mount a node’s log directory as a hostPath volume into a pod, and inspect those logs after
    kubectl exec‘ing into a container. The following YAML creates a
    debugging deployment to help with this:

    kubectl apply -f - <<-EOF
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: debugging
      labels:
        app: debugging
    spec:
      selector:
        matchLabels:
          app: debugging
      template:
        metadata:
          labels:
            app: debugging
        spec:
          containers:
          - name: debugging
            image: nicolaka/netshoot
            # keep the container running so it can be exec'd into
            command: ["sleep", "86400"]
            volumeMounts:
            - mountPath: /node-logs
              name: node-logs
          volumes:
          - name: node-logs
            hostPath:
              path: /var/log
              type: Directory
    EOF
    
  3. Debugging nodes using ephemeral debug containers.

    In addition to the other use cases of debug containers, it is also possible
    to create a debugging pod on a target node of interest
    with the node’s filesystem mounted at /host. This
    opens access to the node’s host filesystem, its network, and its process
    namespace. Run it with an interactive shell using the following
    command:

    kubectl debug node/<name-of-node> -it --image=nicolaka/netshoot
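
    From inside this debug pod, the node’s filesystem is available under /host. For
    example, you can read node logs directly, or switch into the host root if it
    provides a usable shell:

    ls /host/var/log
    chroot /host

    When finished, remember to delete the debug pod, for example with
    kubectl delete pod <name-of-debug-pod>.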