The previous section on Kubernetes debugging workflow and techniques
established a general workflow and a set of debugging commands that resolve
most direct issues, including mistyped text that causes CrashLoopBackOff errors.
That workflow also provides a glimpse of the cause of an issue. At best, you
obtain an error message or code that you can search for in documentation or on
the internet. On average, you find several avenues where the cause of the
problem may lie. At worst, no apparent leads are available. Can we do better?
Can we limit the number of places to search, or is there a way to
short-circuit towards understanding the cause of a problem more quickly?
What You Will Learn
In this section, you will learn:
- The top six common sources of problems.
- How to use a heuristic approach to debugging.
Top Six Common Sources of Problems
The following are the top six sources of problems in the field:
-
DNS: The Domain Name System (DNS) resolves hostnames to IP addresses. A common issue is an application that cannot reach the DNS server, or a DNS server that does not contain the record. There are also less obvious issues seen in the field, for example, a bug in a base OS image that does not allow DNS over IPv4 but does allow it over IPv6.
-
Certificates: A simple idea built from highly complex components. Varying
implementations and inconsistent nomenclature make the concept difficult and
confusing to understand.
It is highly recommended that you have a good understanding of how
certificates work. The key concept is the formation of a valid chain of trust
that starts from the server/leaf certificate, continues through one or more
intermediate certificates, and ends at a root CA certificate. It is through
this established trust that an encrypted communication session can begin
between a host and a server. This Kaspersky article covers most of the basics
described above in detail. For readers who want an interactive example, use
dawu415’s certificate tool to learn about certificates.
The following is an example of directly querying www.google.com and building
a trust chain from the perspective of the host computer.
Note that the certificate tree goes from the server/leaf certificate (top) to
the pre-installed root certificate in the host computer’s system trust store
(bottom), where the Subject and Issuer are identical.
./cert info --host www.google.com 443
---------------------------------------------------------------------
Details of www.google.com:443
---------------------------------------------------------------------
Type: Server Certificate
Subject: CN=www.google.com,O=Google LLC,L=Mountain View,ST=California,C=US
Issuer: CN=GTS CA 1O1,O=Google Trust Services,C=US
CN: www.google.com
SANS: www.google.com
Trust Chain:
.
└── www.google.com:443
    ├── Subject: www.google.com
    ├── Issuer: GTS CA 1O1
    ├───┐
        └── www.google.com:443
            ├── Subject: GTS CA 1O1
            ├── Issuer: GlobalSign
            ├───┐
                └── System Trust Store: GlobalSign
                    ├── Subject: GlobalSign
                    └── Issuer: GlobalSign
-
Routing and Proxy: This issue occurs when network packets are not routed
to where they are expected, or when they fail to go through a proxy that they
need in order to reach their intended destination. This ultimately leads to
inaccessible route errors. Another type of routing issue occurs when packets
take a much longer, or incorrect, route than necessary to reach their
destination, which results in high latency.
-
Firewall: Firewall rules can block access to resources at Layer 4 and at Layer 7. A more complex example
is a firewall that allows access to a resource at Layer 4 (TCP) but blocks access to the same
resource at Layer 7 (HTTP).
-
Configuration Error: This is usually caused by typing incorrect text that may
include non-obvious characters such as spaces and tabs.
-
Bugs: There is always a chance that the problem is a software bug
and that you will need assistance. In many instances, however, one of the
other sources of problems in this list is more likely to be the cause of an issue
than a software bug.
The Heuristic Approach to Debugging
This section describes how to use the heuristic approach to debug the top six common sources of problems (see the previous section, Top Six Common Sources of Problems). The heuristic approach is an effective, efficient way for you to identify the cause of a problem.
If the initial debugging workflow yields no results, or points to
multiple possibilities, you can start to reason about where to begin and use
the six sources as heuristics to help pinpoint the cause of your problem, for example, by checking DNS first.
The next part of this learning path dives deeper into refining these heuristics
through context test debugging.
Using heuristics requires validation tests and tools. The following are some common tests and tools for each source:
-
DNS: The command
nc -vz <ip-of-dns> 53
can be used to check connectivity to a DNS server. Alternatively,
you can use
ping <fqdn>
to check whether a hostname can be resolved. The resolved address is usually
shown on the first line of the ping output. You can also use similar tools,
such as dig, nslookup, or host, to test access to DNS or to see whether the
DNS server can resolve a hostname. Be aware that cached DNS results can be stale.
While it is not possible to change the caching behavior of a DNS server, you
can work around it with
dig example.com +trace
This forces a trace of the name resolution all the way through to an
authoritative DNS server.
While the above serves as a general recommendation for validating DNS, where
most common issues lie, another aspect to consider is name resolution of
services and pods within a Kubernetes cluster via CoreDNS. First, check whether
the CoreDNS pods are running in the cluster. If they are not, follow the general
workflow and techniques in the previous section to see what is preventing
the CoreDNS pods from starting. Once they are running, access a container to
test in-cluster DNS resolution. See “Accessing containers”
for more details on how to get onto a container. You can also read this
Kubernetes document for tips on how to debug in-cluster DNS resolution.
-
Certificates: While the details are implementation specific, the root CA
certificate is generally installed together with the intermediate certificate
in the trust store of a node for full trust to be established. Note that some
components require a specific certificate order, for example, the intermediate
certificate first in a file, followed by the root certificate, or vice versa.
See the component documentation for more details. In some systems, having only
the intermediate certificate in the trust store is sufficient because it
implies that the server already trusts the root CA. Refer to the documentation
on how to install the CAs to be trusted on the target hosts.
There are tools that help determine whether there is a trust chain,
for example, OpenSSL:
openssl s_client -connect <fqdn/ip>:<port>
This command retrieves the server’s certificates and displays them. Add the
-showcerts option to print every certificate that the server sends. The openssl
tool can also check certificate trust:
openssl verify -CAfile rootcert.pem -untrusted intermediateCert.pem servercert.pem
Alternative tools, such as dawu415’s cert tool,
retrieve certificates from hosts and make it easier to check the trust chain from a set of
certificates maintained in a file, or from a server.
A secondary issue with certificates is the line breaks in PEM-encoded certificates. There should only
be \n (linefeed, LF) and not \r\n (carriage return + linefeed, CRLF), as
is usually inserted when working on Windows machines. Text editors like Visual Studio
Code can convert CRLF to LF. In some situations, applications require
a single-line PEM with escaped line breaks, that is, each ‘invisible’ LF character
is converted to the two-character sequence \n. To do this, enter the following command
in Linux:
cat certificates.pem | awk -v ORS='\\n' '1' | tr -d '\r'
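The line-ending checks described above can be sketched as two small shell functions: one that detects CRLF line endings, and one that produces the single-line form with escaped line breaks. The function names has_crlf and pem_to_single_line are hypothetical, and the sample file is a placeholder rather than a real certificate.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; the names are not part of any tool.

# has_crlf FILE: succeeds (exit 0) if FILE contains a carriage return.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# pem_to_single_line FILE: strip CR characters, then join all lines with a
# literal backslash-n pair so that the PEM fits on a single line.
pem_to_single_line() {
    tr -d '\r' < "$1" | awk -v ORS='\\n' '1'
}

# Placeholder file with Windows (CRLF) line endings; not a real certificate.
demo_pem=$(mktemp)
printf '%s\r\n' '-----BEGIN CERTIFICATE-----' 'AAAA' '-----END CERTIFICATE-----' > "$demo_pem"

if has_crlf "$demo_pem"; then
    echo "CRLF line endings found in $demo_pem"
fi

# Print the single-line form, followed by a real newline for readability.
pem_to_single_line "$demo_pem"
echo
```

Running has_crlf before handing a PEM file to an application is a cheap way to rule out this class of problem early.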
Some applications must be restarted before they recognize a newly installed CA certificate.
For example, the docker CLI throws the error
x509: certificate signed by unknown authority
when connecting to an internal container registry whose certificate is signed by an internal CA that is not trusted on
the user’s machine. After you install the CA certificate on the host that runs the docker daemon, the daemon service must be restarted so that it can pick up the certificate.
To restart the docker daemon, enter the following command:
sudo systemctl restart docker
If you are on macOS or Windows, restart the service from the graphical user interface.
Some proxies return zero bytes for SSL connections made to IP
addresses or DNS records that are not in their main list. This usually manifests as
an SSL_SYSCALL_ERROR in the browser, or as an SSL handshake failure in cURL. Use
openssl s_client
as documented above to debug.
-
Routing and Proxy: To see if a server can be accessed, use netcat:
nc -vz <ip/fqdn> <port>
Or, if nc is not available:
curl -kL telnet://<ip/fqdn>:<port> -vvv
Either command performs a Layer 4 check that tests connectivity.
Proxy servers are another aspect to consider. Check the HTTP_PROXY and
HTTPS_PROXY environment variables to verify whether proxy servers are set, or
need to be set. In some instances, a complex configuration requires the
NO_PROXY environment variable to be set for hosts that should not go through
a proxy. In Linux, you can check for these environment variables by
entering the following command:
env | grep -i proxy
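As a minimal sketch, the check above can be made more explicit by reporting each proxy variable individually. The function name report_proxy_vars is hypothetical. The lowercase variable names are included on the assumption that the tools in use also read them, as curl does.

```shell
#!/bin/sh
# report_proxy_vars: print each common proxy variable and whether it has a value.
# The function name is hypothetical and only used for illustration.
report_proxy_vars() {
    for name in HTTP_PROXY HTTPS_PROXY NO_PROXY http_proxy https_proxy no_proxy; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name-}"
        if [ -n "$value" ]; then
            printf '%s=%s\n' "$name" "$value"
        else
            printf '%s is not set or is empty\n' "$name"
        fi
    done
}

report_proxy_vars
```

Unlike a plain grep, this output also makes missing variables visible, which helps when a host needs NO_PROXY but does not have it.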
For incorrect routing that causes high latency, use load-testing
tools, or simple checks such as curl’s write-out timing variables,
to measure access time.
These provide useful information such as the total time and the hostname
resolution time. The following is an example of usage:
curl -s -w 'Total time: %{time_total}s\n' https://tanzu.vmware.com
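Building on the curl command above, the following sketch prints several of curl’s write-out timing variables at once and compares a measured time against a baseline that you supply. The function names time_request and exceeds_baseline, the factor of 3, and the sample numbers are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; only the curl options and the
# write-out variables themselves are curl features.

# time_request URL: discard the response body and print a timing breakdown.
# Each value is the cumulative number of seconds since the request started.
time_request() {
    curl -s -o /dev/null \
        -w 'DNS lookup:    %{time_namelookup}s\nTCP connect:   %{time_connect}s\nTLS handshake: %{time_appconnect}s\nFirst byte:    %{time_starttransfer}s\nTotal time:    %{time_total}s\n' \
        "$1"
}

# exceeds_baseline MEASURED BASELINE FACTOR: succeeds (exit 0) when MEASURED
# seconds is greater than BASELINE seconds multiplied by FACTOR.
exceeds_baseline() {
    awk -v m="$1" -v b="$2" -v f="$3" 'BEGIN { exit !(m > b * f) }'
}

# Example comparison with made-up numbers: 1.9 s measured, 0.3 s baseline.
if exceeds_baseline 1.9 0.3 3; then
    echo "slower than 3x the baseline, worth investigating"
fi
```

To measure a real endpoint, call time_request with a URL, for example time_request https://tanzu.vmware.com. A large DNS lookup value points towards name resolution, while a large gap between the connect time and the total time points further along the path.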
You need to compare the timing results to a gold standard, or be
able to reason about why a request is taking a long time. For example, if loading a page
typically takes 300 milliseconds but now takes 5 minutes, there is likely an issue.
Finally, you can print the system’s route table if you suspect a
route misconfiguration on a host. The command depends on the operating system that you are using to troubleshoot the issue:
- macOS: netstat -rn (use netstat -rnf inet to see only IPv4 routes)
- Linux: route -n
- Windows: route print
You can also test routing by using a traceroute tool to see which gateway
serves the initial hop:
traceroute <ip_address>
On Windows, the command is tracert. Note that tracert sends ICMP echo requests,
while traceroute on Linux and macOS typically sends UDP probes by default.
Consult the man page for your version of traceroute to find the options
that select ICMP-, TCP-, or UDP-based probes, and ensure that the packets you
rely on, including the ICMP replies, are allowed through.
-
Firewall: The same debugging tools for routing can be used for firewalls.
To check Layer 7, run:
curl -kL <fqdn/ip>
Note that -k only skips SSL certificate validation, which is useful when the
server uses an untrusted or self-signed certificate. It has no effect when
HTTPS is not used. Be aware that Layer 7 HTTP firewall rules may be in place
even when Layer 4 (TCP/IP and UDP) is not blocked, so a netcat check can
succeed while the HTTP request fails. This pattern can indicate that there is a firewall in place.
tcpdump is a tool that you can use to inspect packets on the host machine, including TCP packets
for an RST or a Reject on ACK. A reference for tcpdump can be found here:
https://gist.github.com/jforge/27962c52223ea9b8003b22b8189d93fb
-
Configuration Error: Be aware of applications that introduce
unwanted characters such as tabs, spaces, and newlines.
For example, Kubernetes secrets require base64 input. This can conveniently
be produced with the command
echo 'xxx' | base64
However, echo also appends a newline character to the string xxx.
To avoid this, use
echo -n 'xxx' | base64
which omits the extra newline character after the xxx string.
For a generic configuration, use a text editor or raw character output
to check the ASCII characters. Text editors such as Vim and Visual Studio Code can
assist with viewing these characters.
For small strings, the Linux command
echo 'xxx' | od -t x1
displays the hex output of the echoed string. You can then use an
ASCII table to translate the characters, which in this situation are
10 decimal (0x0A hex) and 13 decimal (0x0D hex) for the LF and CR characters, respectively.
If the above fails, revert to the last known working configuration or
known defaults, then incrementally test configuration changes going forward.
-
Bugs: Be cognizant of bugs, but do not be quick to reach a conclusion
without first checking release notes, having discussions with authors and teams, and having concrete findings documented in logs. If possible, look through the
source code so that you can reference specific lines in your discussions, or build tests that validate your assumptions.
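Returning to the trailing-newline pitfall described under Configuration Error, the following sketch encodes the same value with and without the newline and prints the raw bytes. It uses printf '%s' as a portable stand-in for echo -n, which is an assumption about your shell rather than a requirement. The variable names are illustrative.

```shell
#!/bin/sh
# Encode the same value with and without a trailing newline.
with_newline=$(echo 'xxx' | base64)              # echo appends LF (0x0a)
without_newline=$(printf '%s' 'xxx' | base64)    # no LF is appended

echo "with newline:    $with_newline"       # eHh4Cg==
echo "without newline: $without_newline"    # eHh4

# Show the raw bytes that were actually encoded; the first line ends in 0a.
echo 'xxx' | od -An -t x1
printf '%s' 'xxx' | od -An -t x1
```

If a secret decodes to a value that ends in 0a, the newline was encoded along with it, which is a common reason for credentials that look correct but fail to authenticate.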