One of the most powerful features of VMware Enterprise PKS is its ability to manage desired state for Kubernetes clusters. This capability is provided in part by BOSH. For example, consider the following cluster with three worker nodes deployed in VMware Enterprise PKS:

On this Kubernetes cluster, I have deployed a simple web application with four pods and an external load balancer:


What is the impact of a worker node failure?

A worker node is marked as Condition Unknown once the master can no longer reach the node’s kubelet agent. If the node remains in Condition Unknown, Kubernetes evicts the node's pods and recreates them on other nodes where possible. At this point, Kubernetes has restored the containerized applications to service, but we are left with a worker node in Condition Unknown.
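The sequence above can be pictured as a simple timeline. The following is a minimal sketch, not the Kubernetes implementation: the class and function names are hypothetical, and the 40-second and 300-second values are placeholder assumptions rather than settings read from this cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeTimings:
    # Assumed values in seconds; real clusters may be configured differently.
    grace_period: float = 40.0       # silence tolerated before the node becomes Unknown
    eviction_timeout: float = 300.0  # time spent Unknown before pods are evicted


def node_state(seconds_since_heartbeat: float,
               timings: NodeTimings = NodeTimings()) -> str:
    """Return the illustrative state of a node given how long it has been silent."""
    if seconds_since_heartbeat < timings.grace_period:
        return "Ready"
    if seconds_since_heartbeat < timings.grace_period + timings.eviction_timeout:
        return "Unknown"
    return "Unknown, pods evicted"


if __name__ == "__main__":
    for elapsed in (10, 60, 400):
        print(elapsed, node_state(elapsed))
```

Note that even in the final state the node itself is still present and still Unknown. Eviction moves the workload, but it does not repair the node.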

The desired state of the Kubernetes cluster is to have three worker nodes, and VMware Enterprise PKS restores that desired state when it is violated. To demonstrate this behavior, we will simulate a failure by powering off a worker node.

Simulating a node failure

We can identify worker nodes in VMware vSphere by reviewing the custom attributes and looking for worker in the job field:
worker node in VMware vSphere
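If the VM inventory has been exported to a script-friendly form, the same lookup can be done in code. This is only a sketch under assumptions: the record layout (a `name` key plus a `custom_attributes` dictionary) and the VM names are hypothetical. Only the `job` field with the value `worker` comes from what vSphere shows above.

```python
from typing import Dict, List


def find_worker_vms(vms: List[Dict]) -> List[str]:
    """Return the names of VMs whose 'job' custom attribute is 'worker'."""
    return [
        vm["name"]
        for vm in vms
        if vm.get("custom_attributes", {}).get("job") == "worker"
    ]


if __name__ == "__main__":
    # Hypothetical inventory records, for illustration only.
    inventory = [
        {"name": "vm-example-1", "custom_attributes": {"job": "master"}},
        {"name": "vm-example-2", "custom_attributes": {"job": "worker"}},
        {"name": "vm-example-3", "custom_attributes": {}},
    ]
    print(find_worker_vms(inventory))
```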
Powering off this worker node will produce a warning from vSphere that it is managed by BOSH:
Ignoring the well-placed warning, we power off the machine. Soon afterward, the worker node is marked as NotReady or Condition Unknown in Kubernetes:
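To check for this state from a script rather than by eye, one option is to inspect the `Ready` condition in the JSON that `kubectl get nodes -o json` returns. The sketch below assumes that output has already been loaded into a dictionary. The function names and the sample node names are hypothetical.

```python
from typing import Dict, List


def ready_status(node: Dict) -> str:
    """Return the status of the node's Ready condition: 'True', 'False' or 'Unknown'."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status", "Unknown")
    # No Ready condition reported at all: treat the node as Unknown.
    return "Unknown"


def unhealthy_nodes(node_list: Dict) -> List[str]:
    """Return names of nodes whose Ready condition is anything other than 'True'."""
    return [
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if ready_status(node) != "True"
    ]


if __name__ == "__main__":
    # Hypothetical, trimmed-down node list for illustration.
    sample = {
        "items": [
            {"metadata": {"name": "node-a"},
             "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
            {"metadata": {"name": "node-b"},
             "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
        ]
    }
    print(unhealthy_nodes(sample))
```

A node whose `Ready` condition is `False` or `Unknown` is the one `kubectl get nodes` lists as NotReady.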

Once the node has stayed unreachable long enough for its pods to be evicted, Kubernetes recreates the failed pods on another node (as shown by the difference in pod age):

VMware Enterprise PKS enforces the desired state of the cluster by replacing the powered-off node with a newly deployed node and removing the failed node. In a traditional Kubernetes environment, replacing a failed node is a manual process. This high degree of automation in day-two operations is part of what allows VMware Enterprise PKS to operate at scale.
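Conceptually, desired-state enforcement compares what should exist with what is actually healthy, then plans the difference. The following is a rough sketch of that idea only. It is not how BOSH or VMware Enterprise PKS is implemented, and the function name, action labels, and VM names are all hypothetical.

```python
from typing import Dict, List, Tuple


def plan_repairs(desired_workers: int,
                 workers: Dict[str, bool]) -> List[Tuple[str, str]]:
    """Plan actions to restore the desired worker count.

    `workers` maps a VM name to True if the VM is responsive, False otherwise.
    """
    failed = sorted(name for name, responsive in workers.items() if not responsive)
    healthy_count = len(workers) - len(failed)

    actions: List[Tuple[str, str]] = []
    # Deploy replacements first so the cluster returns to the desired size...
    for index in range(max(0, desired_workers - healthy_count)):
        actions.append(("create", f"replacement-{index}"))
    # ...then remove the VMs that failed.
    for name in failed:
        actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    observed = {"worker-a": True, "worker-b": False, "worker-c": True}
    print(plan_repairs(3, observed))
```

For the three-node cluster in this walkthrough, one powered-off worker yields one create action and one delete action, which matches the replacement described above.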