
Kubernetes troubleshooting guide

kubectl

The kubectl command isn't found

Make sure you are on the Deployment Unit (DU) or the master node of the cluster. The kubectl utility is not available on worker nodes.

If you are on the DU and kubectl is still not available, make sure kubectl is on your $PATH.
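For example, if the binary lives under /usr/local/bin (the path here is an assumption; adjust it to your installation):

export PATH=$PATH:/usr/local/bin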

WARNING: Kubernetes configuration file is group/world-readable

To remove the group-readable permission:

chmod g-r ~/.kube/config

To remove the world-readable permission:

chmod o-r ~/.kube/config
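Or remove both at once:

chmod go-r ~/.kube/config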

Refer: https://github.com/helm/helm/issues/9115

error: You must be logged in to the server

[root@master0 ~]# kubectl get nodes
error: You must be logged in to the server (the server has asked for the client to provide credentials)

This can happen even though the kubeconfig is available.

It means kubectl is unable to access valid credentials. A workaround is to pass a token explicitly:

$ kubectl --token=<token> get nodes

Refer: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#option-2-use-the-token-option
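One way to obtain a token is to read it from an existing service-account secret using a kubeconfig that still works; the kubeconfig path and secret name below are illustrative assumptions:

TOKEN=$(kubectl --kubeconfig /etc/kubernetes/admin.conf -n kube-system get secret <sa-secret-name> -o jsonpath='{.data.token}' | base64 -d)
kubectl --token="$TOKEN" get nodes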

Pod stuck in pending state

A known cause is a race condition in ip-reconciler with Whereabouts that leads to IP cleanup and duplicate IPs: ip-reconciler race condition with Whereabouts, leads to IP cleanup and duplicate IPs · Issue #162 · k8snetworkplumbingwg/whereabouts (github.com)

Steps to troubleshoot the Pending state: https://containersolutions.github.io/runbooks/posts/kubernetes/pod-stuck-in-pending-status/
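A quick first check is to describe the pod and read the Events section, which usually states why scheduling failed:

kubectl describe pod <podname> -n <namespace>

Look for events at the bottom of the output, for example: 0/3 nodes are available: 3 Insufficient cpu.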

Deployed Workloads

CrashLoopBackOff

A container is repeatedly crashing and restarting. There are multiple possible causes for this error; check the pod logs for clues:

kubectl logs <podname>

<podname> is the problematic pod. If a previous instance of the container exists, you can pass the -p flag to get its logs as well.
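For example, to fetch the logs of the previous container instance:

kubectl logs -p <podname>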

Crashed containers are restarted with an exponential back-off delay (10s, 20s, 40s, and so on), capped at five minutes; the back-off resets after a container runs for 10 minutes without crashing.

ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy

Check "Exit Code" of the crashed container

  1. Describe the problematic pod:

    kubectl describe pod <podname>

    replacing <podname> with the pod's name.

  2. In the output, check the Last State: Exit Code field under Containers: <container_name>:

    1. If the exit code is 1, the container crashed because the application itself crashed.

    2. If the exit code is 0, the application exited normally; verify how long it ran.

A container exits when its application's main process exits. If the application finishes quickly, the container is restarted under the default restart policy, which produces a crash loop even with exit code 0.
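The exit code can also be read directly with jsonpath (the field path follows the standard Pod status API):

kubectl get pod <podname> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'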

Example if the exit code is 0

STEP 1: Identify the problem

A pod in CrashLoopBackOff status (it may show Running between restarts):

[root@control-plane ~]# kubectl get pods -A | grep app
app-ns app-pods-dxaifa                                      3/3     Running            0          9h

STEP 2: Gather information

Gather information from the pod description and logs: the exit code, IPs, missing files, any error messages, and the stated reason.

Containers:
  upf:
    Container ID:  docker://89faf;dakj;safjoiwqreqwrwqaaaafafrb7c826bee916de4d5eb
    Image:         docker..-distroless
    Image ID:      docker-pullable://example.com/abc/afdsafafsafs@sha512:afsadfdwwqrewqrqfasdsgasgasdfghjkl;wertyuio
    Port:          9013/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
    Args:
      -c
      ip6tables -F;;/bin/bash
  
    Restart Count:  137

Here the container's command ends in /bin/bash with no input attached, so the final shell exits as soon as ip6tables -F completes; the container terminates with exit code 0 and is restarted, which explains the high restart count.

Example if the exit code is other than 0 or 1

STEP 1: Identify the problem

Exit code 137 means the container was killed with SIGKILL (128 + 9), most commonly because it ran out of memory (OOMKilled).

A pod in CrashLoopBackOff status

[root@control-plane ~]# kubectl get pods -A | grep app
NAME                       READY   STATUS             RESTARTS   AGE   IP              NODE          NOMINATED NODE   READINESS GATES
app-pods-afdaba   2/3     CrashLoopBackOff   27         97m   10.20.30.40   10.39.12.56   <none>           <none>

The pod returns to Running state between restarts:

[root@leoc2-mwp-21-3-1-master0 ~]# upf1
NAME                       READY   STATUS      RESTARTS   AGE    IP              NODE          NOMINATED NODE   READINESS GATES
app-pods-afdaba   2/3     Running   27         97m   10.20.30.40   10.39.12.56   <none>           <none>

STEP 2: Gather information

Gather information from kubectl describe and the container logs: the exit code, IPs, missing files, any error messages, and the stated reason (for example, OOMKilled).
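For an OOMKilled container, the termination reason can be read directly from the Pod status (the field path follows the standard Pod status API):

kubectl get pod <podname> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'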


STEP 3: Problem analysis

In this case the container logs pointed to JSON file validation: some entries in the JSON file had no values, so validation failed.


STEP 4: Resolution and Root cause

Based on the findings above, agree on a resolution and the probable root cause.

Connect to a running container

Shell into the pod:

kubectl exec -it <podname> -- /bin/bash

If the pod has multiple containers, use -c <container_name> to target a specific container.

You now have a shell inside the pod's container, where you can check network connectivity, file access, databases, and so on.
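If the image does not ship bash, try sh instead, and name the container when there is more than one:

kubectl exec -it <podname> -c <container_name> -- /bin/sh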

ImagePullBackOff and ErrImagePull

The container image cannot be pulled from the image registry.

If the image is not found

Make sure to check the following:

      1. Verify the image name

      2. Verify the image tag (:latest or no tag pulls the latest image, and old tags may no longer be available).

      3. Make sure the image name includes the full registry path. Also check for inherited Docker Hub (Artifactory or Harbor) registry links.

      4. Try to pull the Docker image from a terminal:

        1. SSH into the node (master or worker); generally, ssh root@10.69.a.b works from both PowerShell and bash.

        2. Run docker pull <image name>. For example: docker pull docker.io/nfvpe/sriov-device-plugin:latest

      5. If the manual pull works, add imagePullSecrets to the Pod specification (see https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod and the sketch after this list). Note that image pull secrets are scoped to a single namespace.
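A minimal sketch of wiring up a pull secret; the secret name regcred and all registry values are placeholders:

kubectl create secret docker-registry regcred --docker-server=<registry-url> --docker-username=<username> --docker-password=<password> -n <namespace>

Then reference the secret from the Pod specification:

spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: <registry-url>/<image>:<tag>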

Permission denied error

If the error is "permission denied" or "no pull access", verify that you have access to the image.

In either case, check whether you can download the artifact directly, for example:

wget --user=<username> --password=<password> <artifact-url>

ref: https://kubernetes.io/docs/concepts/containers/images/#using-a-private-registry

You need to make sure that the group (organizational team) you are in has access to the registry.

Pod unschedulable

The Pod cannot be scheduled due to insufficient resources or configuration errors.

Insufficient resources

Error messages look like the following:

  • No nodes are available that match all of the predicates: Insufficient cpu (2), which means that on two of the nodes there isn't enough CPU available to fulfill the pod's requests.

The default CPU request is 100m (10% of a CPU). The spec: containers: resources: requests value can be updated as required.
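For example, the request can be set explicitly in the Pod specification (the values below are illustrative):

spec:
  containers:
    - name: app
      resources:
        requests:
          cpu: 250m
          memory: 128Mi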

Note: the system containers in the kube-system namespace also consume cluster resources.

ref: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu

MatchNodeSelector

MatchNodeSelector means that there are no nodes that match the Pod's label selector.

Check the labels in the Pod specification's nodeSelector field:

spec:
  nodeSelector:
    <label_name>: <label_value>

List the node labels:

kubectl get nodes --show-labels

Add a label to a node:

kubectl label nodes <node_name> <label_name>=<label_value>
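For example, with a hypothetical disktype label on a node named worker-1:

kubectl label nodes worker-1 disktype=ssd

and in the Pod specification:

spec:
  nodeSelector:
    disktype: ssd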

ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

PodToleratesNodeTaints

PodToleratesNodeTaints means that the Pod cannot be scheduled onto any node because it does not tolerate the taints on any node.

1. You can patch the node like this

kubectl patch node 10.ab.cd.efg -p '{"spec":{"unschedulable":false}}'

For example, kubectl patch node 10.20.30.123 -p '{"spec":{"unschedulable":false}}'

2. Or you can remove the taint like this:

kubectl taint nodes <node_name> key:NoSchedule-
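Alternatively, give the Pod a toleration that matches the node's taint; the key and effect below are placeholders:

spec:
  tolerations:
    - key: "<key>"
      operator: "Exists"
      effect: "NoSchedule"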

ref:

1. Worker nodes have status of Ready,SchedulingDisabled · Issue #3713 · kubernetes/autoscaler

2. https://cloud.google.com/kubernetes-engine/docs/how-to/node-taints

3. https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/

PodFitsHostPorts

The host port requested by the Pod is already in use on the node. Change the hostPort in the Pod specification:

spec:
  containers:
    - ports:
        - hostPort: <port>

Does not have minimum availability

If there is no availability even though a node has sufficient resources, the node might be in SchedulingDisabled (cordoned) status.

List the nodes to see their status:

kubectl get nodes

If scheduling is disabled, try uncordoning the node:

kubectl uncordon <node_name>

pod has unbound immediate persistentvolumeclaims · Issue #237 · hashicorp/consul-helm (github.com)

Check that the PVCs are in Bound state:
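For example:

kubectl get pvc -A

The STATUS column should show Bound; Pending indicates an unbound claim.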

Pods stuck in Terminating state
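A common cause is a finalizer or volume that cannot be cleaned up. As a last resort, the pod can be force-deleted; this skips graceful shutdown, so use it with care:

kubectl delete pod <podname> -n <namespace> --grace-period=0 --force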

kubelet Unable to attach or mount volumes

kubelet Unable to attach or mount volumes: unmounted volumes=[config-volume], unattached volumes=[sriov-device-plugin-token-ts8p5 devicesock log config-volume device-info]: timed out waiting for the condition

Solution: do not mount a single PVC on the same pod twice. https://stackoverflow.com/q/69544012
