
Kubernetes troubleshooting guide

kubectl

The kubectl command isn't found

Make sure you are on the Deployment Unit (DU) or the master node of the cluster; the kubectl utility is not available on worker nodes.
If you are on the DU and kubectl is still not found, make sure kubectl is added to $PATH.

WARNING: Kubernetes configuration file is group/world-readable

To remove the group-readable permission:
chmod g-r ~/.kube/config
To remove the world-readable permission:
chmod o-r ~/.kube/config
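Both permissions can also be removed in one step, equivalent to the two commands above:
chmod go-r ~/.kube/config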

error: You must be logged in to the server

[root@master0 ~]# kubectl get nodes
error: You must be logged in to the server (the server has asked for the client to provide credentials)
This can happen even though the kubeconfig file is available.
It means that kubectl is unable to access the credentials. A workaround is:
$ kubectl --token=<token> get nodes
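Another workaround, assuming an admin kubeconfig exists on the node (the path below is typical for kubeadm-based clusters and may differ in your environment), is to point kubectl at it explicitly:
kubectl --kubeconfig=/etc/kubernetes/admin.conf get nodes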

Pod stuck in pending state

Deployed Workloads

CrashLoopBackOff

A container is repeatedly crashing after restarts. There can be multiple reasons for this error; check the pod logs for additional clues.
kubectl logs <podname>
<podname> is the problematic pod. If a previous instance of the container exists, pass the -p flag to see its logs as well.
Crashed containers are restarted with an exponential back-off delay, capped at five minutes.
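For example, to read the logs of a specific container and of its previous instance (the pod, namespace, and container names here are taken from the examples further below; substitute your own):
kubectl logs app-pods-dxaifa -n app-ns -c upf
kubectl logs app-pods-dxaifa -n app-ns -c upf -p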

Check "Exit Code" of the crashed container

  1. Describe the problematic pod:
     kubectl describe pod <podname>
     Replace <podname> with the pod name.
  2. Check the Containers: <container_name>: Last State: Exit Code field in the describe output.
     1. If the exit code is 1, the container crashed because the application crashed.
     2. If the exit code is 0, verify how long the application ran.
A container exits when the application's main process exits. If the application finishes too quickly, the container will keep restarting.
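The exit code can also be read directly with a jsonpath query instead of scanning the describe output (the pod and container names below are placeholders):
kubectl get pod <podname> -o jsonpath='{.status.containerStatuses[?(@.name=="<container_name>")].lastState.terminated.exitCode}'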
Example if the exit code is 0
STEP 1: Identify the problem
A pod in CrashLoopBackOff status
[root@control-plane ~]# kubectl get pods -A | grep app
app-ns app-pods-dxaifa 3/3 Running 0 9h

STEP 2: Gather information
Gather information such as the IP address, missing files, any error messages, and the exit code and reason.
Containers:
  upf:
    Container ID:  docker://89faf;dakj;safjoiwqreqwrwqaaaafafrb7c826bee916de4d5eb
    Image:         docker..-distroless
    Image ID:      docker-pullable://example.com/abc/afdsafafsafs@sha512:afsadfdwwqrewqrqfasdsgasgasdfghjkl;wertyuio
    Port:          9013/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
    Args:
      -c
      ip6tables -F;;/bin/bash
    Restart Count: 137

Example if the exit code is other than 0 or 1

STEP 1: Identify the problem
Exit code 137 means the container was killed with SIGKILL (128 + 9), most commonly because it ran out of memory (OOMKilled).
A pod in CrashLoopBackOff status
[root@control-plane ~]# kubectl get pods -A | grep app
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
app-pods-afdaba 2/3 CrashLoopBackOff 27 97m 10.20.30.40 10.39.12.56 <none> <none>
The pod eventually comes back to the Running state:
[root@leoc2-mwp-21-3-1-master0 ~]# upf1
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
app-pods-afdaba 2/3 Running 27 97m 10.20.30.40 10.39.12.56 <none> <none>

STEP 2: Gather information
Gather information such as the IP address, missing files, any error messages, and the exit code and reason.

STEP 3: Problem analysis
In this example the container error is a JSON file validation failure: some entries in the JSON file have no values, so the validation fails.

STEP 4: Resolution and Root cause
Based on the analysis, agree on the resolution and the likely root cause.
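For exit code 137 in particular, the termination reason recorded in the pod status can confirm an out-of-memory kill. A minimal check (the pod name is taken from the example above; the namespace is a placeholder):
kubectl get pod app-pods-afdaba -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
A value of OOMKilled confirms the container hit its memory limit.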

Connect to a running container

Shell into the pod:
kubectl exec -it <podname> -- /bin/bash
If the pod has multiple containers, use -c <container_name> to target a specific container.
You now have a shell inside the pod's container, where you can check network connectivity, file access, databases, and so on.
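For example, to open a shell in a specific container of a multi-container pod (names are taken from the examples above; note that minimal images may only ship /bin/sh, and distroless images may have no shell at all):
kubectl exec -it app-pods-dxaifa -n app-ns -c upf -- /bin/bash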

ImagePullBackOff and ErrImagePull

The container image cannot be pulled from the image registry.

If the image is not found

Make sure to check the following:
  1. Verify the image name.
  2. Verify the image tag. (:latest or no tag pulls the latest image, and old tags may no longer be available.)
  3. Verify that the image name includes the full registry path. Also check for inherited Docker Hub (Artifactory or Harbor) registry links.
  4. Try to pull the image manually from a terminal:
     1. SSH into the node (master or worker); generally ssh <user>@<node-ip> works from both PowerShell and bash.
     2. Run docker pull <image name>. For example, docker pull docker.io/nfvpe/sriov-device-plugin:latest.
  5. If the manual pull works, add image pull secrets to the pod as described in https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod. Note that image pull secrets apply to a single namespace only; a sketch follows below.
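A minimal sketch of creating a registry pull secret and referencing it from the pod spec (the registry URL, secret name, credentials, and namespace below are placeholders):
kubectl create secret docker-registry my-regcred \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>
Then reference the secret in the pod specification, in the same namespace as the pod:
spec:
  imagePullSecrets:
  - name: my-regcred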

Permission denied error

If the error is "permission denied" or "no pull access", verify that you have access to the image.
In either case, check whether you can download the artifact directly, for example (credentials and URL are placeholders):
wget --user=<username> --password=<password> <artifact-url>
You need to make sure that the group (organizational team) you are in has access to the registry.
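Registry access can also be verified by logging in to the registry directly (the registry URL is a placeholder):
docker login registry.example.com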

Pod unschedulable

The pod cannot be scheduled due to insufficient resources or configuration errors.

Insufficient resources

Error messages look like:
  • No nodes are available that match all of the predicates: Insufficient cpu (2), which means that on the two nodes checked there is not enough CPU available to fulfill the pod's requests.
The default CPU request is 100m (10% of a CPU). The requests can be adjusted in the pod's spec.containers.resources.requests field as needed, as in the sketch below.
Note: The system containers in the kube-system namespace also use cluster resources.
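A minimal example of explicit resource requests in a pod spec (the container name and values are illustrative):
spec:
  containers:
  - name: app
    resources:
      requests:
        cpu: "250m"
        memory: "128Mi"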

MatchNodeSelector

MatchNodeSelector means that no node's labels match the Pod's nodeSelector.
Check the labels in the Pod specification's nodeSelector field:
spec:
  nodeSelector:
See the node labels
kubectl get nodes --show-labels
Add a label to the node as
kubectl label nodes <node_name> <label_name>=<label_value>
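For example (the label key and value here are illustrative):
kubectl label nodes <node_name> disktype=ssd
and in the pod specification:
spec:
  nodeSelector:
    disktype: ssd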

PodToleratesNodeTaints

PodToleratesNodeTaints means that the Pod cannot be scheduled on any node because it does not tolerate the nodes' taints.
1. You can mark the node schedulable again by patching it:
kubectl patch node <node_ip> -p '{"spec":{"unschedulable":false}}'
For example, kubectl patch node 10.20.30.123 -p '{"spec":{"unschedulable":false}}'
2. Or you can remove the taint:
kubectl taint nodes <node_name> <key>:NoSchedule-
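To see which taints are currently set on a node before removing one:
kubectl describe node <node_name> | grep -i taints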

PodFitsHostPorts

The host port requested by the Pod is already in use on the node. Change or remove the hostPort value in the Pod specification:
spec:
  containers:
  - ports:
    - hostPort: <port>
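To find which pods already request a given host port (substitute the port you are trying to use):
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.containers[*].ports[*].hostPort}{"\n"}{end}' | grep <host_port>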

Does not have minimum availability

If there is no availability even though the nodes have sufficient resources, the nodes might be in SchedulingDisabled or Cordoned status.
List the nodes to see their status:
kubectl get nodes
If scheduling is disabled, try uncordoning the node:
kubectl uncordon <node_name>
Also check that the PVCs are in the Bound state, as shown below.
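For example, to list the claims and their status across all namespaces:
kubectl get pvc -A
Claims stuck in Pending instead of Bound typically point to a storage problem rather than a scheduling one.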

Pods stuck in Terminating state

kubelet Unable to attach or mount volumes

kubelet Unable to attach or mount volumes: unmounted volumes=[config-volume], unattached volumes=[sriov-device-plugin-token-ts8p5 devicesock log config-volume device-info]: timed out waiting for the condition
Solution: do not mount a single PVC on the same pod twice (see https://stackoverflow.com/q/69544012); a sketch follows below.
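A minimal sketch of the intended pattern, declaring the PVC-backed volume once and mounting it from the container (the volume name config-volume comes from the error message above; the claim name, container name, and mount path are placeholders):
spec:
  volumes:
  - name: config-volume
    persistentVolumeClaim:
      claimName: <pvc_name>
  containers:
  - name: app
    volumeMounts:
    - name: config-volume
      mountPath: /etc/app/config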
