# Kubernetes troubleshooting guide

## **`kubectl`**

### The **`kubectl`** command isn't found

Make sure you are on the Deployment Unit (DU) or a master node of the cluster. The kubectl utility is not available on worker nodes.

If you are on the DU and kubectl is still not found, make sure the directory containing the kubectl binary is on your `$PATH`.
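For example (the installation directory `/usr/local/bin` below is an assumption; adjust it to where the binary actually lives):

```shell
# See whether the shell can find kubectl at all
command -v kubectl || echo "kubectl not on PATH"

# If the binary exists but its directory is not on PATH, add it for the
# current session (persist it by appending the same line to ~/.bashrc):
export PATH="/usr/local/bin:$PATH"
```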

### WARNING: Kubernetes configuration file is group/world-readable

To remove the **group-readable** permission:

`chmod g-r ~/.kube/config`

To remove the **world-readable** permission:

`chmod o-r ~/.kube/config`

Refer: <https://github.com/helm/helm/issues/9115>

### error: You must be logged in to the server

```
[root@master0 ~]# kubectl get nodes
error: You must be logged in to the server (the server has asked for the client to provide credentials)
```

This can happen even though the kubeconfig file is available: kubectl is unable to access valid credentials. A workaround is to pass a token explicitly:

```
$ kubectl --token=<token> get nodes
```

Refer: <https://kubernetes.io/docs/reference/access-authn-authz/authentication/#option-2-use-the-token-option>

### Pod stuck in pending state

One known cause is an ip-reconciler race condition in Whereabouts that leads to duplicate IPs: [ip-reconciler race condition with Whereabouts, leads to IP cleanup and duplicate IPs · Issue #162 · k8snetworkplumbingwg/whereabouts (github.com)](https://github.com/k8snetworkplumbingwg/whereabouts/issues/162)

Steps to troubleshoot the pending state: <https://containersolutions.github.io/runbooks/posts/kubernetes/pod-stuck-in-pending-status/>

## Deployed Workloads

### **`CrashLoopBackOff`**

The container is crashing repeatedly after restarts. There are multiple possible reasons for this error; check the pod logs for additional clues:

`kubectl logs <podname>`

\<podname> is the problematic pod. If a previous instance of the pod exists, pass the `-p` flag to see its logs as well.

Crashed containers are restarted with an exponential back-off delay that is capped at five minutes.

ref: <https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy>
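The back-off schedule from the restart-policy documentation can be sketched as follows (10s base delay, doubled after each crash, capped at 300s; the back-off resets after the container runs cleanly for 10 minutes):

```shell
# Illustrate the kubelet restart back-off schedule: 10s, 20s, 40s, ...
# doubling after each crash and capped at 300 seconds (5 minutes).
delay=10
for crash in 1 2 3 4 5 6 7; do
  echo "after crash #$crash wait ${delay}s"
  delay=$((delay * 2))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```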

#### Check "Exit Code" of the crashed container

1. Describe the problematic pod as

   `kubectl describe pod <podname>`

   replace \<podname> with pod name.
2. Check the `containers: CONTAINER_NAME: last state: exit code` field
   1. If the exit code is 1, the container crashed because the application itself crashed.
   2. If the exit code is 0, verify how long the application ran.

Containers exit when the application's main process exits. If the application finishes quickly, the container keeps being restarted according to the restart policy, which shows up as a crash loop.

#### **Example if the exit code is `0`**

**STEP 1: Identify the problem**

A pod in `CrashLoopBackOff` status

```
[root@control-plane ~]# kubectl get pods -A | grep app
app-ns app-pods-dxaifa                                      3/3     Running            0          9h
```

***

**STEP 2: Gather information**

Gather information such as IPs, missing files, error messages, exit codes, and reasons:

```
Containers:
  upf:
    Container ID:  docker://89faf;dakj;safjoiwqreqwrwqaaaafafrb7c826bee916de4d5eb
    Image:         docker..-distroless
    Image ID:      docker-pullable://example.com/abc/afdsafafsafs@sha512:afsadfdwwqrewqrqfasdsgasgasdfghjkl;wertyuio
    Port:          9013/TCP
    Host Port:     0/TCP
    Command:
      /bin/bash
    Args:
      -c
      ip6tables -F;;/bin/bash
  
    Restart Count:  137
```

#### **Example if the exit code is other than `0` or `1`**

**STEP 1: Identify the problem**

`Exit Code: 137` means the container was killed by SIGKILL (137 = 128 + 9), most commonly by the kernel's out-of-memory (OOM) killer.
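Exit codes above 128 encode a fatal signal, which can be decoded like this:

```shell
# A container exit code above 128 means "killed by signal", where
# signal = exit_code - 128. For 137 that is 9 (SIGKILL), which the
# OOM killer sends when a container exceeds its memory limit.
exit_code=137
signal=$((exit_code - 128))
echo "exit code $exit_code => killed by signal $signal"
```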

A pod in `CrashLoopBackOff` status

```
[root@control-plane ~]# kubectl get pods -A | grep app
NAME                       READY   STATUS             RESTARTS   AGE   IP              NODE          NOMINATED NODE   READINESS GATES
app-pods-afdaba   2/3     CrashLoopBackOff   27         97m   10.20.30.40   10.39.12.56   <none>           <none>

```

The pod comes back to the `Running` state:

```
[root@leoc2-mwp-21-3-1-master0 ~]# upf1
NAME                       READY   STATUS      RESTARTS   AGE    IP              NODE          NOMINATED NODE   READINESS GATES
app-pods-afdaba   2/3     Running   27         97m   10.20.30.40   10.39.12.56   <none>           <none>
```

***

**STEP 2: Gather information**

Gather information such as IPs, missing files, error messages, exit codes, and reasons.

***

**STEP 3: Problem analysis**

Here the container's error is about JSON file validation: some entries in the JSON file have no values, so validation failed.

***

**STEP 4: Resolution and Root cause**

This is where you agree on the resolution and the likely root cause.

#### Connect to a running container

Shell into the pod:

`kubectl exec -it <podname>  -- /bin/bash`

If there are multiple containers, use `-c <container_name>` to target a specific one.

You now have a shell in the pod's container, where you can check network connectivity, file access, databases, and so on.
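Some typical first checks once inside (which tools exist depends on the image; the `/app` path below is an assumption):

```shell
# Run these inside the container shell opened by `kubectl exec`:
env | sort | head -n 5                  # environment the app actually sees
cat /etc/resolv.conf 2>/dev/null \
  || echo "no resolv.conf"              # DNS configuration inside the pod
ls /app 2>/dev/null \
  || echo "/app not present"            # application files (path assumed)
```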

### **`ImagePullBackOff`** and **`ErrImagePull`**

The container image cannot be pulled from the image registry.

#### If the image is not found

Make sure to check the following:

1. Verify the image name.
2. Verify the image tag (`:latest` or no tag pulls the latest image, and old tags may no longer be available).
3. Make sure the image name includes the full registry path. Also, check for inherited Docker Hub (Artifactory or Harbor) registry links.
4. Try to pull the Docker image from a terminal:
   1. SSH into the node (master or worker)\
      generally, `ssh root@10.69.a.b` works from both PowerShell and bash
   2. Run `docker pull <image name>`.\
      For example, `docker pull docker.io/nfvpe/sriov-device-plugin:latest`.
5. If the manual pull works, add an image pull secret to the Pod: <https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod>. Note that image pull secrets are scoped to a single namespace.
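A sketch of wiring a pull secret into a Pod (the secret name `regcred` and the image path are assumptions); the secret itself can be created with `kubectl create secret docker-registry regcred --docker-server=<registry> --docker-username=<user> --docker-password=<password>` in the Pod's namespace:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: app
      image: example.com/abc/app:1.0   # hypothetical private image
  imagePullSecrets:
    - name: regcred                    # must exist in the same namespace
```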

#### Permission denied error

If the error is "permission denied" or "no pull access", verify that you have access to the image.

In either case, check whether you can download the artifact with `wget` (or a similar tool):

`wget --user=<user> --password=<password> <artifact-url>`

ref: <https://kubernetes.io/docs/concepts/containers/images/#using-a-private-registry>

You need to make sure that the group (organizational team) you are in has access to the registry.

### Pod unschedulable

The Pod cannot be scheduled due to insufficient resources or configuration errors.

#### Insufficient resources

Error messages look like:

* No nodes are available that match all of the predicates: Insufficient cpu (2)\
  which means, on the two nodes there isn't enough CPU available to fulfill the pod's requests

The default CPU request is 100m, i.e. 10% of a CPU. The `spec: containers: resources: requests` field can be updated as required.
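A minimal sketch of an explicit requests/limits block (the values are placeholders, not recommendations):

```yaml
spec:
  containers:
    - name: app            # hypothetical container name
      image: example/app   # hypothetical image
      resources:
        requests:
          cpu: 250m        # scheduler reserves a quarter of a CPU
          memory: 128Mi
        limits:
          cpu: "1"
          memory: 512Mi
```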

Note: the system containers in the `kube-system` namespace also consume cluster resources.

ref: <https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu>

#### MatchNodeSelector

`MatchNodeSelector` means that no nodes match the Pod's label selector.

Check the labels in the Pod specification's `nodeSelector` field:

```
spec:
  nodeSelector:
    <label_name>: <label_value>
```

See the node labels

```
kubectl get nodes --show-labels
```

Add a label to the node as

```
kubectl label nodes <node_name> <label_name>=<label_value>
```
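Putting the two together, a Pod is pinned to labelled nodes like this (the `disktype: ssd` label is a placeholder example):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  nodeSelector:
    disktype: ssd    # must match a label set via `kubectl label nodes`
  containers:
    - name: app
      image: example/app
```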

ref: <https://kubernetes.io/docs/concepts/configuration/assign-pod-node/>

#### PodToleratesNodeTaints

`PodToleratesNodeTaints` means that the Pod cannot be scheduled on any node, because it does not tolerate the nodes' taints.

1\. You can patch the node like this

`kubectl patch node 10.ab.cd.efg -p '{"spec":{"unschedulable":false}}'`

For example, `kubectl patch node 10.20.30.123 -p '{"spec":{"unschedulable":false}}'`&#x20;

2\. or you can remove taint like this

`kubectl taint nodes <node_name> key:NoSchedule-`&#x20;
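A third option is to keep the taint and add a matching toleration to the Pod (key, value, and effect below are placeholders):

```yaml
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "example"
      effect: "NoSchedule"
```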

ref:

1\. [Worker nodes have status of Ready,SchedulingDisabled · Issue #3713 · kubernetes/autoscaler](https://github.com/kubernetes/autoscaler/issues/3713)

2\. <https://cloud.google.com/kubernetes-engine/docs/how-to/node-taints>

3\. <https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/>

#### PodFitsHostPorts

The host port requested by the Pod is already in use. Change the `hostPort` value in the Pod specification.

```
spec:
  containers:
    ports:
      hostPort:
```

#### Does not have minimum availability

If there is no availability even though a node has sufficient resources, the nodes might be in `SchedulingDisabled` or `Cordoned` status.

Get the nodes to see the status

`kubectl get nodes`&#x20;

If scheduling is disabled, try `uncordon`ing the node:

`kubectl uncordon <node_name>`&#x20;

If the event says `pod has unbound immediate PersistentVolumeClaims`, check that the PVCs are in **`Bound`** state.

[pod has unbound immediate persistentvolumeclaims · Issue #237 · hashicorp/consul-helm (github.com)](https://github.com/hashicorp/consul-helm/issues/237)

### Pods stuck in Terminating state
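A pod commonly sticks in `Terminating` when a finalizer never completes or the pod's node has become unreachable. Inspect the pod's finalizers with `kubectl get pod <podname> -o yaml` and, as a last resort, force-delete the pod:

`kubectl delete pod <podname> --grace-period=0 --force`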

### kubelet Unable to attach or mount volumes

```
kubelet Unable to attach or mount volumes: unmounted volumes=[config-volume], unattached volumes=[sriov-device-plugin-token-ts8p5 devicesock log config-volume device-info]: timed out waiting for the condition
```

Solution: do not mount the same PVC on a pod twice. <https://stackoverflow.com/q/69544012>

## References

A few troubleshooting guides:

1. <https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/troubleshooting-kubeadm>
2. <https://cloud.google.com/kubernetes-engine/docs/troubleshooting>
