Kubernetes troubleshooting guide
kubectl
The kubectl command isn't found
Make sure you are on the Deployment Unit (DU) or the master node of the cluster. The kubectl utility is not available on worker nodes.
If you are on the DU and kubectl is still not available, add the directory that contains the kubectl binary to $PATH.
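The check and the $PATH fix can be sketched as two small shell functions. This is a minimal sketch: the function names are made up for illustration, and the directory in the usage comment is an assumption, so substitute the location where kubectl is actually installed.

```shell
# Return success if kubectl can be found on the current PATH.
kubectl_on_path() {
  command -v kubectl >/dev/null 2>&1
}

# Append a directory to PATH only if it is not already listed.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present, nothing to do
    *) PATH="$PATH:$1" ;;
  esac
  export PATH
}

# Example usage (the directory is hypothetical):
#   kubectl_on_path || add_to_path /usr/local/bin
```

To keep the change across sessions, the same PATH line would normally go into the shell's startup file.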
WARNING: Kubernetes configuration file is group/world-readable
To remove the group-readable permission:
chmod g-r ~/.kube/config
To remove the world-readable permission:
chmod o-r ~/.kube/config
ref: https://github.com/helm/helm/issues/9115
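Both permissions can also be removed in one step. The sketch below wraps that in a function; the function name is hypothetical, and the default path assumes the usual ~/.kube/config location.

```shell
# Remove group and world read access from a kubeconfig file in one step.
# Defaults to ~/.kube/config; pass another path as the first argument.
secure_kubeconfig() {
  chmod go-r "${1:-$HOME/.kube/config}"
}

# Example usage:
#   secure_kubeconfig
#   ls -l ~/.kube/config
```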
error: You must be logged in to the server
This can happen even though the configuration is available.
It means that kubectl is unable to access the credentials. A workaround is
Pod stuck in pending state
For steps to troubleshoot the Pending state, see https://containersolutions.github.io/runbooks/posts/kubernetes/pod-stuck-in-pending-status/
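Before working through the runbook, it can help to list the Pending pods and read their events. A minimal sketch, assuming the "default" namespace unless another one is given; the function names are made up for illustration.

```shell
# List pods whose phase is Pending in the given namespace (default: "default").
pending_pods() {
  kubectl get pods -n "${1:-default}" --field-selector=status.phase=Pending
}

# Show the events recorded for one pod; scheduling failures appear here.
# Arguments: pod name, then an optional namespace.
pod_events() {
  kubectl get events -n "${2:-default}" --field-selector "involvedObject.name=$1"
}

# Example usage:
#   pending_pods kube-system
#   pod_events <podname> kube-system
```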
Deployed Workloads
CrashLoopBackOff
A container keeps crashing after each restart. There can be several reasons for this error, so check the pod logs for clues.
kubectl logs <podname>
<podname> is the problematic pod. If a previous instance of the container exists, pass the -p flag to see its logs too.
Crashed containers are restarted with an exponential back-off delay (10s, 20s, 40s, ...) that is capped at 5 minutes.
ref: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
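To get a feel for how quickly the delay grows, the documented pattern (start at 10 seconds, double each time, cap at 300 seconds) can be sketched as a small function. This only illustrates the arithmetic; it is not how the kubelet is implemented, and the function name is made up.

```shell
# Print the back-off delay in seconds before restart number N (N starts at 1).
# Assumes the documented pattern: 10s, 20s, 40s, ... capped at 300s.
backoff_delay() {
  n=$1
  delay=10
  i=1
  while [ "$i" -lt "$n" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}

# Example usage:
#   for n in 1 2 3 4 5 6 7; do backoff_delay "$n"; done
```

Under these assumptions the cap is reached at the sixth restart, so a pod that has been crashing for a while is retried roughly every 5 minutes.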
Check "Exit Code" of the crashed container
Describe the problematic pod:
kubectl describe pod <podname>
Replace <podname> with the pod name.
Check the containers: CONTAINER_NAME: last state: exit code field.
If the exit code is 1, the container crashed because the application crashed.
If the exit code is 0, verify how long the application ran.
Containers exit when the application's main process exits. If the application finishes quickly, the container exits and may then be restarted.
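The exit code can also be read directly instead of scanning the describe output. The sketch below is an illustration with made-up function names; the second function applies the common convention that a code above 128 means the process was terminated by signal (code - 128), which gives only a rough first reading.

```shell
# Print the last recorded exit code of each container in a pod.
last_exit_codes() {
  kubectl get pod "$1" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
}

# Give a rough reading of an exit code.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "main process exited normally"
  elif [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "application exited with error code $code"
  fi
}

# Example usage:
#   last_exit_codes <podname>
#   explain_exit_code 137
```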
Example if the exit code is 0
STEP 1: Identify the problem
A pod is in CrashLoopBackOff status.
STEP 2: Gather information
Gather information such as IP addresses, missing files, error messages, the exit code, and the reason.
Example if the exit code is other than 0 or 1
STEP 1: Identify the problem
Exit Code: 137 for the container means that it was killed with SIGKILL (128 + 9), which usually indicates an out-of-memory kill.
A pod is in CrashLoopBackOff status.
The pod then returns to the Running state.
STEP 2: Gather information
Gather information such as IP addresses, missing files, error messages, the exit code, and the reason.
STEP 3: Problem analysis
In this example, the container error concerns JSON file validation: some entries in the JSON file have no values, so the validation failed.
STEP 4: Resolution and Root cause
Agree on the resolution and the probable root cause.
Connect to a running container
Shell into the pod:
kubectl exec -it <podname> -- /bin/bash
If there are multiple containers, use -c <container_name> for a specific container.
You can now use the bash terminal of the pod's container to check the network, file access, databases, and so on.
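The two variants (with and without a container name) can be combined into one helper. This is a sketch with a made-up function name; it uses /bin/sh instead of /bin/bash on the assumption that some minimal images do not ship bash.

```shell
# Open an interactive shell in a pod; the container name is optional.
pod_shell() {
  pod=$1
  container=${2:-}
  if [ -n "$container" ]; then
    kubectl exec -it "$pod" -c "$container" -- /bin/sh
  else
    kubectl exec -it "$pod" -- /bin/sh
  fi
}

# Example usage:
#   pod_shell <podname>
#   pod_shell <podname> <container_name>
```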
ImagePullBackOff and ErrImagePull
The container image cannot be pulled from the image registry.
If the image is not found
Make sure to check the following:
Verify the image name.
Verify the image tag. (:latest or no tag pulls the image tagged latest, and old tags may no longer be available.)
Verify that the image is given with its full path. Also, check for the inherited Docker Hub (Artifactory or Harbor) registry links.
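When checking the tag, it helps to know which tag a given image reference resolves to. The following sketch (function name made up) extracts the tag and falls back to latest when none is given. It is a simplification: references pinned by digest (name@sha256:...) are not handled.

```shell
# Print the tag of an image reference, or "latest" when no tag is given.
# Only the last path component is inspected, so a registry port such as
# localhost:5000/app is not mistaken for a tag.
image_tag() {
  ref=$1
  last=${ref##*/}
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;
  esac
}

# Example usage:
#   image_tag docker.io/nfvpe/sriov-device-plugin:latest
#   image_tag localhost:5000/app
```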
Try to pull the Docker image from a terminal:
SSH into the node (master or worker). Generally, the following works in both PowerShell and bash:
ssh root@10.69.a.b
Run docker pull <image name>. For example:
docker pull docker.io/nfvpe/sriov-device-plugin:latest
If this method works, you can add imagePullSecrets to the Pod, see https://kubernetes.io/docs/concepts/containers/images/#specifying-imagepullsecrets-on-a-pod . Note that an image pull secret applies to a single namespace only.
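Because a pull secret is valid only in the namespace where it is created, it has to be created once per namespace. A minimal sketch follows; the function name and the REGISTRY_PASSWORD variable are choices made for this example, and every argument is a placeholder you supply.

```shell
# Create a docker-registry secret in one namespace.
# Arguments: secret name, namespace, registry server, user name.
# The password is read from the REGISTRY_PASSWORD environment variable.
create_pull_secret() {
  kubectl create secret docker-registry "$1" \
    --namespace "$2" \
    --docker-server="$3" \
    --docker-username="$4" \
    --docker-password="$REGISTRY_PASSWORD"
}

# Example usage (all values are hypothetical):
#   REGISTRY_PASSWORD=... create_pull_secret regcred dev registry.example.com alice
```

The secret name is then listed under imagePullSecrets in the Pod specification, as described in the linked documentation.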
Permission denied error
If the error is "permission denied" or "no pull access", verify that you have access to the image.
In either case, check whether you can download the image with the following command (or a similar method):
wget --user=a.b@mavenir.com --password=<password> <artifact-url>
ref: https://kubernetes.io/docs/concepts/containers/images/#using-a-private-registry
Make sure that the group (organizational team) you are in has access to the registry.
Pod unschedulable
The Pod cannot be scheduled because of insufficient resources or configuration errors.
Insufficient resources
Error messages can look like this:
No nodes are available that match all of the predicates: Insufficient cpu (2)
This means that on two nodes there isn't enough CPU available to fulfill the Pod's requests.
Where a default CPU request applies, it is typically 100m, or 10% of a CPU. The spec: containers: resources: requests field can be updated as required.
Note: The system containers in the kube-system namespace also use the cluster resources.
ref: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu
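CPU quantities can be written either in millicores (100m) or as a number of CPUs (0.1), which makes requests hard to compare at a glance. The sketch below normalizes both forms and shows one way to change a request on a deployment. The function names are made up, and targeting a Deployment is an assumption about how the workload is managed.

```shell
# Convert a Kubernetes CPU quantity ("100m", "0.5", "2") to millicores.
cpu_millicores() {
  awk -v v="$1" 'BEGIN {
    if (v ~ /m$/) { sub(/m$/, "", v); print v + 0 }
    else          { print v * 1000 }
  }'
}

# Set the CPU request on the containers of a deployment.
# Arguments: deployment name, CPU quantity.
set_cpu_request() {
  kubectl set resources deployment "$1" --requests="cpu=$2"
}

# Example usage:
#   cpu_millicores 0.5
#   set_cpu_request <deployment_name> 200m
```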
MatchNodeSelector
MatchNodeSelector means that there are no nodes that match the Pod's node selector.
Check the labels in the Pod specification's nodeSelector field, compare them with the labels on the nodes, and add the missing label to a suitable node.
ref: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
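The three checks (the Pod's selector, the node labels, adding a label) map to standard kubectl commands. They are wrapped here in small functions as a sketch; the function names are hypothetical, and the label key and value in the usage comment are examples only.

```shell
# Print the nodeSelector of a pod.
show_node_selector() {
  kubectl get pod "$1" -o jsonpath='{.spec.nodeSelector}'
}

# List all nodes together with their labels.
show_node_labels() {
  kubectl get nodes --show-labels
}

# Add a label to a node. Arguments: node name, label key, label value.
label_node() {
  kubectl label nodes "$1" "$2=$3"
}

# Example usage (label key and value are hypothetical):
#   show_node_selector <podname>
#   show_node_labels
#   label_node <node_name> disktype ssd
```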
PodToleratesNodeTaints
PodToleratesNodeTaints means that the Pod can't be scheduled to any node because it does not tolerate the taints on the nodes.
1. You can patch the node to make it schedulable:
kubectl patch node 10.ab.cd.efg -p '{"spec":{"unschedulable":false}}'
For example, kubectl patch node 10.20.30.123 -p '{"spec":{"unschedulable":false}}'
2. Or you can remove the taint, where <key> is the taint key:
kubectl taint nodes <node_name> <key>:NoSchedule-
ref:
1. Worker nodes have status of Ready,SchedulingDisabled · Issue #3713 · kubernetes/autoscaler
2. https://cloud.google.com/kubernetes-engine/docs/how-to/node-taints
3. https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
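Before removing a taint, it is useful to see which taints a node actually carries, since the key is needed for removal. A minimal sketch with made-up function names:

```shell
# Print the taints currently set on a node.
node_taints() {
  kubectl get node "$1" -o jsonpath='{.spec.taints}'
}

# Remove a NoSchedule taint from a node. Arguments: node name, taint key.
remove_noschedule_taint() {
  kubectl taint nodes "$1" "$2:NoSchedule-"
}

# Example usage:
#   node_taints <node_name>
#   remove_noschedule_taint <node_name> <key>
```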
PodFitsHostPorts
The host port that the Pod requests is already in use on the node. Change the hostPort value in the Pod specification.
Does not have minimum availability
If there is no availability even though a node has sufficient resources, the nodes might be in SchedulingDisabled or Cordoned status.
Get the nodes to see their status:
kubectl get nodes
If scheduling is disabled, try uncordoning the node:
kubectl uncordon <node_name>
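On a larger cluster, the cordoned nodes can be picked out of the node list automatically. The sketch below assumes the usual kubectl get nodes column layout (name first, status second); the function name is made up.

```shell
# Print the names of nodes whose status includes SchedulingDisabled.
cordoned_nodes() {
  kubectl get nodes --no-headers | awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Example usage: uncordon every cordoned node (review the list first).
#   for node in $(cordoned_nodes); do kubectl uncordon "$node"; done
```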
pod has unbound immediate persistentvolumeclaims · Issue #237 · hashicorp/consul-helm (github.com)
Check that the PVCs are in the Bound state.
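A quick way to find the claims that are not yet bound is to filter the PVC list on its status column. This sketch assumes the standard kubectl get pvc layout (name first, status second) and the "default" namespace unless another one is given; the function name is made up.

```shell
# Print the names of PVCs in a namespace that are not in the Bound state.
unbound_pvcs() {
  kubectl get pvc -n "${1:-default}" --no-headers | awk '$2 != "Bound" { print $1 }'
}

# Example usage:
#   unbound_pvcs <namespace>
#   kubectl describe pvc <pvc_name> -n <namespace>
```

The describe output of an unbound claim usually shows in its events why no volume could be bound.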
Pods stuck in Terminating state
kubelet Unable to attach or mount volumes
kubelet Unable to attach or mount volumes: unmounted volumes=[config-volume], unattached volumes=[sriov-device-plugin-token-ts8p5 devicesock log config-volume device-info]: timed out waiting for the condition
Solution: Do not mount the same PVC twice in the same pod. See https://stackoverflow.com/q/69544012
References
Few troubleshooting guides