Kubernetes troubleshooting guide
kubectl command isn't found

Make sure you are on the Deployment Unit (DU) or the master node of the cluster. The kubectl utility is not available on worker nodes.

If you are on the DU and kubectl is still not available, make sure kubectl is added to $PATH.
To remove the group-readable permission on the kubeconfig file:

chmod g-r ~/.kube/config

To remove the world-readable permission:

chmod o-r ~/.kube/config
This can happen even though the configuration file is available: it means the kubectl interface is unable to access the credentials.
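A common workaround, assuming the kubeconfig sits at the conventional default path (adjust for your cluster), is to point kubectl at the file explicitly:

```shell
# Point kubectl at the kubeconfig file explicitly.
# ~/.kube/config is the conventional default; adjust if yours differs.
export KUBECONFIG="$HOME/.kube/config"
```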
CrashLoopBackOff
A container is repeatedly crashing and restarting. There are multiple possible causes for this error; use the pod logs for additional clues.
kubectl logs <podname>
where <podname> is the problematic pod. If a previous instance of the pod exists, you can pass the -p flag to see its logs as well.
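For example (the pod name below is hypothetical):

```shell
# Logs of the current container instance
kubectl logs my-app-7d4b9c
# Logs of the previous (crashed) instance of the same container
kubectl logs my-app-7d4b9c -p
```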
Crashed containers are restarted with an exponentially increasing delay, capped at five minutes.
Describe the problematic pod:

kubectl describe pod <podname>

replacing <podname> with the pod name. Check the Containers: <container_name>: Last State: Exit Code field:
If the exit code is 1, the container crashed because the application crashed.

If the exit code is 0, verify how long your application ran.

Containers exit when the application's main process exits. If the application finishes quickly, the container keeps trying to restart.
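The exit code can also be read directly with a JSONPath query (the pod name is hypothetical):

```shell
# Print the exit code of the last terminated state of the first container
kubectl get pod my-app-7d4b9c \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```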
Example when the exit code is 0:

STEP 1: Identify the problem. A pod is in CrashLoopBackOff status.

STEP 2: Gather information. Collect the exit code (0 or 1), the pod IP, missing files, any error messages, and the reported reasons.
Example when the exit code is 137:

Exit Code: 137 for the container means the process was killed, typically by an out-of-memory (OOM) event.

STEP 1: Identify the problem. A pod is in CrashLoopBackOff status, then comes back to Running state again.

STEP 2: Gather information. Collect the exit code, the pod IP, missing files, any error messages, and the reported reasons.
STEP 3: Problem analysis
Here the container error concerns JSON file validation: some entries in the JSON file have no values, so the validation failed.
STEP 4: Resolution and Root cause
Here we agree on the resolution and the probable root cause.
Shell into the pod:
kubectl exec -it <podname> -- /bin/bash
If the pod has multiple containers, use -c <container_name> to target a specific container.

You now have a shell inside the pod's container, where you can check network connectivity, file access, databases, and so on.
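A few typical checks run from outside via kubectl exec (the pod name, file path, and service host below are hypothetical, and the tools must exist in the container image):

```shell
# Check that an expected config file is present inside the container
kubectl exec my-app-7d4b9c -- ls -l /etc/myapp/config.json
# Check DNS resolution of an in-cluster service from inside the container
kubectl exec my-app-7d4b9c -- nslookup my-database.default.svc.cluster.local
```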
ImagePullBackOff and ErrImagePull
The container image cannot be pulled from the image registry. Check the following:

Verify the image name.

Verify the image tag. (:latest or no tag pulls the latest image, and old tags may no longer be available.)

Verify that the image name includes the full registry path, and check any inherited registry links (Docker Hub, Artifactory, or Harbor).
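The exact pull error is usually visible in the pod's events (the pod name is hypothetical):

```shell
# The Events section at the bottom of the output shows the image pull failures
kubectl describe pod my-app-7d4b9c
```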
Try to pull the docker image manually from a terminal:

SSH into the node (master or worker); generally, ssh root@10.69.a.b works from both PowerShell and bash.

Run docker pull <image name>. For example, docker pull docker.io/nfvpe/sriov-device-plugin:latest.
If the error is "permission denied" or "no pull access", verify that you have access to the image. In either case, check whether you can download the image manually (or by a similar method). Make sure that your group (organizational team) has access to the registry.
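If the registry requires authentication, a pull secret can be created and then referenced from the Pod specification (the secret name, registry URL, and credentials below are placeholders):

```shell
# Create a docker-registry secret in the pod's namespace
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=myuser \
  --docker-password=mypassword
```

The secret is then listed under imagePullSecrets in the Pod spec; such secrets are scoped to a single namespace.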
Pending pods

The pod cannot be scheduled due to insufficient resources or configuration errors.
Error messages can be like,
No nodes are available that match all of the predicates: Insufficient cpu (2), which means that on the two candidate nodes there isn't enough CPU available to fulfill the pod's requests.
The default CPU request is 100m, i.e. 10% of one CPU. The spec.containers[].resources.requests field can be updated as required.

Note: the system containers in the kube-system namespace also consume cluster resources.
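The requests can also be adjusted on an existing workload with kubectl (the deployment name and values below are placeholders):

```shell
# Update the CPU and memory requests of a deployment's containers
kubectl set resources deployment my-app --requests=cpu=100m,memory=128Mi
```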
MatchNodeSelector means that no node matches the Pod's label selector.

Check the labels in the Pod specification's nodeSelector field against the labels on the nodes, and add a matching label to a node if needed.
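The node labels can be listed and added with standard kubectl commands (the node name and label key/value below are placeholders):

```shell
# Show all nodes with their labels
kubectl get nodes --show-labels
# Add a label to a node so it matches the Pod's nodeSelector
kubectl label nodes node-1 disktype=ssd
```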
PodToleratesNodeTaints means that the Pod cannot be scheduled on any node because no node's taints are tolerated by the Pod.
1. You can mark the node as schedulable again by patching it:
kubectl patch node 10.ab.cd.efg -p '{"spec":{"unschedulable":false}}'
For example, kubectl patch node 10.20.30.123 -p '{"spec":{"unschedulable":false}}'
2. Or you can remove the taint:
kubectl taint nodes <node_name> key:NoSchedule-
The port requested by the Pod is already in use on the node. Change the port in the Pod specification.
If pods cannot be scheduled even though a node has sufficient resources, the node might be in SchedulingDisabled or Cordoned status.
Get the nodes to see their status:
kubectl get nodes
If scheduling is disabled, try uncordoning the node:
kubectl uncordon <node_name>
Also check that the PVCs are in Bound state.
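For example (the namespace is a placeholder):

```shell
# The STATUS column should read Bound for every claim the pod mounts
kubectl get pvc -n my-namespace
```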
kubelet Unable to attach or mount volumes: unmounted volumes=[config-volume], unattached volumes=[sriov-device-plugin-token-ts8p5 devicesock log config-volume device-info]: timed out waiting for the condition
For example, to check whether you can download the image artifact directly:

wget --user=a.b@ --password=AKCrK13VegV5bKtVZ <artifact-url>

If this works, you may need to add an image pull secret to the Pod specification. Note that image pull secrets are scoped to a single namespace.
Solution: do not mount a single PVC twice on the same pod.