Kubernetes Troubleshooting

Reference articles for the Kubernetes error states that show up most often in production: pods stuck in CrashLoopBackOff, containers killed by OOM, images that refuse to pull, nodes that go NotReady, and the cascade of scheduling failures that follows.

Each article starts from the symptom you actually see in kubectl get pods or your alerting, walks through the likely causes in order of probability, and ends with a verification step so you know the fix stuck.


Articles

  1. How to configure Kubernetes health probes: liveness, readiness, and startup

    Kubernetes health probes tell the kubelet when to restart a container, when to stop sending it traffic, and when to wait for a slow boot. Misconfigured probes are one of the most common causes of CrashLoopBackOff and cascading outages. This article walks through all three probe types, the four probe mechanisms, timing parameters, and the configuration patterns that keep workloads stable in production.

    1747 words
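    As a taste of what the article covers, here is a minimal sketch of all three probe types on one container; the endpoint paths, port, and timings are illustrative, not prescribed:

    ```yaml
    # Illustrative probe configuration; paths and timings are examples.
    livenessProbe:
      httpGet:
        path: /healthz        # assumed health endpoint
        port: 8080
      periodSeconds: 10
      failureThreshold: 3     # restart after ~30s of consecutive failures
    readinessProbe:
      httpGet:
        path: /ready          # assumed readiness endpoint
        port: 8080
      periodSeconds: 5
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30    # allow up to 30 x 10s = 300s for a slow boot
    ```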
  2. CrashLoopBackOff: why your Kubernetes pod keeps restarting

    CrashLoopBackOff is not an error. It is a status that tells you a container inside your pod is starting, crashing, and being restarted in a loop with increasing delays. This article walks through what the status means, how to read exit codes and logs, the most common root causes, and how to fix each one.

    2001 words
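    For orientation, this is roughly the status fragment kubectl get pod -o yaml shows while the loop is running (container name, counts, and codes are illustrative):

    ```yaml
    # Illustrative pod status during a crash loop.
    status:
      containerStatuses:
      - name: app
        restartCount: 7
        state:
          waiting:
            reason: CrashLoopBackOff
        lastState:
          terminated:
            exitCode: 1       # the container's own exit code; 137 would mean SIGKILL
            reason: Error
    ```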
  3. ImagePullBackOff: fixing Kubernetes container image pull failures

    ImagePullBackOff means the kubelet failed to pull a container image and is retrying with exponential backoff. The root cause is always in the Events section of kubectl describe pod: a typo in the image reference, missing registry credentials, Docker Hub rate limits, or a network problem between the node and the registry. This article walks through each cause, how to diagnose it, and how to fix it.

    1930 words
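    For the missing-credentials case, the usual shape of the fix is a registry Secret referenced from the pod spec; the Secret name and image reference below are placeholders:

    ```yaml
    # Illustrative pod spec referencing registry credentials.
    # regcred is a placeholder kubernetes.io/dockerconfigjson Secret.
    spec:
      imagePullSecrets:
      - name: regcred
      containers:
      - name: app
        image: registry.example.com/team/app:1.4.2   # example image reference
    ```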
  4. ContainerCreating stuck: debugging pods that never start

    ContainerCreating means the kubelet is setting up your pod's prerequisites (volumes, network, secrets) but something is blocking it. Unlike CrashLoopBackOff, the container never actually starts. The fix depends on which prerequisite is stuck: a PVC that will not bind, a missing Secret, a broken CNI plugin, or an init container that never finishes. This article walks through each cause, how to identify it from kubectl events, and how to resolve it.

    2231 words
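    The volume case, for example, has this shape in the pod spec; if the claim cannot bind or the volume cannot attach, the pod sits in ContainerCreating (names and paths are illustrative):

    ```yaml
    # Illustrative pod mounting a PVC that may be blocking startup.
    spec:
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-pvc        # placeholder claim name
      containers:
      - name: app
        volumeMounts:
        - name: data
          mountPath: /var/lib/app    # example mount path
    ```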
  5. OOMKilled: Kubernetes out of memory errors explained

    OOMKilled means the Linux kernel terminated your container because it exceeded its memory limit. The container exits with code 137 (SIGKILL), the kubelet restarts it, and without intervention it will keep dying in a loop. This article covers how OOMKilled works at the kernel level, how to distinguish it from node-level OOM and eviction, how to diagnose the root cause, and how to right-size memory limits for JVM, Go, Node.js, and Python workloads.

    2355 words
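    The knob in question is the container's memory limit; the numbers below are examples, and the right values depend on the workload:

    ```yaml
    # Illustrative memory sizing; tune the values to your workload.
    resources:
      requests:
        memory: "512Mi"   # what the scheduler reserves on the node
      limits:
        memory: "1Gi"     # the cgroup ceiling; exceeding it triggers the OOM kill (exit 137)
    ```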
  6. Pod stuck in Pending: why Kubernetes cannot schedule your workload

    A pod in Pending state has been accepted by the API server but no node can run it yet. The scheduler evaluated every node, found zero that pass all filters, and is waiting for conditions to change. The fix depends entirely on which filter failed: insufficient CPU or memory, a taint without a matching toleration, a node affinity mismatch, an unbound PersistentVolumeClaim, or a ResourceQuota that blocks pod creation before scheduling even starts.

    2168 words
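    The taint case, for instance, is resolved by adding a toleration that matches the node's taint; the key and value here are made up for illustration:

    ```yaml
    # Illustrative toleration matching a hypothetical taint applied with:
    #   kubectl taint nodes node1 team=payments:NoSchedule
    spec:
      tolerations:
      - key: "team"
        operator: "Equal"
        value: "payments"
        effect: "NoSchedule"
    ```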
  7. Node NotReady: diagnosing Kubernetes node failures

    A node in NotReady state has stopped sending heartbeats to the control plane. The kubelet is either down, unreachable, or actively reporting that a health condition has failed. Pods on the node face eviction within five minutes. This article covers how to read node conditions, diagnose the root cause (kubelet crash, container runtime failure, resource pressure, network partition, certificate expiry), and recover or replace the node safely.

    1870 words
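    The node conditions the article teaches you to read look roughly like this in kubectl get node -o yaml when heartbeats have stopped (values illustrative):

    ```yaml
    # Illustrative node status; Ready=Unknown means the control plane
    # has stopped receiving kubelet heartbeats.
    status:
      conditions:
      - type: Ready
        status: "Unknown"
        reason: NodeStatusUnknown
        message: Kubelet stopped posting node status.
      - type: MemoryPressure
        status: "False"
    ```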
  8. kubectl debug and ephemeral containers: debugging running pods

    Distroless and minimal container images have no shell, no package manager, and no debugging tools. kubectl exec fails immediately. kubectl debug solves this by injecting an ephemeral container with the tools you need into a running pod, without restarting it. This guide covers the three kubectl debug modes: ephemeral containers with --target, pod copies with --copy-to, and node-level debugging.

    1673 words
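    Under the hood, kubectl debug appends an entry to the pod's ephemeralContainers list, roughly like this; the generated name and tools image are examples:

    ```yaml
    # Illustrative ephemeralContainers entry as created by kubectl debug.
    spec:
      ephemeralContainers:
      - name: debugger-x7k2p       # example generated name
        image: busybox             # any image with the tools you need
        targetContainerName: app   # set by --target; shares its process namespace
        stdin: true
        tty: true
    ```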
  9. Kubernetes DNS troubleshooting: CoreDNS failures and resolution issues

    When pods cannot resolve DNS names, nothing works. Service-to-service calls fail, external API requests time out, and the application logs fill up with connection errors. The root cause sits somewhere in the DNS chain: the pod's /etc/resolv.conf, the kube-dns Service, CoreDNS itself, or the upstream resolver. This article walks through each layer with concrete diagnostic commands and fixes.

    2322 words
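    At the CoreDNS layer, the configuration under inspection is the Corefile in the coredns ConfigMap in kube-system; abridged, a stock one looks roughly like this:

    ```yaml
    # Abridged default CoreDNS ConfigMap (kube-system/coredns).
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: coredns
      namespace: kube-system
    data:
      Corefile: |
        .:53 {
            errors
            health
            kubernetes cluster.local in-addr.arpa ip6.arpa
            forward . /etc/resolv.conf   # upstream resolver; a frequent failure point
            cache 30
        }
    ```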
  10. Kubernetes graceful shutdown: handling SIGTERM and pod termination

    When Kubernetes terminates a pod, your application has a limited window to drain connections, finish in-flight requests, and clean up resources before it is forcefully killed. Getting this wrong is the most common source of 502 errors during deployments. This article covers the pod termination lifecycle, the endpoint removal race condition, preStop hooks, signal handling in Go, Node.js, Java, and Python, and how to test that your shutdown is actually graceful.

    1745 words
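    The two knobs on the Kubernetes side look like this; the 5-second sleep is an example value chosen to cover the endpoint removal race:

    ```yaml
    # Illustrative shutdown settings; tune the numbers to your drain time.
    spec:
      terminationGracePeriodSeconds: 30   # default; SIGKILL arrives after this window
      containers:
      - name: app
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "5"]     # wait for endpoint removal to propagate
    ```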
  11. Kubernetes namespace stuck in Terminating: how to find and fix the finalizer holding it

    kubectl delete namespace returns immediately, but the namespace then sits in Terminating forever. The cause is almost always a finalizer that no controller is removing, either on a resource inside the namespace or on the namespace's own kubernetes finalizer when an APIService is unavailable. This article walks through the finalizer mechanism, how to identify the exact resource blocking deletion, and how to remove the finalizer without skipping the cleanup work it was guarding.

    2694 words
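    The shape of the problem in kubectl get namespace -o yaml is a deletionTimestamp that has been set plus a finalizer that never clears; namespace name and timestamp below are illustrative:

    ```yaml
    # Illustrative stuck namespace: deletionTimestamp is set but the
    # kubernetes finalizer is still present.
    apiVersion: v1
    kind: Namespace
    metadata:
      name: staging                            # example namespace
      deletionTimestamp: "2025-06-01T10:00:00Z"
    spec:
      finalizers:
      - kubernetes      # cleared only once all namespaced resources are gone
    status:
      phase: Terminating
    ```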
  12. Migrating from ingress-nginx to Gateway API

    The ingress-nginx repository was archived on March 24, 2026. No more security patches, no bug fixes, no releases. If your cluster still relies on ingress-nginx for L7 traffic routing, migrating to Gateway API is no longer a nice-to-have. This guide walks through the full migration: choosing an implementation, converting manifests with ingress2gateway, running both controllers in parallel, wiring up cert-manager, and cutting over DNS without downtime.

    2212 words
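    For scale, a simple Ingress host-and-path rule translates into an HTTPRoute attached to a shared Gateway, roughly like this; the Gateway name, hostname, and backend are placeholders:

    ```yaml
    # Illustrative Gateway API equivalent of a simple Ingress rule.
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: app-route
    spec:
      parentRefs:
      - name: shared-gateway      # placeholder Gateway name
      hostnames:
      - app.example.com
      rules:
      - matches:
        - path:
            type: PathPrefix
            value: /
        backendRefs:
        - name: app               # placeholder backend Service
          port: 80
    ```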
  13. Kubernetes pod eviction: node pressure, disk pressure, and Evicted status

    Evicted pods in kubectl get pods are the kubelet's signal that a node ran out of memory, disk, or PIDs. The kubelet picks pods to terminate using a three-step ranking that does not use QoS class as a direct input, then sets status.phase=Failed and status.reason=Evicted. This article covers how to read the eviction reason, clean up the leftover pod objects, identify which pressure caused it, and stop it from happening again.

    3153 words
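    The pressure thresholds that trigger all of this live in the kubelet configuration; the stock hard-eviction defaults are:

    ```yaml
    # Default kubelet hard eviction thresholds (KubeletConfiguration).
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    evictionHard:
      memory.available: "100Mi"
      nodefs.available: "10%"
      nodefs.inodesFree: "5%"
      imagefs.available: "15%"
    ```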
  14. Pod Pending: "didn't match pod topology spread constraints" with untolerated taints

    A pod with topology spread constraints stays Pending even though the cluster has free capacity. The scheduler reports both untolerated taints and unsatisfied topology spread constraints in the same FailedScheduling event. The cause is the default nodeTaintsPolicy: Ignore, which counts unreachable tainted nodes in the spread math and creates a deadlock in multi-tenant clusters. The fix is to set nodeTaintsPolicy: Honor on the constraint.

    1652 words
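    The fix described above is a one-line change on the constraint; the topology key is the standard well-known zone label, and the pod label is an example:

    ```yaml
    # Topology spread constraint with nodeTaintsPolicy: Honor, so tainted
    # nodes the pod cannot land on are excluded from the spread math.
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        nodeTaintsPolicy: Honor
        labelSelector:
          matchLabels:
            app: web      # example pod label
    ```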

Recurring server or deployment issues?

I help teams make production reliable with CI/CD, Kubernetes, and cloud—so fixes stick and deploys stop being stressful.

Explore DevOps consultancy
