Kubernetes Troubleshooting

Reference articles for the Kubernetes error states that show up most often in production: pods stuck in CrashLoopBackOff, containers killed by OOM, images that refuse to pull, nodes that go NotReady, and the cascade of scheduling failures that follows.

Each article starts from the symptom you actually see in kubectl get pods or your alerting, walks through the likely causes in order of probability, and ends with a verification step so you know the fix stuck.

Articles

  1. How to configure Kubernetes health probes: liveness, readiness, and startup

    Kubernetes health probes tell the kubelet when to restart a container, when to stop sending it traffic, and when to wait for a slow boot. Misconfigured probes are one of the most common causes of CrashLoopBackOff and cascading outages. This article walks through all three probe types, the four probe mechanisms, timing parameters, and the configuration patterns that keep workloads stable in production.

    1690 words
  2. CrashLoopBackOff: why your Kubernetes pod keeps restarting

    CrashLoopBackOff is not an error. It is a status that tells you a container inside your pod is starting, crashing, and being restarted in a loop with increasing delays. This article walks through what the status means, how to read exit codes and logs, the most common root causes, and how to fix each one.

    2001 words
  3. ImagePullBackOff: fixing Kubernetes container image pull failures

    ImagePullBackOff means the kubelet failed to pull a container image and is retrying with exponential backoff. The root cause almost always appears in the Events section of kubectl describe pod: a typo in the image reference, missing registry credentials, Docker Hub rate limits, or a network problem between the node and the registry. This article walks through each cause, how to diagnose it, and how to fix it.

    1930 words
  4. ContainerCreating stuck: debugging pods that never start

    ContainerCreating means the kubelet is setting up your pod's prerequisites (volumes, network, secrets) but something is blocking it. Unlike CrashLoopBackOff, the container never actually starts. The fix depends on which prerequisite is stuck: a PVC that will not bind, a missing Secret, a broken CNI plugin, or an init container that never finishes. This article walks through each cause, how to identify it from kubectl events, and how to resolve it.

    2231 words
  5. OOMKilled: Kubernetes out of memory errors explained

    OOMKilled means the Linux kernel terminated your container because it exceeded its memory limit. The container exits with code 137 (128 + 9, the signal number of SIGKILL), the kubelet restarts it, and without intervention it will keep dying in a loop. This article covers how OOMKilled works at the kernel level, how to distinguish it from node-level OOM and eviction, how to diagnose the root cause, and how to right-size memory limits for JVM, Go, Node.js, and Python workloads.

    2274 words
  6. Pod stuck in Pending: why Kubernetes cannot schedule your workload

    A pod in Pending state has been accepted by the API server but no node can run it yet. The scheduler evaluated every node, found zero that pass all filters, and is waiting for conditions to change. The fix depends entirely on which filter failed: insufficient CPU or memory, a taint without a matching toleration, a node affinity mismatch, an unbound PersistentVolumeClaim, or a ResourceQuota that blocks pod creation before scheduling even starts.

    1991 words
  7. Node NotReady: diagnosing Kubernetes node failures

    A node in NotReady state has either stopped sending heartbeats to the control plane or is reporting that it is unhealthy: the kubelet is down, unreachable, or actively reporting that a health condition has failed. With the default tolerations, pods on the node face eviction after about five minutes. This article covers how to read node conditions, diagnose the root cause (kubelet crash, container runtime failure, resource pressure, network partition, certificate expiry), and recover or replace the node safely.

    1837 words
  8. kubectl debug and ephemeral containers: debugging running pods

    Distroless and minimal container images have no shell, no package manager, and no debugging tools. kubectl exec fails immediately. kubectl debug solves this by injecting an ephemeral container with the tools you need into a running pod, without restarting it. This guide covers the three kubectl debug modes: ephemeral containers with --target, pod copies with --copy-to, and node-level debugging.

    1673 words
  9. Kubernetes DNS troubleshooting: CoreDNS failures and resolution issues

    When pods cannot resolve DNS names, nothing works. Service-to-service calls fail, external API requests time out, and the application logs fill up with connection errors. The root cause sits somewhere in the DNS chain: the pod's /etc/resolv.conf, the kube-dns Service, CoreDNS itself, or the upstream resolver. This article walks through each layer with concrete diagnostic commands and fixes.

    2322 words
  10. Kubernetes graceful shutdown: handling SIGTERM and pod termination

    When Kubernetes terminates a pod, your application has a limited window to drain connections, finish in-flight requests, and clean up resources before it is forcefully killed. Getting this wrong is a common source of 502 errors during deployments. This article covers the pod termination lifecycle, the endpoint removal race condition, preStop hooks, signal handling in Go, Node.js, Java, and Python, and how to test that your shutdown is actually graceful.

    1745 words
  11. Migrating from ingress-nginx to Gateway API

    The ingress-nginx repository was archived on March 24, 2026. No more security patches, no bug fixes, no releases. If your cluster still relies on ingress-nginx for L7 traffic routing, migrating to Gateway API is no longer a nice-to-have. This guide walks through the full migration: choosing an implementation, converting manifests with ingress2gateway, running both controllers in parallel, wiring up cert-manager, and cutting over DNS without downtime.

    2212 words
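
Several of the summaries above lean on container exit codes, such as the 137 in the OOMKilled entry. As a rough illustration of the arithmetic, here is a minimal Python sketch that decodes an exit code using the common "128 + signal number" convention. The function name describe_exit_code and the small signal table are hypothetical helpers written for this example, not part of kubectl or any Kubernetes library, and the signal numbers assume Linux.

```python
# Hypothetical helper for illustration only; not part of any Kubernetes tooling.
# Signal numbers are the standard Linux values for these signals.
SIGNAL_NAMES = {
    2: "SIGINT",
    6: "SIGABRT",
    9: "SIGKILL",
    11: "SIGSEGV",
    15: "SIGTERM",
}


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        # Codes above 128 conventionally mean "killed by signal (code - 128)".
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    # Anything else is an exit status chosen by the application itself.
    return f"application error (exit status {code})"


if __name__ == "__main__":
    for code in (0, 1, 137, 143):
        print(code, "->", describe_exit_code(code))
```

Note that 137 only says the process received SIGKILL. To confirm an OOM kill you would still check that the container's last terminated state reports the reason OOMKilled, since other actors can send SIGKILL too.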
