How to configure Kubernetes health probes: liveness, readiness, and startup

Kubernetes health probes tell the kubelet when to restart a container, when to stop sending it traffic, and when to wait for a slow boot. Misconfigured probes are one of the most common causes of CrashLoopBackOff and cascading outages. This article walks through all three probe types, the four probe mechanisms, timing parameters, and the configuration patterns that keep workloads stable in production.

What each probe type does and when it fires

Three probes, three different consequences on failure. Getting this wrong is the root cause of most probe-related incidents.

Liveness probe. Runs continuously at periodSeconds intervals. If the probe fails failureThreshold consecutive times, the kubelet restarts the container (not the pod; the pod object stays). Use it to recover from deadlocks or stuck processes that will never self-heal.

Readiness probe. Also runs continuously, not just at startup. On failure, the kubelet removes the pod from all matching Service endpoints. The container keeps running. Traffic stops arriving. Once the probe passes again, the pod is re-added. Use it to signal that a container is temporarily unable to serve requests.

Startup probe. Runs only during container startup, repeating at periodSeconds intervals until it first succeeds. It blocks liveness and readiness probes until then. If it fails failureThreshold consecutive times, the kubelet restarts the container. Once it passes, it never runs again for the life of the container. Use it for containers with slow or variable boot times.

Probe       Failure action                        Runs when             Blocks other probes?
Startup     Container restart (after threshold)   During startup only   Yes
Liveness    Container restart (after threshold)   Continuously          No
Readiness   Pod removed from endpoints            Continuously          No

A common misconception: readiness probes are not a startup-only mechanism. They run for the entire lifetime of the pod. A pod that was ready five minutes ago can become not-ready at any time if the probe starts failing.
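
To see this live, watch the Service's endpoints while a readiness probe fails; the pod drops out of the list and returns when the probe passes again (my-service is a placeholder for your Service name):

# A pod failing its readiness probe disappears from the endpoints list
# and reappears once the probe passes again
kubectl get endpoints my-service -w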

The four probe mechanisms

Each probe type supports four mechanisms. Pick the one that matches your service's interface.

HTTP probe (httpGet)

The kubelet sends an HTTP GET request. Status codes 200 through 399 count as success. Anything else is a failure.

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:            # optional; kubelet sends User-Agent: kube-probe by default
      - name: X-Custom
        value: probe
  periodSeconds: 10
  failureThreshold: 3

Best for: any service exposing HTTP. The most common choice.

TCP probe (tcpSocket)

The kubelet attempts a TCP connection. Success means the connection was established. It does not verify the application is actually processing requests.

readinessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10
  failureThreshold: 3

Best for: databases, message brokers, and other TCP services that do not expose HTTP.

Exec probe (exec)

The kubelet runs a command inside the container. Exit code 0 is success; anything else is failure.

livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy
  periodSeconds: 10
  failureThreshold: 3

Known issue: exec probes spawn a child process for every execution. If your container's PID 1 is not an init system, those children become zombie processes. At low periodSeconds with many pods, this exhausts the PID space on the node. Use tini or dumb-init as PID 1 if you rely on exec probes.
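
A minimal sketch of that fix, assuming a Debian-based image and an application binary copied to /usr/local/bin/app:

# Dockerfile sketch: run tini as PID 1 so it reaps children spawned by exec probes
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y tini && rm -rf /var/lib/apt/lists/*
COPY app /usr/local/bin/app
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["/usr/local/bin/app"]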

Best for: custom health logic that cannot be expressed as HTTP or TCP. Prefer HTTP or TCP when possible.

gRPC probe (grpc)

The kubelet calls the gRPC Health Checking Protocol (grpc.health.v1.Health/Check). GA since Kubernetes 1.27; no feature gate required.

readinessProbe:
  grpc:
    port: 50051          # must be numeric; named ports are not supported
  periodSeconds: 10
  failureThreshold: 3

Limitations: no client certificate support, no TLS certificate validation, no service name chaining.

Best for: gRPC services that implement the standard health checking protocol.
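
The probe only passes if the server registers the standard health service. A minimal sketch using the google.golang.org/grpc/health package (the port matches the probe above; everything else is illustrative):

package main

import (
    "log"
    "net"

    "google.golang.org/grpc"
    "google.golang.org/grpc/health"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatal(err)
    }
    srv := grpc.NewServer()

    // Register grpc.health.v1.Health, the service the kubelet's gRPC probe calls
    hs := health.NewServer()
    healthpb.RegisterHealthServer(srv, hs)
    hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)

    log.Fatal(srv.Serve(lis))
}

Flip the status to NOT_SERVING during shutdown or a dependency outage, and a readiness probe using this mechanism fails without restarting the container.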

Timing parameters

Six parameters control how fast probes fire, how long they wait, and how many failures trigger action.

Parameter                      Default             What it controls
initialDelaySeconds            0                   Seconds before the first probe fires
periodSeconds                  10                  Seconds between probe executions
timeoutSeconds                 1                   Seconds the kubelet waits for a response
successThreshold               1                   Consecutive successes to mark healthy
failureThreshold               3                   Consecutive failures before action
terminationGracePeriodSeconds  Inherits pod-level  Override for probe-triggered restarts (1.25+)

Default values are documented in the official Kubernetes probe configuration reference. With the defaults, for example, a failing liveness probe triggers a restart after roughly failureThreshold x periodSeconds = 3 x 10 = 30 seconds.

Two constraints to know:

  • successThreshold for liveness and startup probes must be 1. The API rejects any other value.
  • terminationGracePeriodSeconds at the probe level (available since Kubernetes 1.25) overrides the pod-level value. This is useful when a deadlock recovery should restart fast but a normal shutdown needs a long drain window.
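
A sketch of that pattern; the 120-second and 10-second values are illustrative:

spec:
  terminationGracePeriodSeconds: 120    # normal shutdown: long connection drain
  containers:
    - name: app
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 3
        terminationGracePeriodSeconds: 10   # deadlocked process: kill it fast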

Configure startup probes for slow-booting applications

Before the startup probe existed, the only option for slow-starting containers was a large initialDelaySeconds on the liveness probe. That creates a fixed wait even on fast startups and does not adapt to variable boot times.

The startup probe solves this. It gates liveness and readiness until the application explicitly signals that it has finished initializing. The formula:

failureThreshold x periodSeconds >= worst-case startup time

A Java application that takes up to 3 minutes to start:

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 18     # 18 x 10 = 180 seconds = 3 minutes
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

Once the startup probe succeeds, it never runs again. The liveness probe takes over from that point.

Design your health endpoints

The single most important rule: never check external dependencies in a liveness probe. If your database goes down and every pod's liveness probe checks database connectivity, every pod restarts simultaneously. The restart storm compounds the outage instead of recovering from it.

Separate your liveness and readiness endpoints:

/healthz (liveness): returns 200 as long as the process can handle an HTTP request at all. Checks nothing external. Think of it as "is this process stuck?" If the answer is no, return 200.

// Liveness: prove the process can respond
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
})

/ready (readiness): returns 200 if the application has finished initialization and critical dependencies are reachable. Returns 503 during startup, cache warm-up, or dependency unavailability.

// Readiness: verify the app can serve real requests
http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
    if !appReady || dbPool.Ping() != nil { // database/sql's Ping returns an error, not a bool
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
})

Even for readiness, be careful with dependency checks. Set timeoutSeconds above the P99 response time of the dependency and raise failureThreshold above the default 3. A latency spike on a shared database should not simultaneously remove every pod from every Service.
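
For example, a readiness probe tuned for a dependency with a P99 around 1.2 seconds might look like this (the numbers are illustrative):

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3      # comfortably above the dependency's ~1.2 s P99
  failureThreshold: 5    # ride out brief latency spikes instead of dropping all traffic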

Common misconfiguration patterns

Liveness probe kills slow-starting containers

Symptom: pod enters CrashLoopBackOff immediately after deployment. kubectl describe pod shows Liveness probe failed events before the application finishes booting.

Fix: add a startup probe with failureThreshold x periodSeconds covering the worst-case boot time. Remove any large initialDelaySeconds from the liveness probe.

Identical liveness and readiness configuration

Symptom: under load, pods are simultaneously removed from endpoints (readiness) and restarted (liveness). Active connections drop without graceful shutdown.

Fix: give liveness a higher failureThreshold than readiness. Readiness should be the first line of defense (stop traffic), liveness the last resort (restart).

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3      # removed from traffic after 15 seconds
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 6      # restarted after 60 seconds, not 15

External dependency in liveness probe

Symptom: a transient database outage causes all pods to restart at once. Recovery takes minutes instead of seconds.

Fix: move the dependency check to the readiness probe. Keep the liveness endpoint internal-only.

Wrong path or port

Symptom: probe returns 404 or connection refused from the first attempt. Pod restarts before it ever serves traffic.

Diagnosis:

kubectl exec <pod> -- curl -v http://localhost:8080/healthz

Fix: match the probe path and port to what the application actually binds. Check your Dockerfile EXPOSE directive and your application's listen configuration. Note that Kubernetes ignores the Docker HEALTHCHECK directive entirely; it does not substitute for probe configuration.
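
If the image does not ship curl, listing the listening sockets works too (assuming ss is available inside the container):

# Confirm which ports the application actually binds
kubectl exec <pod> -- ss -tlnp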

Readiness timeout shorter than dependency latency

Symptom: all pods go not-ready during load spikes even though the application itself is functional. The readiness endpoint queries a dependency that responds in 1.2 seconds, but timeoutSeconds is 1 (the default).

Fix: set timeoutSeconds to comfortably exceed the P99 response time of whatever the readiness endpoint checks.
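
To find that P99, time the endpoint from inside the pod (assumes curl is available in the image):

# Print total response time in seconds; compare against timeoutSeconds
kubectl exec <pod> -- curl -s -o /dev/null -w "%{time_total}\n" http://localhost:8080/ready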

Probes and rolling updates

Without readiness probes, Kubernetes marks a pod Ready the moment the container starts. During a rolling update, that means the new pod receives traffic before the application has initialized.

With readiness probes, the rollout controller waits for each new pod to pass its readiness probe before routing traffic to it. The old pod is not terminated until the new one is Ready.

For zero-downtime rolling updates, combine readiness probes with these Deployment settings:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove an old pod until a new one is ready
      maxSurge: 1         # allow one extra pod during the transition
  minReadySeconds: 30       # wait 30 seconds after readiness passes before proceeding

minReadySeconds adds a buffer after readiness succeeds. If the new pod crashes within that window, the rollout pauses instead of tearing down the next old pod.

Complete example: all three probes

A production deployment for a web application with a 90-second worst-case boot time:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  minReadySeconds: 15
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: web-app
          image: registry.internal/web-app:4.2.1
          ports:
            - containerPort: 8080
          startupProbe:
            httpGet:
              path: /healthz
              port: 8080
            failureThreshold: 10   # 10 x 10 = 100 seconds max startup window
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 6    # restart after 60 seconds of failure
            timeoutSeconds: 2
            terminationGracePeriodSeconds: 10  # fast restart on deadlock (K8s 1.25+)
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 3    # removed from traffic after 15 seconds
            timeoutSeconds: 3      # generous for dependency checks in /ready

Verify your probes are working

After deploying, confirm probe behavior:

# Check for probe-related events
kubectl describe pod <pod-name> | grep -A 5 "Events:"

# Watch pod status transitions
kubectl get pods -w

# Test the endpoint manually from inside the pod
kubectl exec <pod-name> -- curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/healthz

Healthy output from kubectl describe pod shows no Unhealthy events under the events section. If you see Liveness probe failed or Readiness probe failed events, check the probe's path, port, and timing parameters against the application's actual behavior.
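
It also helps to confirm that the probe configuration running in the cluster matches what you deployed (web-app is the Deployment from the example above):

# Print the liveness probe as applied to the live Deployment
kubectl get deployment web-app -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'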

When to escalate

If probes keep failing after verifying the configuration is correct, collect the following before escalating:

  • Output of kubectl describe pod <pod-name> (full, not truncated)
  • Application logs: kubectl logs <pod-name> --previous (for the crashed container)
  • Node resource usage: kubectl top node and kubectl top pod
  • The exact probe configuration from the Deployment spec
  • Kubernetes version: kubectl version
  • Whether the failure is consistent or intermittent

