What each probe type does and when it fires
Three probes, three different consequences on failure. Getting this wrong is the root cause of most probe-related incidents.
Liveness probe. Runs continuously at periodSeconds intervals. If the probe fails failureThreshold consecutive times, the kubelet restarts the container (not the pod; the pod object stays). Use it to recover from deadlocks or stuck processes that will never self-heal.
Readiness probe. Also runs continuously, not just at startup. On failure, the kubelet removes the pod from all matching Service endpoints. The container keeps running. Traffic stops arriving. Once the probe passes again, the pod is re-added. Use it to signal that a container is temporarily unable to serve requests.
Startup probe. Runs at container start, repeating every periodSeconds until it first succeeds. Blocks liveness and readiness probes until then. If it fails failureThreshold consecutive times, the kubelet restarts the container. Once it passes, it never runs again. Use it for containers with slow or variable boot times.
| Probe | Failure action | Runs when | Blocks other probes? |
|---|---|---|---|
| Startup | Container restart (after threshold) | At startup, until first success | Yes |
| Liveness | Container restart (after threshold) | Continuously | No |
| Readiness | Pod removed from endpoints | Continuously | No |
A common misconception: readiness probes are not a startup-only mechanism. They run for the entire lifetime of the pod. A pod that was ready five minutes ago can become not-ready at any time if the probe starts failing.
The four probe mechanisms
Each probe type supports four mechanisms. Pick the one that matches your service's interface.
HTTP probe (httpGet)
The kubelet sends an HTTP GET request. Status codes 200 through 399 count as success. Anything else is a failure.
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:        # optional; kubelet sends User-Agent: kube-probe/<version> by default
    - name: X-Custom
      value: probe
  periodSeconds: 10
  failureThreshold: 3
```
Best for: any service exposing HTTP. The most common choice.
TCP probe (tcpSocket)
The kubelet attempts a TCP connection. Success means the connection was established. It does not verify the application is actually processing requests.
```yaml
readinessProbe:
  tcpSocket:
    port: 5432
  periodSeconds: 10
  failureThreshold: 3
```
Best for: databases, message brokers, and other TCP services that do not expose HTTP.
Exec probe (exec)
The kubelet runs a command inside the container. Exit code 0 is success; anything else is failure.
```yaml
livenessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  periodSeconds: 10
  failureThreshold: 3
```
Known issue: exec probes spawn a child process for every execution. If your container's PID 1 is not an init system, those children become zombie processes. At low periodSeconds with many pods, this exhausts the PID space on the node. Use tini or dumb-init as PID 1 if you rely on exec probes.
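One way to get a zombie-reaping PID 1 is to wrap the entrypoint in tini. A minimal sketch, assuming an Alpine-based image and a hypothetical `/app/server` binary:

```dockerfile
FROM alpine:3.19
RUN apk add --no-cache tini
COPY server /app/server
# tini runs as PID 1 and reaps the child processes that exec probes leave behind
ENTRYPOINT ["/sbin/tini", "--"]
CMD ["/app/server"]
```

dumb-init works the same way; the point is simply that something in the container must call wait() on orphaned children.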
Best for: custom health logic that cannot be expressed as HTTP or TCP. Prefer HTTP or TCP when possible.
gRPC probe (grpc)
The kubelet calls the gRPC Health Checking Protocol (grpc.health.v1.Health/Check). GA since Kubernetes 1.27; no feature gate required.
```yaml
readinessProbe:
  grpc:
    port: 50051   # must be numeric; named ports are not supported
  periodSeconds: 10
  failureThreshold: 3
```
Limitations: no client certificate support, no TLS certificate validation, no service name chaining.
Best for: gRPC services that implement the standard health checking protocol.
Timing parameters
Six parameters control how fast probes fire, how long they wait, and how many failures trigger action.
| Parameter | Default | What it controls |
|---|---|---|
| initialDelaySeconds | 0 | Seconds before the first probe fires |
| periodSeconds | 10 | Seconds between probe executions |
| timeoutSeconds | 1 | Seconds the kubelet waits for a response |
| successThreshold | 1 | Consecutive successes to mark healthy |
| failureThreshold | 3 | Consecutive failures before action |
| terminationGracePeriodSeconds | Inherits pod-level | Override for probe-triggered restarts (1.25+) |
Default values are documented in the official Kubernetes probe configuration reference.
Two constraints to know:
- successThreshold for liveness and startup probes must be 1. The API rejects any other value.
- terminationGracePeriodSeconds at the probe level (available since Kubernetes 1.25) overrides the pod-level value. This is useful when a deadlock recovery should restart fast but a normal shutdown needs a long drain window.
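Putting the parameters together, a liveness probe that tolerates brief hiccups but restarts a deadlocked container within about a minute might look like this (the values are illustrative, not a recommendation):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 0       # start probing immediately (the default)
  periodSeconds: 10            # probe every 10 seconds
  timeoutSeconds: 2            # fail a single probe after 2 seconds without a response
  failureThreshold: 6          # restart after ~60 seconds of consecutive failures
  terminationGracePeriodSeconds: 10   # probe-level override, K8s 1.25+
```

successThreshold is omitted because for liveness it must be 1, which is already the default.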
Configure startup probes for slow-booting applications
Before the startup probe existed, the only option for slow-starting containers was a large initialDelaySeconds on the liveness probe. That creates a fixed wait even on fast startups and does not adapt to variable boot times.
The startup probe solves this. It gates liveness and readiness until the application explicitly signals that it has finished initializing. The formula:
failureThreshold x periodSeconds >= worst-case startup time
A Java application that takes up to 3 minutes to start:
```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 18   # 18 x 10 = 180 seconds = 3 minutes
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```
Once the startup probe succeeds, it never runs again. The liveness probe takes over from that point.
Design your health endpoints
The single most important rule: never check external dependencies in a liveness probe. If your database goes down and every pod's liveness probe checks database connectivity, every pod restarts simultaneously. The restart storm compounds the outage instead of recovering from it.
Recommended endpoint pattern
Separate your liveness and readiness endpoints:
/healthz (liveness): returns 200 if the HTTP loop is alive. Checks nothing external. Think of it as "is this process stuck?" If the answer is no, return 200.
```go
// Liveness: prove the process can respond
http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})
```
/ready (readiness): returns 200 if the application has finished initialization and critical dependencies are reachable. Returns 503 during startup, cache warm-up, or dependency unavailability.
```go
// Readiness: verify the app can serve real requests
http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
	// sql.DB.Ping returns an error, not a bool
	if !appReady || dbPool.Ping() != nil {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
})
```
Even for readiness, be careful with dependency checks. Set timeoutSeconds above the P99 response time of the dependency and raise failureThreshold above the default 3. A latency spike on a shared database should not simultaneously remove every pod from every Service.
Common misconfiguration patterns
Liveness probe kills slow-starting containers
Symptom: pod enters CrashLoopBackOff immediately after deployment. kubectl describe pod shows Liveness probe failed events before the application finishes booting.
Fix: add a startup probe with failureThreshold x periodSeconds covering the worst-case boot time. Remove any large initialDelaySeconds from the liveness probe.
Identical liveness and readiness configuration
Symptom: under load, pods are simultaneously removed from endpoints (readiness) and restarted (liveness). Active connections drop without graceful shutdown.
Fix: give liveness a higher failureThreshold than readiness. Readiness should be the first line of defense (stop traffic), liveness the last resort (restart).
```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3    # removed from traffic after 15 seconds
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 6    # restarted after 60 seconds, not 15
```
External dependency in liveness probe
Symptom: a transient database outage causes all pods to restart at once. Recovery takes minutes instead of seconds.
Fix: move the dependency check to the readiness probe. Keep the liveness endpoint internal-only.
Wrong path or port
Symptom: probe returns 404 or connection refused from the first attempt. Pod restarts before it ever serves traffic.
Diagnosis:
```shell
kubectl exec <pod> -- curl -v http://localhost:8080/healthz
```
Fix: match the probe path and port to what the application actually binds. Check your Dockerfile EXPOSE directive and your application's listen configuration. Note that Kubernetes ignores the Docker HEALTHCHECK directive entirely; it does not substitute for probe configuration.
Readiness timeout shorter than dependency latency
Symptom: all pods go not-ready during load spikes even though the application itself is functional. The readiness endpoint queries a dependency that responds in 1.2 seconds, but timeoutSeconds is 1 (the default).
Fix: set timeoutSeconds to comfortably exceed the P99 response time of whatever the readiness endpoint checks.
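For the 1.2-second dependency described above, a readiness probe sized with headroom might look like this (values are illustrative):

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3      # comfortably above the dependency's ~1.2s P99
  failureThreshold: 5    # ride out a brief latency spike before going not-ready
```

The goal is that a single slow dependency response costs one failed probe, not a cluster-wide removal from endpoints.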
Probes and rolling updates
Without readiness probes, Kubernetes marks a pod Ready the moment the container starts. During a rolling update, that means the new pod receives traffic before the application has initialized.
With readiness probes, the rollout controller waits for each new pod to pass its readiness probe before routing traffic to it. The old pod is not terminated until the new one is Ready.
For zero-downtime rolling updates, combine readiness probes with these Deployment settings:
```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never remove an old pod until a new one is ready
      maxSurge: 1         # allow one extra pod during the transition
  minReadySeconds: 30     # wait 30 seconds after readiness passes before proceeding
```
minReadySeconds adds a buffer after readiness succeeds. If the new pod crashes within that window, the rollout pauses instead of tearing down the next old pod.
Complete example: all three probes
A production deployment for a web application with a 90-second worst-case boot time:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  minReadySeconds: 15
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: web-app
        image: registry.internal/web-app:4.2.1
        ports:
        - containerPort: 8080
        startupProbe:
          httpGet:
            path: /healthz
            port: 8080
          failureThreshold: 10   # 10 x 10 = 100 seconds max startup window
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 10
          failureThreshold: 6    # restart after 60 seconds of failure
          timeoutSeconds: 2
          terminationGracePeriodSeconds: 10   # fast restart on deadlock (K8s 1.25+)
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          failureThreshold: 3    # removed from traffic after 15 seconds
          timeoutSeconds: 3      # generous for dependency checks in /ready
```
Verify your probes are working
After deploying, confirm probe behavior:
```shell
# Check for probe-related events
kubectl describe pod <pod-name> | grep -A 5 "Events:"

# Watch pod status transitions
kubectl get pods -w

# Test the endpoint manually from inside the pod
kubectl exec <pod-name> -- curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/healthz
```
Healthy output from kubectl describe pod shows no Unhealthy events under the events section. If you see Liveness probe failed or Readiness probe failed events, check the probe's path, port, and timing parameters against the application's actual behavior.
When to escalate
If probes keep failing after verifying the configuration is correct, collect the following before escalating:
- Output of kubectl describe pod <pod-name> (full, not truncated)
- Application logs: kubectl logs <pod-name> --previous (for the crashed container)
- Node resource usage: kubectl top node and kubectl top pod
- The exact probe configuration from the Deployment spec
- Kubernetes version: kubectl version
- Whether the failure is consistent or intermittent