The pod termination lifecycle
When a pod delete request arrives (rolling update, scale-in, node eviction, kubectl delete), Kubernetes runs a specific sequence of events. The critical detail: two parallel tracks start simultaneously, and their interaction is the root cause of most shutdown-related errors.
Track A (kubelet):
- The API server sets deletionTimestamp on the pod. The pod enters the Terminating state.
- The kubelet on the pod's node executes the preStop hook (if configured).
- After the preStop hook completes, the kubelet sends SIGTERM to PID 1 in each container.
- If the container does not exit within terminationGracePeriodSeconds, the kubelet sends SIGKILL.
Track B (network):
- The endpoint controller removes the pod from EndpointSlices.
- The API server propagates the change to every kube-proxy instance.
- Each kube-proxy updates its iptables/ipvs rules to stop routing to the pod.
- Ingress controllers refresh their upstream lists.
Both tracks start at the same moment. Neither waits for the other. That parallelism is the problem.
The endpoint removal race condition
Track A and Track B race against each other. If your application shuts down (Track A) before all kube-proxy instances finish updating their routing rules (Track B), requests still land on a pod that is no longer listening. The result: 502 Bad Gateway errors during rolling updates.
In small clusters, endpoint propagation might finish in under a second. In large clusters with 100+ nodes, or with ingress controllers that use polling instead of watches, it can take 10 to 30 seconds. During that window, traffic keeps arriving at a pod that is already shutting down.
Since Kubernetes v1.28, KEP-1669 (ProxyTerminatingEndpoints) is stable: when a Service has no ready endpoints left, kube-proxy falls back to routing to terminating endpoints that are still serving. This reduces black-holed traffic during rolling updates, but it also means Kubernetes may keep sending requests to a pod that is shutting down. Graceful shutdown in your application matters more, not less.
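You can observe this directly by inspecting EndpointSlice conditions while a pod terminates. A rough check (my-service is a placeholder; use your own Service name in the label selector):
# Terminating pods appear in the EndpointSlice with terminating: true; with
# ProxyTerminatingEndpoints they can remain serving: true at the same time.
kubectl get endpointslices -l kubernetes.io/service-name=my-service -o yaml | grep -B2 -A4 'conditions:'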
Step 1: add a preStop hook
The preStop hook runs before SIGTERM is sent. A short sleep inside it gives the endpoint propagation machinery time to finish removing the pod from all routing tables.
For Kubernetes 1.30+ (native sleep action, no shell binary required):
lifecycle:
preStop:
sleep:
seconds: 15 # delay SIGTERM until endpoints are updated
For Kubernetes < 1.30 (requires sleep binary in the container):
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 15"]
The sleep does not drain connections itself. It delays the moment SIGTERM arrives, buying time for kube-proxy and ingress controllers to stop routing new traffic to the pod.
Recommended sleep values:
| Cluster profile | Sleep duration |
|---|---|
| Small cluster (< 50 nodes) | 5 to 10 seconds |
| Medium cluster (50 to 100 nodes) | 10 to 15 seconds |
| Large cluster (100+ nodes) or external load balancers | 15 to 30 seconds |
There is no universal correct value. Measure endpoint propagation latency in your cluster during testing.
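One rough way to measure it (this only captures when the EndpointSlice object itself changes; kube-proxy and ingress controllers add their own lag on top). The service and pod names below are placeholders:
# Terminal 1: watch the EndpointSlice and note when the pod's IP disappears
kubectl get endpointslices -l kubernetes.io/service-name=my-service -w

# Terminal 2: record the time, then delete one pod
date +%T; kubectl delete pod <pod-name> --wait=false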
Step 2: set terminationGracePeriodSeconds
The grace period is a shared budget. It starts counting the moment the pod enters Terminating, and it covers both the preStop hook and the application's own shutdown time. When it expires, the kubelet sends SIGKILL.
The formula:
terminationGracePeriodSeconds >= preStop_duration + app_shutdown_duration + safety_buffer
For a stateless HTTP service with a 15-second preStop sleep, a 20-second drain window, and a 10-second safety buffer, that gives 15 + 20 + 10 = 45 seconds minimum:
spec:
terminationGracePeriodSeconds: 60
containers:
- name: app
image: my-app:v2.4.1
lifecycle:
preStop:
sleep:
seconds: 15
Recommended values by workload type:
| Workload | Grace period | Rationale |
|---|---|---|
| Stateless HTTP microservice | 45 to 60s | preStop sleep + request drain + buffer |
| WebSocket / long-poll service | 60 to 120s | Long-lived connections need time to drain |
| Batch worker / job | 120 to 300s | May be mid-chunk of large work |
| Stateful workload (database) | 60 to 120s | Flush writes, close WAL, replication handoff |
The default is 30 seconds. For most production workloads with a preStop sleep, that default is too low.
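To spot workloads still relying on the default, list the configured value for every Deployment (a <none> in the last column means the 30-second default applies):
kubectl get deployments -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,GRACE_PERIOD:.spec.template.spec.terminationGracePeriodSeconds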
Step 3: handle SIGTERM in your application
Kubernetes sends SIGTERM to PID 1 inside the container. Your application must catch it, stop accepting new connections, finish in-flight requests, close resources, and exit cleanly.
The PID 1 requirement
If your application is not PID 1, it will not receive the signal. This is a common trap with Dockerfiles:
# WRONG: shell form makes /bin/sh PID 1, and the shell does not forward SIGTERM to myapp
CMD myapp --flag
# CORRECT: myapp is PID 1
CMD ["myapp", "--flag"]
If you must use a shell entrypoint script, replace the shell process with exec:
#!/bin/sh
# setup steps here
exec myapp "$@" # exec replaces the shell with myapp (same PID)
For containers where neither option works, use tini as a minimal init process that forwards signals correctly:
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["/usr/bin/tini", "--"]
CMD ["myapp"]
Go
Go's signal.NotifyContext provides an idiomatic way to tie SIGTERM to context cancellation. The standard library's http.Server.Shutdown stops accepting new connections and waits for in-flight requests to finish:
ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
defer stop()
srv := &http.Server{
Addr: ":8080",
Handler: mux,
    // Deliberately no BaseContext tied to the signal context: Shutdown lets
    // in-flight requests finish, and cancelling their contexts on SIGTERM
    // would abort them mid-drain.
}
go func() {
if err := srv.ListenAndServe(); err != http.ErrServerClosed {
log.Fatalf("ListenAndServe: %v", err)
}
}()
<-ctx.Done()
log.Println("SIGTERM received, draining...")
shutdownCtx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
defer cancel()
if err := srv.Shutdown(shutdownCtx); err != nil {
log.Printf("Shutdown error: %v", err)
srv.Close() // force-close if drain takes too long
}
Set the Shutdown() timeout to less than terminationGracePeriodSeconds minus the preStop sleep duration.
Node.js
server.close() stops accepting new connections but does not close idle HTTP keep-alive connections. Load balancers maintain persistent connections to your pods, and those will never close on their own. You must destroy them explicitly:
const server = http.createServer(app);
let isShuttingDown = false;
// Track connections to handle keep-alive sockets
const connections = new Set();
server.on('connection', (socket) => {
connections.add(socket);
socket.on('close', () => connections.delete(socket));
});
function gracefulShutdown(signal) {
  if (isShuttingDown) return;
  isShuttingDown = true;
  console.log(`${signal} received, draining...`);
  // Stop accepting new connections; the callback fires once every socket has closed
  server.close(() => {
    console.log('Server closed');
    process.exit(0);
  });
  // After the drain window, destroy any sockets still open: idle keep-alive
  // connections never close on their own and would block server.close() forever
  setTimeout(() => {
    for (const socket of connections) {
      socket.destroy();
    }
  }, 20000);
  setTimeout(() => process.exit(1), 25000); // hard deadline
}
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
A common PID 1 mistake in Node.js: CMD ["npm", "start"] in the Dockerfile makes npm PID 1 instead of your application. npm does not forward SIGTERM. Use CMD ["node", "server.js"] directly.
Java (Spring Boot)
Spring Boot 2.3+ has built-in graceful shutdown support. In application.yml:
server:
shutdown: graceful
spring:
lifecycle:
timeout-per-shutdown-phase: 30s
management:
endpoint:
health:
probes:
enabled: true
When SIGTERM arrives, Spring stops accepting new requests, waits up to timeout-per-shutdown-phase for in-flight requests to finish, then shuts down the application context. The Actuator readiness endpoint automatically transitions to OUT_OF_SERVICE, which causes Kubernetes to stop routing traffic.
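With probes enabled, Actuator serves Kubernetes-style probe endpoints, so the Deployment's readinessProbe and livenessProbe can point straight at them. The paths below assume the default management port 8080; reach them via kubectl port-forward or from inside the container:
curl -s http://localhost:8080/actuator/health/readiness   # {"status":"UP"}, then {"status":"OUT_OF_SERVICE"} during shutdown
curl -s http://localhost:8080/actuator/health/liveness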
Python
For Flask/WSGI applications, register a SIGTERM handler that sets a shutdown flag:
import signal
is_shutting_down = False
def handle_sigterm(signum, frame):
    # Set the flag so the readiness endpoint below returns 503 and Kubernetes
    # stops routing new traffic. Do not exit here: an immediate sys.exit()
    # would drop in-flight requests before they finish draining.
    global is_shutting_down
    is_shutting_down = True
signal.signal(signal.SIGTERM, handle_sigterm)
@app.route('/healthz/ready')
def readiness():
if is_shutting_down:
return '', 503
return '', 200
The flag only handles readiness; the WSGI server itself is responsible for finishing in-flight requests and stopping the process. FastAPI with uvicorn handles SIGTERM natively via uvicorn's signal handling, and Gunicorn shuts down gracefully on SIGTERM, letting workers finish in-flight requests before they exit. With Gunicorn + uvicorn workers, verify that SIGTERM actually propagates from the Gunicorn master to the worker processes in your specific setup.
Python's signal handlers run only in the main thread. If your app uses multiprocessing or alternative async frameworks (gevent, trio), test signal propagation separately.
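Whatever the language, a simple way to confirm the handler actually runs in the cluster is to delete a single pod and watch its logs for the shutdown message (the pod name is a placeholder):
# Terminal 1: stream the pod's logs
kubectl logs -f <pod-name>

# Terminal 2: delete the pod and watch Terminal 1 for the drain/shutdown log line
kubectl delete pod <pod-name>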
Step 4: verify the result
After configuring preStop, grace period, and signal handling, you must test under load. The race condition only manifests when real traffic is in-flight during a pod restart.
Run a load test and a rolling restart simultaneously:
# Terminal 1: sustained load
hey -z 60s -c 10 http://my-service.default.svc.cluster.local/
# Terminal 2: trigger rolling restart while load runs
kubectl rollout restart deployment/my-app
Expected result: zero non-2xx responses in the load test output. If you see 502 or connection-refused errors, increase the preStop sleep or verify that your application is handling SIGTERM correctly.
To check preStop hook execution:
kubectl describe pod <pod-name>
# Look for "Normal Killing" and "Warning FailedPreStopHook" in events
To verify endpoint removal timing:
kubectl get endpointslices -w # watch endpoint updates during a restart
Complete configuration
Putting it all together. This Deployment configuration handles the endpoint race condition, gives the application time to drain, and ensures the grace period covers the full shutdown window:
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 0 # never remove a pod without a ready replacement
maxSurge: 1
template:
spec:
terminationGracePeriodSeconds: 60
containers:
- name: app
image: my-app:v2.4.1
ports:
- containerPort: 8080
lifecycle:
preStop:
sleep:
seconds: 15 # wait for endpoint propagation (K8s 1.30+)
readinessProbe:
httpGet:
path: /healthz/ready
port: 8080
periodSeconds: 5
failureThreshold: 2
livenessProbe:
httpGet:
path: /healthz/live
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 3
The time budget for this configuration:
t=0s Pod marked Terminating; endpoint removal starts (Track B)
t=0s preStop sleep begins (Track A)
t=15s preStop completes; SIGTERM sent to application
t=15s Application stops accepting connections, starts draining
t=40s Application exits (25s drain window)
t=60s SIGKILL would fire (never reached if shutdown succeeds)
Common problems
preStop hook fails silently. Hooks that depend on a binary not present in the image (like sleep on distroless images) fail with a FailedPreStopHook event. Check kubectl describe pod for this warning. On Kubernetes 1.30+, use the native sleep: action instead of exec.
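To scan a whole namespace for this instead of describing pods one by one, filter events by reason:
kubectl get events --field-selector reason=FailedPreStopHook --sort-by=.lastTimestamp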
Application does not receive SIGTERM. Almost always a PID 1 problem. Run kubectl exec <pod> -- ps aux and verify your application process is PID 1. If it is not, fix the Dockerfile or add tini.
Grace period too short. If terminationGracePeriodSeconds is less than the preStop duration plus the application's drain time, the kubelet sends SIGKILL before the application finishes. Exit code 137 in pod status confirms this.
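One way to pull just the exit code and reason out of the pod description, while the pod object is still visible:
kubectl describe pod <pod-name> | grep -E 'Exit Code|Reason'
# Exit Code 137 means the kubelet sent SIGKILL; 143 or 0 means the app exited on SIGTERM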
Nginx requires SIGQUIT. Nginx's default SIGTERM handler triggers a fast shutdown that drops connections. For graceful shutdown, send SIGQUIT via a preStop hook: command: ["/usr/sbin/nginx", "-s", "quit"].
Service mesh sidecar exits first. With Istio, both your application and the Envoy sidecar receive SIGTERM simultaneously. If Envoy exits first, outbound calls from your application fail during drain. Set EXIT_ON_ZERO_ACTIVE_CONNECTIONS=true on the sidecar to make Envoy wait for active connections to close.
When to escalate
If you still see 502 errors during deployments after implementing the configuration above, collect the following before asking for help:
- Kubernetes version (kubectl version)
- Cluster size (number of nodes)
- Ingress controller type and version
- kubectl describe pod output from a terminated pod showing events
- kubectl get endpointslices -w output captured during a rolling restart
- Load test results showing the error rate and timing
- Whether you run a service mesh and its version
- Pod spec showing terminationGracePeriodSeconds, preStop hook, and probe configuration
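Most of that can be captured in one pass. The resource names and output files below are placeholders for your own:
kubectl version > debug-info.txt
kubectl get nodes --no-headers | wc -l >> debug-info.txt        # cluster size
kubectl describe pod <terminated-pod-name> > pod-events.txt     # termination events
kubectl get deployment my-app -o yaml > deployment.yaml         # grace period, preStop, probes
# capture EndpointSlice changes here while triggering a rolling restart in another terminal
kubectl get endpointslices -l kubernetes.io/service-name=my-service -w | tee endpointslices.log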