Kubernetes node drain and cordon: safe maintenance without downtime

A safe node-maintenance procedure for Kubernetes uses two commands: kubectl cordon to stop new pods landing on the node, then kubectl drain to evict the existing ones through the Eviction API. This guide walks through the full cordon-drain-maintain-uncordon flow, the flags every drain needs (--ignore-daemonsets, --delete-emptydir-data, --force, --disable-eviction), and how managed Kubernetes services on AWS, Azure and GCP differ in their drain timeouts.

Goal

A node is removed from scheduling, every workload on it is gracefully migrated to other nodes, the maintenance task (kernel patch, disk replacement, kubelet upgrade, decommission) runs on an empty node, and the node either rejoins the cluster or is replaced. Throughout the procedure no end user notices, because PodDisruptionBudgets and graceful termination keep the application's serving capacity intact.

Prerequisites

  • kubectl connected to a Kubernetes 1.27 or newer cluster (the Eviction API and the policy/v1 PDB API are stable across all currently supported releases)
  • The node name you intend to drain (kubectl get nodes)
  • Cluster permissions to evict pods (pods/eviction create) and patch nodes (nodes patch)
  • For workloads worth protecting: a PodDisruptionBudget on every Deployment or StatefulSet you care about (a minimal example follows this list). Without a PDB, drain is free to evict every replica at once.
  • Enough spare capacity in the rest of the cluster to host the evicted pods. If the cluster is full, evicted pods will sit in Pending and the application loses capacity.
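
If you have never written one, a minimal PDB might look like the sketch below. The name, namespace, selector label, and minAvailable value are all illustrative; adjust them to the workload you are protecting.

# A minimal PDB: keep at least 2 web-api replicas up during voluntary
# disruptions such as drain (names and values are illustrative)
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
  namespace: default
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web-api
EOF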

What cordon and drain actually do (and what they do not do)

The two commands look similar and are constantly conflated, but their scope is very different. Knowing the difference is half the maintenance procedure.

kubectl cordon <node> flips a single field on the node object: .spec.unschedulable: true. Behind the scenes the node controller adds the node.kubernetes.io/unschedulable:NoSchedule taint. New pods stop being scheduled there. Pods already on the node keep running. Nothing is evicted, nothing is restarted, nothing changes for the existing workload.
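
You can verify both effects directly on the node object; here worker-3 stands in for the node, matching the workflow example below:

# The only field cordon changes (prints "true" once cordoned)
kubectl get node worker-3 -o jsonpath='{.spec.unschedulable}{"\n"}'
# The taint the node controller adds as a result
kubectl describe node worker-3 | grep -A 2 Taints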

kubectl drain <node> does two things: it cordons the node first, then it iterates over every pod on the node and submits one Eviction API request per pod. Each eviction request goes through the same admission path a regular pod deletion would: PodDisruptionBudgets are evaluated, terminationGracePeriodSeconds is honored, lifecycle preStop hooks fire. If the PDB rejects an eviction, the API returns HTTP 429 and kubectl drain retries until it succeeds or the --timeout expires.
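
For illustration, this is roughly the request drain submits once per pod. The pod name and namespace are placeholders, and kubectl create --raw is just one way to send it by hand:

# Submit a single eviction yourself, the same call drain makes per pod
kubectl create --raw /api/v1/namespaces/default/pods/web-api-7c5fbf6dd5-kf9lk/eviction -f - <<'EOF'
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": {"name": "web-api-7c5fbf6dd5-kf9lk", "namespace": "default"}
}
EOF

If the pod's PDB allows the disruption, the request succeeds and the pod begins its graceful shutdown; if not, it comes back with 429 and the pod is untouched.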

Three things drain explicitly does not do:

  • It does not delete the node from the cluster. After drain finishes, the Node object is still there, just Unschedulable: true. To bring it back, run kubectl uncordon. To remove it permanently, run kubectl delete node separately.
  • It does not stop the kubelet, reboot the host, or touch the underlying VM. That is your job; drain only clears the workload off it.
  • It does not evict DaemonSet pods. They are owned by a controller that re-creates them on every node, so evicting them is pointless. By default this means drain refuses to proceed; you pass --ignore-daemonsets to acknowledge it.

A common misconception in incident reviews is "the team ran kubectl delete node, so the workload moved". It did not move; the pods were forcibly orphaned and recreated as new pods elsewhere, with no graceful shutdown, no preStop hook, and no respect for PDBs. kubectl drain is the only way to migrate workload off a node safely.

The maintenance workflow: cordon, drain, maintain, uncordon

The four-step pattern that works for kernel upgrades, kubelet upgrades, disk swaps, and node replacement:

# 1. Stop new pods from landing on the node (instant, non-disruptive)
kubectl cordon worker-3

# 2. Evict existing pods through the Eviction API
kubectl drain worker-3 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=-1 \
  --timeout=15m

# 3. Run the maintenance task on the now-empty node
#    (kernel patch, kubelet upgrade, disk swap, etc.)

# 4. Bring the node back into rotation
kubectl uncordon worker-3

The reason cordon comes first as a separate command, even though drain cordons internally, is to give you a clean review window. After step 1 you can run kubectl get pods -A -o wide --field-selector spec.nodeName=worker-3 to see exactly what is about to be evicted, sanity-check the PDBs, and confirm the cluster has spare capacity, before you commit to the drain.
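
The other two checks from that window can be run like this (the grep pattern assumes the standard kubectl describe output):

# Any PDB already at zero allowed disruptions will block the drain
kubectl get pdb -A -o wide
# Confirm the remaining nodes have headroom for the evicted pods
kubectl describe nodes | grep -A 5 "Allocated resources"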

Verifying step 1. The node now shows SchedulingDisabled next to its Ready status:

$ kubectl get nodes
NAME       STATUS                     ROLES    AGE   VERSION
worker-1   Ready                      <none>   42d   v1.30.4
worker-2   Ready                      <none>   42d   v1.30.4
worker-3   Ready,SchedulingDisabled   <none>   42d   v1.30.4

Verifying step 2. When drain returns, the node has zero pods left except DaemonSet-managed ones (CNI, monitoring agents, log shipper):

$ kubectl get pods -A -o wide --field-selector spec.nodeName=worker-3
NAMESPACE     NAME                  READY   STATUS    RESTARTS   AGE   NODE
kube-system   calico-node-7p9zx     1/1     Running   0          42d   worker-3
kube-system   fluent-bit-x4c2m      1/1     Running   0          42d   worker-3
kube-system   node-exporter-bbm49   1/1     Running   0          42d   worker-3

Verifying step 4. The SchedulingDisabled marker disappears and the scheduler resumes placing pods on the node:

$ kubectl get nodes worker-3
NAME       STATUS   ROLES    AGE   VERSION
worker-3   Ready    <none>   42d   v1.30.4

For a planned node replacement, step 3 becomes "delete the VM and let your node-pool autoscaler or cloud controller create a new one to replace it", and step 4 is unnecessary because the new node joins ready to schedule.
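
In that replacement case, the Kubernetes side of step 3 might look like this; the VM deletion itself depends entirely on your provider, so it is only hinted at:

# Remove the Node object after a clean drain, then delete the underlying VM
# through your cloud provider; the node pool provisions a replacement
kubectl delete node worker-3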

Common failure: DaemonSet pods block drain (--ignore-daemonsets)

The first time most operators run kubectl drain they hit this:

error: unable to drain node "worker-3" due to error: cannot delete DaemonSet-managed Pods (use --ignore-daemonsets to ignore): kube-system/calico-node-7p9zx, kube-system/fluent-bit-x4c2m

DaemonSet pods are designed to run one-per-node. The DaemonSet controller tolerates the node.kubernetes.io/unschedulable:NoSchedule taint, so even if drain managed to evict them, the controller would re-create them immediately. Drain refuses to enter that loop and asks you to explicitly skip them.

Pass --ignore-daemonsets on every drain. There is essentially no scenario in a real cluster where you do not have at least a CNI plugin (Calico, Cilium, Flannel) running as a DaemonSet, so the flag is not optional.

kubectl drain worker-3 --ignore-daemonsets

The DaemonSet pods stay running on the cordoned node throughout maintenance. If you later delete the node, they are garbage-collected along with it; after an uncordon they simply keep running where they are. If maintenance reboots the host, the kubelet restarts them on boot.
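
If you want to see up front which pods on the node are DaemonSet-owned and will therefore stay behind, one way is to print each pod with the kind of its owning controller:

# Rows whose OWNER column reads DaemonSet are the pods drain leaves in place;
# <none> means a bare pod with no controller at all
kubectl get pods -A --field-selector spec.nodeName=worker-3 \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,OWNER:.metadata.ownerReferences[0].kind'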

Common failure: pods using emptyDir block drain (--delete-emptydir-data)

The second time you run drain, this happens:

error: unable to drain node "worker-3" due to error: cannot delete Pods with local storage (use --delete-emptydir-data to override): default/build-cache-pj8vn

emptyDir volumes are pod-scoped scratch space allocated on the node's filesystem. When the pod terminates, the emptyDir is deleted with it. Drain refuses by default because there is a real chance you do not realize the pod has data there. Build caches, sidecar buffers, and tmpfs working directories all live in emptyDir. Evicting the pod throws that data away.
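
A quick way to audit which pods on the node actually mount an emptyDir before deciding, assuming jq is available:

# Pods on the node that use at least one emptyDir volume
kubectl get pods -A --field-selector spec.nodeName=worker-3 -o json \
  | jq -r '.items[]
           | select(any(.spec.volumes[]?; .emptyDir != null))
           | .metadata.namespace + "/" + .metadata.name'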

If you have audited the workload and the data is safe to discard (which is true for the vast majority of emptyDir uses), pass the flag:

kubectl drain worker-3 --ignore-daemonsets --delete-emptydir-data

Naming note. This flag was called --delete-local-data until Kubernetes 1.20, when pull request #95076 renamed it to --delete-emptydir-data because the old name made it sound like the flag also affected hostPath and persistent volumes (it does not; only emptyDir). Older runbooks and Stack Overflow answers still reference the old name. The compatibility shim was eventually removed; on any cluster you actually run today, --delete-emptydir-data is the correct spelling.

If a workload genuinely needs to keep emptyDir data across maintenance, the design is wrong: emptyDir is by definition ephemeral. Move it to a PersistentVolumeClaim or a hostPath mount tied to the node's lifecycle.

Common failure: pods without a controller block drain (--force)

error: unable to drain node "worker-3" due to error: cannot delete Pods that declare no controller (use --force to override): default/debug-shell

A pod created directly (kubectl run, kubectl apply -f pod.yaml) without a Deployment, StatefulSet, ReplicaSet, Job, or DaemonSet on top of it is an "orphan" pod. Nothing recreates it once it is gone. Drain refuses by default because evicting it equals losing it permanently, and most operators do not actually want that.
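
To check up front whether any such bare pods exist on the node (again assuming jq), list the pods that have no owner at all; anything this prints is what --force would discard permanently:

# Pods on the node with no owning controller
kubectl get pods -A --field-selector spec.nodeName=worker-3 -o json \
  | jq -r '.items[]
           | select(.metadata.ownerReferences == null)
           | .metadata.namespace + "/" + .metadata.name'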

The legitimate cases where you do want it:

  • Debug pods you spawned with kubectl run for diagnostics.
  • One-off pods applied directly from a manifest during experimentation and never cleaned up.
  • Pods with a controller you intentionally deleted (orphan-deletion sequences).

In those cases, pass --force:

kubectl drain worker-3 --ignore-daemonsets --delete-emptydir-data --force

If the orphan pod is something you actually need, the right fix is to convert it to a Deployment first, not to add --force. Drain is telling you exactly what is fragile in your cluster.

Common failure: PDB blocks eviction (--disable-eviction last resort)

When a PodDisruptionBudget does not allow the eviction, the Eviction API returns HTTP 429 (Too Many Requests). kubectl drain keeps retrying:

evicting pod default/web-api-7c5fbf6dd5-kf9lk
error when evicting pods/"web-api-7c5fbf6dd5-kf9lk" -n "default" (will retry after 5s):
  Cannot evict pod as it would violate the pod's disruption budget.

This is the system working as intended. The PDB is telling drain "wait, evicting this pod right now would take the application below minAvailable". The right response is to wait. The Deployment controller will start replacement pods on other nodes, the PDB's currentHealthy will rise, and the next retry will succeed.
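
While you wait, you can watch the budget recover as replacement pods come up elsewhere; the PDB name and namespace here are illustrative:

# ALLOWED DISRUPTIONS climbs back above 0 once enough replicas are Ready
kubectl get pdb web-api-pdb -n default -w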

A PDB-blocked drain becomes a real problem when:

  • The Deployment has only one replica and a minAvailable: 1 PDB. Eviction can never succeed; the application is fundamentally not maintenance-safe.
  • A pod is stuck in CrashLoopBackOff and the PDB still counts it. (Kubernetes 1.26 introduced unhealthyPodEvictionPolicy: AlwaysAllow to fix this; turn it on per PDB, as in the example after this list.)
  • Two PDBs select the same pod, which produces an HTTP 500 from the Eviction API, not a 429.
  • Cluster autoscaling is not running, so no replacement nodes exist for evicted pods to land on.
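
For the CrashLoopBackOff case, a PDB that opts into the 1.26+ policy might look like this sketch; name, namespace, selector, and minAvailable are illustrative:

kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api-pdb
  namespace: default
spec:
  minAvailable: 1
  # Pods that are Running but not Ready may be evicted even when the budget
  # is exhausted, so a crash-looping replica cannot hang a drain forever
  unhealthyPodEvictionPolicy: AlwaysAllow
  selector:
    matchLabels:
      app: web-api
EOF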

The escape hatch, used as a last resort:

kubectl drain worker-3 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --disable-eviction

--disable-eviction makes drain bypass the Eviction API and call DELETE on each pod directly. PDBs are not consulted. This is a controlled outage: every replica matched by the drain on this node disappears at once. Use it only when:

  1. You have already diagnosed the PDB block.
  2. You have a maintenance window the application can tolerate.
  3. You have explicit business approval for the disruption.

If you reach for --disable-eviction repeatedly, the underlying issue is that workloads are not configured for safe maintenance. Fix the PodDisruptionBudgets and replica counts; do not normalize bypassing them.

Monitoring drain progress

kubectl drain is verbose and prints one line per pod. On a node with 50 pods that becomes hard to read. Useful patterns:

Watch the pod list shrink in another terminal:

watch -n 2 'kubectl get pods --field-selector spec.nodeName=worker-3 -A --no-headers | wc -l'

Watch eviction events:

kubectl get events --watch --field-selector reason=Evicted

Track which PDBs are at zero allowed disruptions (if drain hangs):

kubectl get pdb --all-namespaces -o wide | awk '$5==0'

If kubectl drain appears to do nothing for several minutes, the most common cause is exactly that: a PDB at zero disruptions allowed. Identify it, decide whether to wait for replacement pods to come up, or invoke the escape hatch.

Cloud-managed nodes: how node pool upgrades differ (GKE, EKS, AKS)

When a managed Kubernetes service rolls a node pool upgrade, it runs the same cordon/drain dance under the hood, but with vendor-specific timeouts and behavior. Knowing these numbers stops surprises during upgrades.

Service        Default drain timeout   Configurable            Behavior on timeout
GKE (Google)   1 hour                  No                      Forcefully evicts remaining pods, upgrade continues
EKS (AWS)      15 minutes              No                      PodEvictionFailure, upgrade fails unless --force was passed
AKS (Azure)    30 minutes              Yes (--drain-timeout)   Default fails the upgrade; Cordon mode quarantines the node

The implications for cluster operators:

  • GKE never fails an upgrade because of a stuck PDB. After an hour, your minAvailable was a suggestion. If you depend on PDB enforcement, that surprise is real.
  • EKS is the strictest. A PDB that briefly hits zero disruptions allowed is enough to fail a managed node group rolling update. Either pass --force (which is the AWS equivalent of --disable-eviction) or fix the PDB.
  • AKS is the most flexible: you can extend the drain timeout to several hours, and the --undrainable-node-behavior Cordon option keeps the upgrade moving by quarantining stuck nodes for manual handling later.

For all three, before initiating an upgrade, scan for PDBs that will block:

kubectl get pdb --all-namespaces -o wide | awk 'NR==1 || $5==0'

Any row with ALLOWED DISRUPTIONS = 0 is a near-certain upgrade failure on EKS, a delayed-by-an-hour upgrade on GKE, and on AKS either a failed upgrade or a quarantined node, depending on the --undrainable-node-behavior setting.

Complete drain command (the form you can copy)

The drain command suitable for the vast majority of production maintenance, in one place:

kubectl drain worker-3 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=-1 \
  --timeout=15m

Each flag and why:

  • --ignore-daemonsets: every cluster has CNI and monitoring DaemonSets; without this drain refuses to start.
  • --delete-emptydir-data: any pod with a build cache or scratch volume blocks drain otherwise; you have to make a positive decision that ephemeral data is acceptable to lose.
  • --grace-period=-1: honor each pod's own terminationGracePeriodSeconds. Setting a positive value here overrides the pod's grace period, which can truncate clean shutdowns. -1 (the default) is almost always right.
  • --timeout=15m: gives slow pods enough time to terminate cleanly, but stops drain from hanging forever on a misconfigured PDB. Tune up or down based on the typical termination cost of your workloads.

Add --force if (and only if) you know there are uncontrolled pods on the node and you accept losing them. Add --disable-eviction if (and only if) PDBs are blocking and you have approval for the disruption.

For a zero-downtime rolling deployment of an entire node pool, drain one node at a time, wait for the evicted pods to become Ready elsewhere, then move to the next.
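
A minimal sketch of that loop, assuming the pool's nodes carry a node-pool=workers label and using the flag choices above; the label selector, timeouts, and the rollout check at the end are assumptions to adapt:

# Roll through a node pool one node at a time
for node in $(kubectl get nodes -l node-pool=workers -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=15m
  # ... run the maintenance task on this node here ...
  kubectl uncordon "$node"
  kubectl wait --for=condition=Ready "$node" --timeout=5m
  # Illustrative health gate: confirm a key workload is back at full strength
  kubectl rollout status deployment/web-api -n default --timeout=5m
done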

When to escalate

Drain is hung past your timeout and you are not sure why. Collect this before asking for help:

  • kubectl version (client and server)
  • kubectl get nodes -o wide (so the node states are visible)
  • kubectl get pods --field-selector spec.nodeName=<node> -A -o wide (what is still on the node)
  • kubectl get pdb --all-namespaces -o wide (PDB state across the cluster)
  • kubectl describe pdb <pdb> for any PDB at zero disruptions, including events
  • The exact kubectl drain command and flags used
  • Cloud provider and managed service tier (GKE, EKS, AKS, self-managed)
  • Whether the node is reachable from the control plane (kubectl describe node <name> for the most recent heartbeat)
  • Whether the cluster has spare capacity for evicted pods (kubectl describe nodes | grep -A 5 Allocated)

If the node itself is in NotReady, drain is not the right tool; recovering the node or replacing it is.

How to prevent recurrence

  • Put a PodDisruptionBudget on every Deployment and StatefulSet. Never put a minAvailable: 1 PDB on a single-replica production workload; either run two or more replicas or accept the disruption.
  • Set unhealthyPodEvictionPolicy: AlwaysAllow on PDBs for stateless services so a CrashLoopBackOff does not hang drain forever.
  • Set realistic terminationGracePeriodSeconds on your pods. The default 30 seconds is fine for stateless HTTP services; databases, queue workers, and long-running jobs need more.
  • Run drains as part of your routine maintenance rotation, not only during incidents. The first drain of an unfamiliar workload should never be at 03:00 during a kernel CVE response.
  • For autoscaling clusters, audit your maxUnavailable and maxSurge settings against your PDBs. The combination of "PDB allows 1 disruption" and "node pool surges 5 nodes at once" leads to predictable upgrade hangs.
  • Monitor kube_poddisruptionbudget_status_current_healthy versus kube_poddisruptionbudget_status_desired_healthy from kube-state-metrics, and alert when any PDB sits at zero disruptions allowed for more than a few minutes.

