What an Evicted pod looks like
You ran kubectl get pods and saw rows like this:
NAME READY STATUS RESTARTS AGE
api-7d4f9b8c6-xvk2p 0/1 Evicted 0 12m
api-7d4f9b8c6-r9k8t 0/1 Evicted 0 11m
api-7d4f9b8c6-2m4qn 1/1 Running 0 3m
Evicted is not a real pod phase. It is the value the kubelet stamps onto status.reason when it terminates the pod for node-pressure reasons. The actual status.phase on the leftover pod object is Failed. The pod's containers are gone. Only the API object remains, taking up a slot in any count/pods object-count quota until something deletes it.
If a workload controller (Deployment, ReplicaSet, StatefulSet) owns the pod, a replacement is scheduled almost immediately. The Evicted object is a tombstone, not a running pod. Bare pods (no controller) are not replaced.
Eviction is not OOMKilled and not preemption
These three failure modes look similar in dashboards and produce overlapping symptoms, but they are different events with different fixes.
| Mode | Who acts | Trigger | Pod result |
|---|---|---|---|
| Container OOMKilled | Linux kernel cgroup OOM killer | Container exceeds its resources.limits.memory | One container exits with code 137; the pod stays, and the kubelet restarts the container per restartPolicy |
| Node-pressure eviction | Kubelet eviction manager | Node-wide signal crosses an eviction threshold | Whole pod terminated; status.phase=Failed, status.reason=Evicted |
| Preemption | kube-scheduler | A pending higher-priority pod needs room | Lower-priority pod evicted to make room; rescheduled per its controller |
Three rules of thumb. The kernel kills one container; the kubelet kills the whole pod. OOMKilled fires when one container crosses its own cgroup limit; eviction fires when the node as a whole is under pressure. Preemption is a scheduler decision, not a kubelet decision, and is independent of resource pressure on the node.
Reading the eviction reason in kubectl describe pod
Start here, every time:
kubectl describe pod api-7d4f9b8c6-xvk2p -n production
Two parts of the output matter. Status:
Status: Failed
Reason: Evicted
Message: The node was low on resource: memory. Threshold quantity: 100Mi,
available: 87Mi. Container api was using 312Mi, request is 256Mi,
has larger consumption of memory.
The Message line names the eviction signal that fired (memory, ephemeral-storage, pid, etc.) and the threshold quantity vs. the available quantity at eviction time. It also tells you which container the kubelet flagged as the worst offender relative to its request, which is the central input to the kubelet's pod-selection ranking (see below).
Then the events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Evicted 12m kubelet The node was low on resource: memory.
Normal Killing 12m kubelet Stopping container api
Event reason Evicted from kubelet confirms it. If you see Preempted from default-scheduler instead, you are looking at scheduler preemption, not node-pressure eviction; the cause is a higher-priority pending pod, not node resource pressure.
To find every Evicted pod cluster-wide:
kubectl get pods --all-namespaces --field-selector=status.phase=Failed -o json \
| jq -r '.items[] | select(.status.reason=="Evicted") | "\(.metadata.namespace)/\(.metadata.name)"'
Or to count them per node, which often points straight at the misbehaving node:
kubectl get pods -A -o json \
| jq -r '.items[] | select(.status.reason=="Evicted") | .spec.nodeName' \
| sort | uniq -c | sort -rn
A node with dozens of Evicted pods is the node to investigate.
Cleaning up Evicted pods
Evicted pods do not delete themselves. The control plane's pod garbage collector only removes terminated pods once their cluster-wide count passes --terminated-pod-gc-threshold (12500 by default), so in practice the leftovers accumulate. The kubelet leaves the API object in place so you can inspect why it was evicted. Long-lived Evicted pods produce two real problems.
The first is quota pressure. An Evicted pod is a Failed pod that still exists in API storage, so it counts against any count/pods object-count quota, which charges every pod object that exists regardless of phase. It does not count against the pods quota or against requests.cpu and requests.memory, which only consider pods in a non-terminal state. So an Evicted pod consumes an object-count slot but no CPU or memory budget. In a namespace with count/pods: 100, a hundred Evicted leftovers will block new pod creation even though the cluster has plenty of CPU and memory headroom.
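As a concrete illustration, a ResourceQuota shaped like this (namespace and numbers are made up) caps pod objects by count, and Evicted leftovers consume it until something deletes them:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: object-counts        # illustrative name
  namespace: production      # illustrative namespace
spec:
  hard:
    count/pods: "100"        # charges every pod object in storage, Evicted included
    requests.cpu: "40"       # compute quotas only sum pods in a non-terminal state
    requests.memory: 80Gi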
The second is observability noise. Lists, dashboards, and alerts that filter on status.phase=Failed or count failed pods will trip on the leftovers long after the underlying pressure has cleared.
To delete every Evicted pod in a namespace:
kubectl get pods -n production --field-selector=status.phase=Failed -o json \
| jq -r '.items[] | select(.status.reason=="Evicted") | .metadata.name' \
| xargs -r kubectl delete pod -n production
Or cluster-wide:
kubectl get pods -A --field-selector=status.phase=Failed -o json \
| jq -r '.items[] | select(.status.reason=="Evicted") | "-n \(.metadata.namespace) \(.metadata.name)"' \
| xargs -r -L1 kubectl delete pod
Verification: re-run kubectl get pods -n production --field-selector=status.phase=Failed and confirm the list is empty. The pods quota usage drops correspondingly: kubectl describe resourcequota -n production.
A cleanup loop is a workaround, not a fix. The fix is to stop the underlying pressure (covered below).
Eviction signals and node conditions
The kubelet monitors a fixed set of signals on every node and translates them into node conditions visible in kubectl describe node.
| Signal | What it measures | Node condition | Default hard threshold |
|---|---|---|---|
| memory.available | node.status.capacity[memory] minus the working set | MemoryPressure | < 100Mi |
| nodefs.available | Free space on the kubelet's main filesystem (/var/lib/kubelet, logs, emptyDir) | DiskPressure | < 10% |
| nodefs.inodesFree | Free inodes on nodefs | DiskPressure | < 5% |
| imagefs.available | Free space on the container runtime's image filesystem | DiskPressure | < 15% |
| imagefs.inodesFree | Free inodes on imagefs | DiskPressure | (no documented default) |
| containerfs.available | Free space on the writable-container-layer filesystem (Linux only, when the container filesystem is split onto a separate disk) | DiskPressure | (matches imagefs) |
| pid.available | maxpid minus curproc | PIDPressure | (no documented default) |
memory.available on Linux is computed from cgroup memory accounting, not from free -m. Inactive file-backed pages count as reclaimable, so the number is higher than what free reports as "free." This is intentional: the kubelet's signal reflects what is actually reclaimable under pressure.
Hard thresholds fire immediately, with no grace period. Soft thresholds (configured via evictionSoft) honor a per-signal evictionSoftGracePeriod and the pod's terminationGracePeriodSeconds up to evictionMaxPodGracePeriod. Hard eviction never honors terminationGracePeriodSeconds, and node-pressure eviction of either kind never honors PodDisruptionBudgets.
To stop oscillation between pressure and no-pressure when the signal is hovering near a threshold, the kubelet keeps a node condition for evictionPressureTransitionPeriod after the signal recovers. The default is 5m0s. So MemoryPressure: True will stick around for at least five minutes after the underlying memory pressure clears.
When MemoryPressure: True or DiskPressure: True is set, the node.kubernetes.io/memory-pressure:NoSchedule or node.kubernetes.io/disk-pressure:NoSchedule taint is automatically applied to the node. This prevents the scheduler from placing new pods on a node that is actively shedding load. If you see Pending pods alongside Evicted pods, this is a likely cause; see Pod stuck in Pending: why Kubernetes cannot schedule your workload.
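For reference, the DaemonSet controller adds tolerations for these pressure taints automatically, which is why node agents keep scheduling onto a pressured node. A pod that genuinely must do the same would need a fragment like this (a sketch, not a recommendation for ordinary workloads):

# Pod spec fragment: tolerate the taints the kubelet applies under pressure.
# Ordinary workloads should NOT carry these; DaemonSets get them automatically.
tolerations:
- key: node.kubernetes.io/memory-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/disk-pressure
  operator: Exists
  effect: NoSchedule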
How the kubelet picks which pod to evict
This is the part that people get wrong most often. The kubelet's pod selection ranking for node-pressure eviction is a three-step sort, in this order:
1. Whether the pod's usage of the starved resource exceeds its request. Pods using more than they requested are evicted before pods using less than they requested. A pod with requests.memory: 256Mi and a current working set of 312Mi is a higher-ranked eviction candidate than a pod with requests.memory: 256Mi using 200Mi.
2. Pod priority. Within the group of pods that exceed their requests, lower-priority pods are evicted before higher-priority ones. Priority comes from the pod's priorityClassName and resolves to an integer. The built-in classes system-cluster-critical and system-node-critical carry very high values and are evicted last.
3. Magnitude of usage above request. Within the same priority tier, pods that exceed their request by a larger amount are evicted before pods that exceed it by a smaller amount.
QoS class is not a direct ranking input. It is sometimes summarized as "BestEffort dies first, then Burstable, then Guaranteed," which is the typical outcome but not the mechanism. The mechanism is the three-step sort above. The outcome looks QoS-shaped because:
- BestEffort pods have no requests, so any usage exceeds their (zero) request, putting them at the top of step 1.
- Guaranteed pods have requests equal to limits, so by definition their usage cannot exceed requests under normal operation, putting them at the bottom of step 1.
- Burstable pods land in the middle depending on whether they happen to exceed their requests when pressure hits.
Practical implication: a Guaranteed pod with a high priority value and modest actual usage is the safest configuration against eviction. A BestEffort pod with low priority is the least safe.
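A minimal sketch of that safest shape (the name, image, priority class, and sizes are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: payments-api                 # illustrative
spec:
  priorityClassName: prod-critical   # assumes a PriorityClass you have defined
  containers:
  - name: api
    image: registry.example.com/payments-api:1.4.2   # illustrative
    resources:
      requests:                      # requests == limits -> Guaranteed QoS
        cpu: "500m"
        memory: 512Mi
      limits:
        cpu: "500m"
        memory: 512Mi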
There is one carve-out worth knowing. Static pods, mirror pods, and pods with priorityClassName: system-node-critical (or system-cluster-critical) are excluded from the ranking entirely; the kubelet will not select them for eviction even under pressure. This protects daemonset workloads like kube-proxy, the CNI plugin, and metrics agents that the node depends on to function.
Root cause A: node memory pressure
The eviction message names memory.available. The node hit MemoryPressure and the kubelet shed pods.
Memory pressure on a node has two common origins. The first is overcommit: the sum of container memory limits across the node is much larger than node capacity, and several pods happened to burst toward their limits at the same time. The scheduler does not look at limits, only at requests, so a node can be massively overcommitted on limits while the scheduler still considers it underutilized. The second is a single hot pod with no limit set (BestEffort) or a wildly oversized limit, growing faster than the kubelet can reclaim.
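To make the overcommit mechanics concrete: a container shaped like the fragment below (numbers are illustrative) occupies only 256Mi in the scheduler's accounting but may legally grow to 2Gi. Sixteen such containers fit a node with 4 GiB allocatable by requests (16 × 256Mi = 4Gi) while their limits sum to 32Gi.

# Container spec fragment: heavily overcommitted on limits.
# The scheduler reserves only the request; the limit is invisible to placement.
resources:
  requests:
    memory: 256Mi   # what the scheduler counts against node allocatable
  limits:
    memory: 2Gi     # what the container may actually grow to (8x the request)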
Diagnosis
Find the node that triggered the pressure:
kubectl describe node <node-name> | grep -A 5 "Conditions"
# Look for MemoryPressure with a recent transition time, or for the condition
# being True now if you are still under pressure.
Compare actual memory usage to capacity (requires metrics-server):
kubectl top node
# NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
# node-3 2400m 60% 14820Mi 92%
List each container's memory limit on the node to spot overcommit (containers with no limit print null):
kubectl get pods -A --field-selector=spec.nodeName=<node-name> -o json \
  | jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.spec.containers[].resources.limits.memory)"'
For finer-grained analysis, kubectl describe node reports Allocated resources with both Requests and Limits as percentages of node allocatable. Limits at 200% or more is a significant overcommit signal.
Fix
Three layers, depending on what diagnosis showed.
Per-pod. If the eviction message named one container as having a "larger consumption of memory," that pod's request is too low and its actual working set is too high. Either raise resources.requests.memory to match observed usage so it is no longer the eviction target, or fix the underlying memory growth. The OOMKilled article walks through right-sizing memory limits and language-specific gotchas (JVM, Go, Node.js, Python).
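For example, if the working set reported in the eviction message hovers around 312Mi, a right-sized request might look like this (sizes are illustrative; aim for the observed p95 plus headroom, not the exact peak):

# Container spec fragment: request raised to cover the observed working set.
resources:
  requests:
    memory: 384Mi   # above the ~312Mi working set the kubelet reported
  limits:
    memory: 512Mi   # hard ceiling; exceeding this is an OOMKill, not an eviction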
Per-node. If overcommit is the issue, lower aggregate memory limits across the workloads on that node, or add capacity. The resource requests and limits article covers the request-vs-limit relationship and overcommit math in detail.
Cluster-wide. If multiple nodes are hitting MemoryPressure, the cluster is undersized. Add nodes, enable Cluster Autoscaler, or move noisy workloads to a dedicated node pool with taints.
Verification: kubectl describe node <node-name> shows MemoryPressure: False, and no new Evicted pods appear in kubectl get events -A | grep Evicted over the next ten minutes (longer than evictionPressureTransitionPeriod).
Root cause B: disk pressure and ephemeral storage
The eviction message names nodefs.available, nodefs.inodesFree, or imagefs.available. The node hit DiskPressure.
Disk pressure has more sources than memory pressure does. The most common are container logs growing without rotation, emptyDir volumes filling up, the writable container layer accumulating data the application wrote outside any volume, image cache bloat on the imagefs, and inode exhaustion (lots of tiny files in logs or caches). Less common but worth checking: the kubelet's working directory itself filling up because checkpoint files or terminated-pod state did not clean up.
Local ephemeral storage capacity isolation went GA in Kubernetes 1.25. With it, you can set resources.requests.ephemeral-storage and resources.limits.ephemeral-storage on a container, and the kubelet eviction manager will terminate pods that exceed their ephemeral-storage limit regardless of overall node disk pressure. This turns one noisy pod into a self-contained problem instead of a node-wide one.
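Setting both values is a one-stanza change per container (sizes are illustrative):

# Container spec fragment: bound local scratch/log usage explicitly.
resources:
  requests:
    ephemeral-storage: 1Gi   # counted by the scheduler when placing the pod
  limits:
    ephemeral-storage: 4Gi   # kubelet evicts this pod alone if it exceeds this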
Diagnosis
Identify which signal fired:
kubectl describe node <node-name> | grep -B 1 -A 5 "DiskPressure"
On the node itself (SSH, debug pod, or kubelet logs):
df -h /var/lib/kubelet /var/lib/docker /var/lib/containerd 2>/dev/null
df -i /var/lib/kubelet /var/lib/docker /var/lib/containerd 2>/dev/null
Find the largest consumers:
sudo du -sh /var/lib/kubelet/pods/*/volumes/* 2>/dev/null | sort -h | tail
sudo du -sh /var/log/pods/* 2>/dev/null | sort -h | tail
For images on imagefs (containerd):
sudo crictl images | sort -k 4 -h    # SIZE is the fourth column
For container disk usage (Docker):
docker system df -v
For deeper coverage of ephemeral-storage accounting and what the kubelet measures, see Kubernetes ephemeral storage: limits, eviction, and container disk management.
Fix
Per-pod. Set resources.limits.ephemeral-storage on every container that writes to local disk. Move heavy writes to a PersistentVolume (or a tmpfs emptyDir if the data is truly ephemeral and small). Ensure log rotation is configured.
Per-node. Run image garbage collection: the kubelet does this automatically once imagefs usage climbs past imageGCHighThresholdPercent (default 85%), but you can prune manually with crictl rmi --prune or docker image prune -a -f. Increase the size of /var/lib/kubelet and the imagefs volume if the node was simply provisioned too small.
Cluster-wide. Enforce a pod-level default for requests.ephemeral-storage via a LimitRange. Without a default, pods that are effectively BestEffort for ephemeral-storage are routine, and they are the first evicted when any pod fills the disk.
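A minimal sketch of such a LimitRange (namespace and sizes are illustrative):

apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-defaults   # illustrative
  namespace: production              # illustrative
spec:
  limits:
  - type: Container
    defaultRequest:
      ephemeral-storage: 512Mi       # injected when a container sets no request
    default:
      ephemeral-storage: 2Gi         # injected when a container sets no limit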
Verification: kubectl describe node <node-name> shows DiskPressure: False. df -h on the node shows ample free space. The image cache after GC is meaningfully smaller (crictl images | wc -l drops).
Root cause C: PID pressure
The eviction message names pid.available. The node ran out of process IDs.
PID exhaustion is rarer than memory or disk pressure, but it shows up in two distinct shapes. The first is a process leak: an application spawning child processes faster than it reaps them, eventually consuming the node-wide PID space. The second is fork-bombs from misconfigured workloads (a poorly-written shell loop, an unbounded subprocess pool in Python, or a shell that is calling itself recursively).
The kubelet enforces PID isolation through per-pod PID limits (the podPidsLimit field in the kubelet configuration, or the --pod-max-pids flag) and node-level reservations (a pid=<count> entry in systemReserved and kubeReserved). When neither is set, a single runaway pod can starve the entire node.
Diagnosis
kubectl describe node <node-name> | grep -B 1 -A 5 "PIDPressure"
On the node:
cat /proc/sys/kernel/pid_max # ceiling
ps -e --no-headers | wc -l # current
Per-pod process counts, read from the pod cgroups (the path pattern below assumes the standard kubepods hierarchy; layout varies by distro and cgroup version):
# one line per pod/container cgroup: <process count> <cgroup path>
sudo find /sys/fs/cgroup -path '*kubepods*' -name pids.current \
  | while read -r f; do echo "$(sudo cat "$f") ${f%/pids.current}"; done | sort -rn | head
Or with kubectl debug and ps:
kubectl debug node/<node-name> -it --image=busybox -- sh -c 'ps | wc -l'
# the node debug pod shares the host PID namespace, so this counts all processes on the node
Fix
Set per-pod PID limits via the podPidsLimit field in the kubelet configuration (or the --pod-max-pids flag). Then investigate the leaking application: a process tree that grows without bound is the application's bug, not a Kubernetes problem. The kubelet eviction is a backstop, not the cure.
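A sketch of the relevant kubelet configuration fields (values are illustrative; on managed platforms this goes through the node-pool configuration surface):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 1024          # hard cap on PIDs per pod
systemReserved:
  pid: "1000"               # PIDs held back for OS daemons
kubeReserved:
  pid: "1000"               # PIDs held back for the kubelet and container runtime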
Verification: kubectl describe node <node-name> shows PIDPressure: False, and ps -e | wc -l on the node returns a steady number under load instead of climbing.
Tuning eviction thresholds
The defaults are conservative and reasonable for most clusters. Tune them only when the diagnosis points at a clear mismatch with your hardware or workload pattern.
The kubelet config (typically /var/lib/kubelet/config.yaml on a node, or the cluster-wide kubelet ConfigMap on managed platforms) accepts these fields:
# /var/lib/kubelet/config.yaml (KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"      # raise to evict earlier on small nodes
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "500Mi"      # warn earlier, with a grace period
evictionSoftGracePeriod:
  memory.available: "2m"
evictionMaxPodGracePeriod: 30    # cap on terminationGracePeriodSeconds during soft eviction
evictionPressureTransitionPeriod: "5m"   # default; raise to dampen flapping
evictionMinimumReclaim:
  memory.available: "100Mi"      # reclaim at least this much beyond the threshold
  nodefs.available: "500Mi"
Three tuning patterns are worth knowing.
Earlier hard eviction on small nodes. The default memory.available: 100Mi is tiny on a 64 GiB node. Raising it to 200Mi or 500Mi gives the kubelet more headroom to act before the kernel OOM killer fires (which is faster and less graceful). The trade-off is that the extra headroom is held in reserve and unavailable to pods.
Soft thresholds with grace periods. evictionSoft with a multi-minute evictionSoftGracePeriod gives transient spikes a chance to recover without evicting. Pair with evictionMaxPodGracePeriod so a soft eviction respects (a capped version of) terminationGracePeriodSeconds. This is friendlier for stateful workloads.
evictionMinimumReclaim. Without this, the kubelet stops evicting as soon as the signal moves one byte above the threshold, which produces a ping-pong pattern. Setting evictionMinimumReclaim makes the kubelet keep going until it reclaims a meaningful chunk past the threshold. This is the single highest-impact tuning for clusters that show repeated flapping between MemoryPressure: True and MemoryPressure: False.
On managed Kubernetes (GKE, EKS, AKS), kubelet config edits are platform-specific and often go through a node-pool configuration object rather than direct edits to /var/lib/kubelet/config.yaml. Consult the platform's docs for the right surface.
When to escalate
If you have worked through diagnosis and the evictions persist, collect this before asking for help:
- Output of kubectl describe pod <evicted-pod-name> -n <namespace> (full output)
- Output of kubectl describe node <affected-node-name>, especially the Conditions, Allocated resources, and Events sections
- kubectl get events -A --sort-by='.lastTimestamp' | grep -i evict for the last hour
- The cluster-wide Evicted-per-node count (the jq snippet from earlier in this article)
- Kubernetes version (kubectl version)
- Whether the cluster is managed (GKE, EKS, AKS, OpenShift, k3s, kubeadm, etc.)
- Output of df -h and df -i on the affected node (or via kubectl debug node/<name>)
- Output of kubectl top node and kubectl top pod -A | sort -k 4 -h | tail -20
- The kubelet config in use: on managed platforms, the node-pool config; on self-managed nodes, the contents of /var/lib/kubelet/config.yaml
- Whether any LimitRange or ResourceQuota is in effect: kubectl get limitrange,resourcequota -A
This is enough to diagnose the eviction without follow-up questions.
How to prevent recurrence
A recurring eviction problem is a sizing problem, not a runtime problem. Prevention happens at workload definition and cluster provisioning time, not at the eviction-cleanup step.
- Set resources.requests.memory based on the observed p95 working set, not on guesses. A pod whose actual usage exceeds its request is the eviction manager's first target.
- Set resources.limits.memory for every production container. BestEffort QoS is a recipe for eviction whenever any neighbor causes memory pressure.
- Set resources.requests.ephemeral-storage and resources.limits.ephemeral-storage on any container that writes to local disk (logs to stdout, emptyDir, scratch files). Without these, a noisy neighbor takes down the whole node's disk budget.
- Use a LimitRange to inject default ephemeral-storage and memory requests/limits at admission time so namespaces cannot run pods without them.
- Assign priorityClassName to critical workloads so they survive the eviction ranking even when usage exceeds requests.
- Run image garbage collection aggressively on imagefs-heavy workloads: lower imageGCHighThresholdPercent (default 85) to 75 if images accumulate.
- Add a Prometheus alert on kube_pod_status_reason{reason="Evicted"} > 0 so you see evictions when they happen, not when a customer reports them. A second alert on kube_node_status_condition{condition="MemoryPressure",status="true"} == 1 for more than five minutes catches sustained pressure.
- For workloads where eviction is unacceptable (databases, persistent queues, anything with a stateful disk), use Guaranteed QoS (request equals limit), a high-value priorityClassName, and a PodDisruptionBudget; see the sketch after this list. Note that PodDisruptionBudgets do not protect against node-pressure eviction at all, hard or soft; they only constrain voluntary disruptions that go through the Eviction API, such as drains.
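A combined sketch of that last configuration (names, labels, and numbers are illustrative; the PriorityClass value only needs to outrank your default workloads):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: stateful-critical        # illustrative
value: 100000                    # above ordinary workloads, far below the system-* classes
globalDefault: false
description: "Stateful workloads that should be evicted last"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: postgres-pdb             # illustrative
  namespace: production          # illustrative
spec:
  minAvailable: 2                # guards drains and other voluntary disruptions only
  selector:
    matchLabels:
      app: postgres              # illustrative label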