Kubernetes pod eviction: node pressure, disk pressure, and Evicted status

Evicted pods in kubectl get pods are the kubelet's signal that a node ran out of memory, disk, or PIDs. The kubelet picks pods to terminate using a three-step ranking that does not use QoS class as a direct input, then sets status.phase=Failed and status.reason=Evicted. This article covers how to read the eviction reason, clean up the leftover pod objects, identify which pressure caused it, and stop it from happening again.

What an Evicted pod looks like

You ran kubectl get pods and saw rows like this:

NAME                          READY   STATUS    RESTARTS   AGE
api-7d4f9b8c6-xvk2p           0/1     Evicted   0          12m
api-7d4f9b8c6-r9k8t           0/1     Evicted   0          11m
api-7d4f9b8c6-2m4qn           1/1     Running   0          3m

Evicted is not a real pod phase. It is the value the kubelet stamps onto status.reason when it terminated the pod for node-pressure reasons. The actual status.phase on the leftover pod object is Failed. The pod's containers are gone. Only the API object remains, taking up a slot in the namespace's pods object-count quota until something deletes it.

If a workload controller (Deployment, ReplicaSet, StatefulSet) owns the pod, a replacement is scheduled almost immediately. The Evicted object is a tombstone, not a running pod. Bare pods (no controller) are not replaced.

Eviction is not OOMKilled and not preemption

These three failure modes look similar in dashboards and produce overlapping symptoms, but they are different events with different fixes.

Container OOMKilled
  Who acts: Linux kernel cgroup OOM killer
  Trigger: a container exceeds its resources.limits.memory
  Pod result: that one container exits with code 137; the pod stays, and the kubelet restarts the container per restartPolicy

Node-pressure eviction
  Who acts: kubelet eviction manager
  Trigger: a node-wide signal crosses an eviction threshold
  Pod result: the whole pod is terminated; status.phase=Failed, status.reason=Evicted

Preemption
  Who acts: kube-scheduler
  Trigger: a pending higher-priority pod needs room
  Pod result: a lower-priority pod is evicted to make room, then rescheduled per its controller
Three rules of thumb. The kernel kills one container; the kubelet kills the whole pod. OOMKilled fires when one container crosses its own cgroup limit; eviction fires when the node as a whole is under pressure. Preemption is a scheduler decision, not a kubelet decision, and is independent of resource pressure on the node.

Reading the eviction reason in kubectl describe pod

Start here, every time:

kubectl describe pod api-7d4f9b8c6-xvk2p -n production

Two parts of the output matter. Status:

Status:           Failed
Reason:           Evicted
Message:          The node was low on resource: memory. Threshold quantity: 100Mi,
                  available: 87Mi. Container api was using 312Mi, request is 256Mi,
                  has larger consumption of memory.

The Message line names the eviction signal that fired (memory, ephemeral-storage, pid, etc.) and the threshold quantity vs. the available quantity at eviction time. It also tells you which container the kubelet flagged as the worst offender relative to its request, which is the central input to the kubelet's pod-selection ranking (see below).

Then the events:

Events:
  Type     Reason    Age   From               Message
  ----     ------    ----  ----               -------
  Warning  Evicted   12m   kubelet            The node was low on resource: memory.
  Normal   Killing   12m   kubelet            Stopping container api

Event reason Evicted from kubelet confirms it. If you see Preempted from default-scheduler instead, you are looking at scheduler preemption, not node-pressure eviction; the cause is a higher-priority pending pod, not node resource pressure.

To find every Evicted pod cluster-wide:

kubectl get pods --all-namespaces --field-selector=status.phase=Failed -o json \
  | jq -r '.items[] | select(.status.reason=="Evicted") | "\(.metadata.namespace)/\(.metadata.name)"'

Or to count them per node, which often points straight at the misbehaving node:

kubectl get pods -A -o json \
  | jq -r '.items[] | select(.status.reason=="Evicted") | .spec.nodeName' \
  | sort | uniq -c | sort -rn

A node with dozens of Evicted pods is the node to investigate.

Cleaning up Evicted pods

Evicted pods do not delete themselves. The kubelet leaves the API object in place so you can inspect why it was evicted. Long-lived Evicted pods produce two real problems.

The first is quota pressure. Failed pods (including Evicted ones) count against the pods and count/pods object-count quotas because those quotas are independent of pod phase. They do not count against requests.cpu and requests.memory quotas, which only sum pods in a non-terminal state. So an Evicted pod consumes a pod-count slot but no CPU or memory budget. In a namespace with pods: 100, a hundred Evicted leftovers will block new pod creation even though the cluster has plenty of CPU and memory headroom.

The second is observability noise. Lists, dashboards, and alerts that filter on status.phase=Failed or count failed pods will trip on the leftovers long after the underlying pressure has cleared.

To delete every Evicted pod in a namespace:

kubectl get pods -n production --field-selector=status.phase=Failed -o json \
  | jq -r '.items[] | select(.status.reason=="Evicted") | .metadata.name' \
  | xargs -r kubectl delete pod -n production

Or cluster-wide:

kubectl get pods -A --field-selector=status.phase=Failed -o json \
  | jq -r '.items[] | select(.status.reason=="Evicted") | "-n \(.metadata.namespace) \(.metadata.name)"' \
  | xargs -r -L1 kubectl delete pod

Verification: re-run kubectl get pods -n production --field-selector=status.phase=Failed and confirm the list is empty. The pods quota usage drops correspondingly: kubectl describe resourcequota -n production.

A cleanup loop is a workaround, not a fix. The fix is to stop the underlying pressure (covered below).

Eviction signals and node conditions

The kubelet monitors a fixed set of signals on every node and translates them into node conditions visible in kubectl describe node.

memory.available
  Measures: node.status.capacity[memory] minus the working set
  Node condition: MemoryPressure
  Default hard threshold: < 100Mi

nodefs.available
  Measures: free space on the kubelet's main filesystem (/var/lib/kubelet, logs, emptyDir)
  Node condition: DiskPressure
  Default hard threshold: < 10%

nodefs.inodesFree
  Measures: free inodes on nodefs
  Node condition: DiskPressure
  Default hard threshold: < 5%

imagefs.available
  Measures: free space on the container runtime's image filesystem
  Node condition: DiskPressure
  Default hard threshold: < 15%

imagefs.inodesFree
  Measures: free inodes on imagefs
  Node condition: DiskPressure
  Default hard threshold: none documented

containerfs.available
  Measures: free space on the writable-container-layer filesystem (Linux only, separate-disk feature)
  Node condition: DiskPressure
  Default hard threshold: matches imagefs

pid.available
  Measures: maxpid minus curproc
  Node condition: PIDPressure
  Default hard threshold: none documented

memory.available on Linux is computed from cgroup memory accounting, not from free -m. Inactive file-backed pages count as reclaimable, so the number is higher than what free reports as "free." This is intentional: the kubelet's signal reflects what is actually reclaimable under pressure.
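A sketch of that arithmetic with made-up numbers (this mirrors the formula, not the kubelet's actual code):

```shell
# Hypothetical values in MiB. The kubelet subtracts inactive file-backed
# pages (reclaimable cache) from usage before comparing against capacity.
capacity=16384        # node memory capacity
usage=14000           # total cgroup memory usage
inactive_file=3000    # reclaimable file-backed pages

working_set=$((usage - inactive_file))
available=$((capacity - working_set))
echo "memory.available = ${available}Mi"   # 5384Mi
```

With these numbers, free would report only 2384Mi "free" (capacity minus usage), while the kubelet's signal is 5384Mi, because 3000Mi of page cache is reclaimable under pressure.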

Hard thresholds fire immediately, with no grace period. Soft thresholds (configured via evictionSoft) honor a per-signal evictionSoftGracePeriod and the pod's terminationGracePeriodSeconds up to evictionMaxPodGracePeriod. Hard eviction never honors terminationGracePeriodSeconds, and node-pressure eviction of either kind ignores PodDisruptionBudgets.

To stop oscillation between pressure and no-pressure when the signal is hovering near a threshold, the kubelet keeps a node condition for evictionPressureTransitionPeriod after the signal recovers. The default is 5m0s. So MemoryPressure: True will stick around for at least five minutes after the underlying memory pressure clears.

When MemoryPressure: True or DiskPressure: True is set, the node.kubernetes.io/memory-pressure:NoSchedule or node.kubernetes.io/disk-pressure:NoSchedule taint is automatically applied to the node. This prevents the scheduler from placing new pods on a node that is actively shedding load. If you see Pending pods alongside Evicted pods, this is a likely cause; see Pod stuck in Pending: why Kubernetes cannot schedule your workload.

How the kubelet picks which pod to evict

This is the part that people get wrong most often. The kubelet's pod selection ranking for node-pressure eviction is a three-step sort, in this order:

  1. Whether the pod's usage of the starved resource exceeds its request. Pods using more than they requested are evicted before pods using less than they requested. A pod with requests.memory: 256Mi and current working set 312Mi is a higher-ranked eviction candidate than a pod with requests.memory: 256Mi using 200Mi.
  2. Pod Priority. Within the group of pods that exceed their requests, lower-priority pods are evicted before higher-priority ones. Priority comes from the pod's priorityClassName and resolves to an integer. Built-in classes system-cluster-critical and system-node-critical carry very high values and are evicted last.
  3. Magnitude of usage above request. Within the same priority tier, pods that exceed their request by a larger amount are evicted before pods that exceed it by a smaller amount.

QoS class is not a direct ranking input. It is sometimes summarized as "BestEffort dies first, then Burstable, then Guaranteed," which is the typical outcome but not the mechanism. The mechanism is the three-step sort above. The outcome looks QoS-shaped because:

  • BestEffort pods have no requests, so any usage exceeds their (zero) request, putting them at the top of step 1.
  • Guaranteed pods have requests equal to limits, so by definition their usage cannot exceed requests under normal operation, putting them at the bottom of step 1.
  • Burstable pods land in the middle depending on whether they happen to exceed their requests when pressure hits.

Practical implication: a Guaranteed pod with a high priority value and modest actual usage is the safest configuration against eviction. A BestEffort pod with low priority is the least safe.

There is one carve-out worth knowing. Static pods, mirror pods, and pods with priorityClassName: system-node-critical (or system-cluster-critical) are excluded from the ranking entirely; the kubelet will not select them for eviction even under pressure. This protects daemonset workloads like kube-proxy, the CNI plugin, and metrics agents that the node depends on to function.
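The ranking can be mimicked with an ordinary sort over hypothetical pod data (names and numbers invented for illustration; the columns are an exceeds-request flag, the priority value, and MiB of usage above request):

```shell
# Columns: name  exceeds_request(1/0)  priority  overage_mib
printf '%s\n' \
  'besteffort-cache 1 0    150' \
  'burstable-worker 1 1000 300' \
  'burstable-api    1 1000 56' \
  'guaranteed-db    0 1000 0' |
sort -k2,2nr -k3,3n -k4,4nr   # step 1 descending, step 2 ascending, step 3 descending
```

The output order is the eviction order: besteffort-cache first (any usage exceeds its zero request, and it has the lowest priority), then burstable-worker (larger overage than burstable-api at the same priority), with guaranteed-db last.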

Root cause A: node memory pressure

The eviction message names memory.available. The node hit MemoryPressure and the kubelet shed pods.

Memory pressure on a node has two common origins. The first is overcommit: the sum of container memory limits across the node is much larger than node capacity, and several pods happened to burst toward their limits at the same time. The scheduler does not look at limits, only at requests, so a node can be massively overcommitted on limits while the scheduler still considers it underutilized. The second is a single hot pod with no limit set (BestEffort) or a wildly oversized limit, growing faster than the kubelet can reclaim.

Diagnosis

Find the node that triggered the pressure:

kubectl describe node <node-name> | grep -A 5 "Conditions"
# Look for MemoryPressure with a recent transition time, or for the condition
# being True now if you are still under pressure.

Compare actual memory usage to capacity (requires metrics-server):

kubectl top node
# NAME       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
# node-3     2400m        60%    14820Mi         92%

List the container memory limits on the node to spot overcommit (unset limits show as none):

kubectl get pods -A --field-selector=spec.nodeName=<node-name> -o json \
  | jq -r '.items[].spec.containers[] | "\(.name)\t\(.resources.limits.memory // "none")"'

For finer-grained analysis, kubectl describe node reports Allocated resources with both Requests and Limits as percentages of node allocatable. Limits at 200% or more is a significant overcommit signal.
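The overcommit math itself is simple; a sketch with invented numbers:

```shell
# Hypothetical node: 16 GiB allocatable, container memory limits summing to 36 GiB
allocatable_mib=16384
limits_sum_mib=36864

overcommit_pct=$(( limits_sum_mib * 100 / allocatable_mib ))
echo "memory limits at ${overcommit_pct}% of allocatable"   # 225%
```

A node can sit at 225% on limits while requests are under 100%, which is exactly why the scheduler keeps placing pods on it right up until the bursts collide.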

Fix

Three layers, depending on what diagnosis showed.

Per-pod. If the eviction message named one container as having a "larger consumption of memory," that pod's request is too low and its actual working set is too high. Either raise resources.requests.memory to match observed usage so it is no longer the eviction target, or fix the underlying memory growth. The OOMKilled article walks through right-sizing memory limits and language-specific gotchas (JVM, Go, Node.js, Python).

Per-node. If overcommit is the issue, lower aggregate memory limits across the workloads on that node, or add capacity. The resource requests and limits article covers the request-vs-limit relationship and overcommit math in detail.

Cluster-wide. If multiple nodes are hitting MemoryPressure, the cluster is undersized. Add nodes, enable Cluster Autoscaler, or move noisy workloads to a dedicated node pool with taints.

Verification: kubectl describe node <node-name> shows MemoryPressure: False, and no new Evicted pods appear in kubectl get events -A | grep Evicted over the next ten minutes (longer than evictionPressureTransitionPeriod).

Root cause B: disk pressure and ephemeral storage

The eviction message names nodefs.available, nodefs.inodesFree, or imagefs.available. The node hit DiskPressure.

Disk pressure has more sources than memory pressure does. The most common are container logs growing without rotation, emptyDir volumes filling up, the writable container layer accumulating data the application wrote outside any volume, image cache bloat on the imagefs, and inode exhaustion (lots of tiny files in logs or caches). Less common but worth checking: the kubelet's working directory itself filling up because checkpoint files or terminated-pod state did not clean up.

Local ephemeral storage capacity isolation went GA in Kubernetes 1.25. With it, you can set resources.requests.ephemeral-storage and resources.limits.ephemeral-storage on a container, and the kubelet eviction manager will terminate pods that exceed their ephemeral-storage limit regardless of overall node disk pressure. This turns one noisy pod into a self-contained problem instead of a node-wide one.
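In a container spec, the two fields look like this (values are illustrative):

```yaml
# Illustrative fragment of a container spec. With the limit set, the kubelet
# evicts this pod when its local writes (emptyDir, logs, writable layer)
# exceed 2Gi, regardless of node-wide disk pressure.
resources:
  requests:
    ephemeral-storage: "1Gi"
  limits:
    ephemeral-storage: "2Gi"
```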

Diagnosis

Identify which signal fired:

kubectl describe node <node-name> | grep -B 1 -A 5 "DiskPressure"

On the node itself (SSH, debug pod, or kubelet logs):

df -h /var/lib/kubelet /var/lib/docker /var/lib/containerd 2>/dev/null
df -i /var/lib/kubelet /var/lib/docker /var/lib/containerd 2>/dev/null

Find the largest consumers:

sudo du -sh /var/lib/kubelet/pods/*/volumes/* 2>/dev/null | sort -h | tail
sudo du -sh /var/log/pods/* 2>/dev/null | sort -h | tail

For images on imagefs (containerd):

sudo crictl images | sort -k 4 -h

For container disk usage (Docker):

docker system df -v

For deeper coverage of ephemeral-storage accounting and what the kubelet measures, see Kubernetes ephemeral storage: limits, eviction, and container disk management.

Fix

Per-pod. Set resources.limits.ephemeral-storage on every container that writes to local disk. Move heavy writes to a PersistentVolume (or a tmpfs emptyDir if the data is truly ephemeral and small). Ensure log rotation is configured.

Per-node. Run image garbage collection: the kubelet does this automatically when imagefs usage rises above imageGCHighThresholdPercent (default 85) and prunes until usage falls below imageGCLowThresholdPercent (default 80), but you can prune manually with crictl rmi --prune or docker image prune -a -f. Increase the size of /var/lib/kubelet and the imagefs volume if the node was simply provisioned too small.
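The two GC thresholds live in the kubelet config; a fragment with tightened values (illustrative):

```yaml
# KubeletConfiguration fragment. GC starts when imagefs usage rises above the
# high threshold and prunes images until usage falls below the low threshold.
imageGCHighThresholdPercent: 75   # default 85
imageGCLowThresholdPercent: 60    # default 80
```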

Cluster-wide. Enforce a pod-level default for requests.ephemeral-storage via a LimitRange. Without a default, BestEffort pods on ephemeral-storage are routine and they are the first to be evicted when any pod fills the disk.
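A LimitRange that injects those defaults at admission time might look like this (name and values illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-defaults   # illustrative name
spec:
  limits:
  - type: Container
    defaultRequest:
      ephemeral-storage: "512Mi"    # injected when a container sets no request
    default:
      ephemeral-storage: "1Gi"      # injected when a container sets no limit
```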

Verification: kubectl describe node <node-name> shows DiskPressure: False. df -h on the node shows ample free space. The image cache after GC is meaningfully smaller (crictl images | wc -l drops).

Root cause C: PID pressure

The eviction message names pid.available. The node ran out of process IDs.

PID exhaustion is rarer than memory or disk pressure, but it shows up in two distinct shapes. The first is a process leak: an application spawning child processes faster than it reaps them, eventually consuming the node-wide PID space. The second is fork-bombs from misconfigured workloads (a poorly-written shell loop, an unbounded subprocess pool in Python, or a shell that is calling itself recursively).

The kubelet enforces PID-based isolation through per-pod PID limits (the podPidsLimit kubelet config field) and node-level PID reservations (pid=<count> under systemReserved and kubeReserved). When neither is set, a single runaway pod can starve the entire node.

Diagnosis

kubectl describe node <node-name> | grep -B 1 -A 5 "PIDPressure"

On the node:

cat /proc/sys/kernel/pid_max          # ceiling
ps -e --no-headers | wc -l            # current

Map each container to its main PID (containerd), then inspect each process tree:

sudo crictl ps -q | while read -r ctr; do
  echo "$(sudo crictl inspect --output go-template --template '{{.info.pid}}' "$ctr") $ctr"
done

Or with kubectl debug and ps:

kubectl debug node/<node-name> -it --image=busybox -- sh -c "ps | wc -l"
# busybox ps lists every process it can see in /proc; the node debug pod
# shares the host PID namespace, so this counts host-wide processes.

Fix

Set per-pod PID limits via the kubelet's podPidsLimit config field (or the --pod-max-pids flag). Investigate the leaking application: a process tree that grows without bound is the application's bug, not a Kubernetes problem. The kubelet eviction is a backstop, not the cure.
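The kubelet config equivalent of --pod-max-pids is a single field (value illustrative):

```yaml
# KubeletConfiguration fragment: cap every pod on this node at 4096 PIDs,
# so a fork bomb exhausts its own pod's budget instead of the node's.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
podPidsLimit: 4096
```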

Verification: kubectl describe node <node-name> shows PIDPressure: False, and ps -e | wc -l on the node returns a steady number under load instead of climbing.

Tuning eviction thresholds

The defaults are conservative and reasonable for most clusters. Tune them only when the diagnosis points at a clear mismatch with your hardware or workload pattern.

The kubelet config (typically /var/lib/kubelet/config.yaml on a node, or the cluster-wide kubelet ConfigMap on managed platforms) accepts these fields:

# /var/lib/kubelet/config.yaml (Kubernetes 1.29+ KubeletConfiguration)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "200Mi"           # raise to evict earlier, ahead of the kernel OOM killer
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
evictionSoft:
  memory.available: "500Mi"           # warn earlier with a grace period
evictionSoftGracePeriod:
  memory.available: "2m"
evictionMaxPodGracePeriod: 30          # cap on terminationGracePeriodSeconds during soft eviction
evictionPressureTransitionPeriod: "5m" # default; raise to dampen flapping
evictionMinimumReclaim:
  memory.available: "100Mi"            # reclaim at least this much beyond the threshold
  nodefs.available: "500Mi"

Three tuning patterns are worth knowing.

Earlier hard eviction on large nodes. The default memory.available: 100Mi is tiny relative to a 64 GiB node. Raising it to 200Mi or 500Mi gives the kubelet more headroom to act before the kernel OOM killer fires (which is faster and less graceful). The trade-off is slightly less schedulable capacity, because more memory is held in reserve.

Soft thresholds with grace periods. evictionSoft with a multi-minute evictionSoftGracePeriod gives transient spikes a chance to recover without evicting. Pair with evictionMaxPodGracePeriod so a soft eviction respects (a capped version of) terminationGracePeriodSeconds. This is friendlier for stateful workloads.

evictionMinimumReclaim. Without this, the kubelet stops evicting as soon as the signal moves one byte above the threshold, which produces a ping-pong pattern. Setting evictionMinimumReclaim makes the kubelet keep going until it reclaims a meaningful chunk past the threshold. This is the single highest-impact tuning for clusters that show repeated flapping between MemoryPressure: True and MemoryPressure: False.
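The stop condition is simple arithmetic (a sketch; threshold values illustrative):

```shell
# With a hard threshold of memory.available: 100Mi and
# evictionMinimumReclaim memory.available: 100Mi, the kubelet keeps
# evicting until available memory reaches threshold + minimum reclaim.
threshold_mib=100
min_reclaim_mib=100

stop_at=$(( threshold_mib + min_reclaim_mib ))
echo "eviction stops once memory.available >= ${stop_at}Mi"   # 200Mi
```

Without the minimum reclaim, eviction would stop as soon as memory.available crept past 100Mi, leaving the signal one small allocation away from firing again.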

On managed Kubernetes (GKE, EKS, AKS), kubelet config edits are platform-specific and often go through a node-pool configuration object rather than direct edits to /var/lib/kubelet/config.yaml. Consult the platform's docs for the right surface.

When to escalate

If you have worked through diagnosis and the evictions persist, collect this before asking for help:

  • Output of kubectl describe pod <evicted-pod-name> -n <namespace> (full output)
  • Output of kubectl describe node <affected-node-name>, especially the Conditions, Allocated resources, and Events sections
  • kubectl get events -A --sort-by='.lastTimestamp' | grep -i evict for the last hour
  • Cluster-wide Evicted-per-node count (the jq snippet from earlier in this article)
  • Kubernetes version (kubectl version)
  • Whether the cluster is managed (GKE, EKS, AKS, OpenShift, k3s, kubeadm, etc.)
  • Output of df -h and df -i on the affected node (or via kubectl debug node/<name>)
  • Output of kubectl top node and kubectl top pod -A | sort -k 4 -h | tail -20
  • The kubelet config in use: on managed platforms, the node-pool config; on self-managed, the contents of /var/lib/kubelet/config.yaml
  • Whether any LimitRange or ResourceQuota is in effect: kubectl get limitrange,resourcequota -A

This is enough to diagnose the eviction without follow-up questions.

How to prevent recurrence

A recurring eviction problem is a sizing problem, not a runtime problem. Prevention happens at workload definition and cluster provisioning time, not at the eviction-cleanup step.

  • Set resources.requests.memory based on observed p95 working-set, not on guesses. A pod whose actual usage exceeds its request is the eviction manager's first target.
  • Set resources.limits.memory for every production container. BestEffort QoS is a recipe for eviction whenever any neighbor causes memory pressure.
  • Set resources.requests.ephemeral-storage and resources.limits.ephemeral-storage on any container that writes to local disk (logs to stdout, emptyDir, scratch files). Without these, a noisy neighbor takes down the whole node's disk budget.
  • Use a LimitRange to inject default ephemeral-storage and memory requests/limits at admission time so namespaces cannot run pods without them.
  • Assign priorityClassName to critical workloads so they survive the eviction ranking even when usage exceeds requests.
  • Run image garbage collection aggressively on imagefs-heavy workloads: lower imageGCHighThresholdPercent (default 85) to 75 if images accumulate.
  • Add a Prometheus alert on kube_pod_status_reason{reason="Evicted"} > 0 so you see evictions when they happen, not when a customer reports them. A second alert on kube_node_status_condition{condition="MemoryPressure",status="true"} == 1 for more than five minutes catches sustained pressure.
  • For workloads where eviction is unacceptable (databases, persistent queues, anything with a stateful disk), use Guaranteed QoS (request equals limit), a high priorityClassName, and a PodDisruptionBudget. Note that PodDisruptionBudgets do not protect against node-pressure eviction at all, soft or hard; they only guard voluntary, API-initiated disruptions such as drains.
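As Prometheus alerting rules, the two alerts from the list above could be sketched like this (kube-state-metrics metric names; group name, alert names, and severities are illustrative):

```yaml
groups:
- name: kubelet-eviction             # illustrative rule group
  rules:
  - alert: PodEvicted
    expr: sum by (namespace) (kube_pod_status_reason{reason="Evicted"}) > 0
    labels:
      severity: warning
  - alert: NodeMemoryPressure
    expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
    for: 5m
    labels:
      severity: critical
```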

