What counts as ephemeral storage, what does not
The Kubernetes documentation defines local ephemeral storage as node-level storage that is not guaranteed to survive a pod restart. The kubelet provides it for scratch space, caching, container image layers, and the writable layers of running containers. From a pod's perspective, three things consume this storage and are accounted against its ephemeral-storage limit:
- The writable container layer. Every running container has a read-write filesystem layer on top of the read-only image layers. Anything the application writes outside a mounted volume goes here: temporary files in /tmp (when /tmp is not its own emptyDir), application logs written to non-stdout paths, files dropped by build steps inside the container.
- Container logs. Stdout and stderr from each container are written to files on the node by the container runtime. The kubelet rotates these files and counts them against the pod's ephemeral storage. The default is containerLogMaxSize: 10Mi per file with containerLogMaxFiles: 5 retained, so each container can consume up to ~50 MiB of node disk for logs alone.
- emptyDir volumes (non-tmpfs). An emptyDir mounted with the default disk-backed medium lives in the node's filesystem and counts toward ephemeral storage. An emptyDir mounted with medium: Memory is tmpfs and is accounted as memory, not storage.
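There is no single kubectl command that breaks this down per consumer, but a rough spot-check is possible: du inside the container for writable-layer paths, and du on the node for the kubelet-managed log directory. A minimal sketch, with the pod name, container name, and paths as placeholder assumptions:

```sh
# Spot-check writable-layer hotspots from inside the container
# (pod, container, and paths are examples -- adjust to the workload)
kubectl exec -n production <pod> -c app -- du -sh /tmp /var/cache 2>/dev/null

# Spot-check the kubelet-managed log files for the pod; run this on the node
sudo du -sh /var/log/pods/production_<pod>_*/
```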
Several volume types are explicitly not covered by ephemeral-storage limits, even though they look ephemeral:
- CSI ephemeral volumes are managed by third-party CSI drivers, not the kubelet. The Kubernetes documentation states this directly: they are "not covered by the storage resource usage limits of a Pod, because that is something that kubelet can only enforce for storage that it manages itself." See ephemeral volumes.
- Generic ephemeral volumes are dynamically provisioned PVCs that share the pod's lifecycle. Their capacity is governed by the underlying StorageClass and the PVC's resources.requests.storage, not by the pod's ephemeral-storage limit (see the sketch after this list).
- PersistentVolumes are not ephemeral by definition. For the lifecycle distinction, see Kubernetes PersistentVolumes and PersistentVolumeClaims.
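A minimal sketch of a generic ephemeral volume, for contrast. The StorageClass name is an assumption, and the 10Gi of capacity comes from the PVC's storage request rather than from any ephemeral-storage limit:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scratch-pvc-demo
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: scratch
          mountPath: /scratch
  volumes:
    - name: scratch
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: standard   # assumption: a StorageClass named "standard" exists
            resources:
              requests:
                storage: 10Gi            # capacity governed here, not by ephemeral-storage
```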
The container image cache on the node is also not charged to any single pod. It lives on the node's imagefs (or nodefs when there is no separate image filesystem) and is reclaimed by the kubelet's image garbage collector, not by per-pod limits.
Why pods get evicted under disk pressure
Two distinct mechanisms can terminate a pod for storage reasons. They produce different statuses, different causes, and different fixes.
Container or pod limit exceeded. When the writable layer plus logs plus emptyDir for a single container exceeds its ephemeral-storage limit, or when the sum across all containers in a pod exceeds the pod-level limit, the kubelet evicts the pod. The pod status shows Failed with reason Evicted and a message naming the limit that was crossed. This is the container-level enforcement introduced as alpha in Kubernetes 1.7, beta in 1.10, and GA in 1.25.
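Evicted pods stay in the API with phase Failed, so the limit that was crossed can be read back after the fact. A sketch using only standard kubectl output options:

```sh
# Evicted pods keep status.reason=Evicted and a message naming the crossed limit
kubectl get pods -A --field-selector=status.phase=Failed \
  -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,REASON:.status.reason,MESSAGE:.status.message' \
  | grep -i evicted
```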
Node-pressure eviction. When the node's filesystem fills up regardless of any individual pod's limit, the kubelet evicts pods to reclaim disk. The trigger is one of the node-pressure eviction signals:
| Signal | Default hard threshold | What it watches |
|---|---|---|
| nodefs.available | 10% | Free space on the kubelet's primary filesystem |
| nodefs.inodesFree | 5% | Free inodes on the primary filesystem |
| imagefs.available | 15% | Free space on the optional image filesystem |
| memory.available | 100Mi | Free memory (independent from disk pressure) |
| pid.available | 10% | Available process IDs |
Before evicting pods, the kubelet first tries to free space at the node level: removing unused container images (when imagefs is pressured) and cleaning up dead containers and pods (when nodefs is pressured). If that does not bring the node back below the threshold, pod eviction begins.
The selection order is fixed: failed pods first, then pods without resource requests, then pods exceeding their requests, then pods below their requests. Within a tier, BestEffort pods are evicted before Burstable, and Burstable before Guaranteed. This is why setting accurate ephemeral-storage requests on production pods materially changes the order in which the kubelet picks victims.
The two mechanisms are independent. A pod with no ephemeral-storage limit can still be evicted by node pressure. A pod under its limit can still be evicted if the node as a whole is full. Both are reported as evictions, but the message text and event reason differ.
Setting ephemeral-storage requests and limits
Requests and limits use the same units as memory: bytes, with binary (Mi, Gi) or decimal (M, G) suffixes. They live alongside CPU and memory in the container's resources block:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-aggregator
  namespace: production
spec:
  containers:
    - name: app
      image: registry.internal/app:1.42.0   # version pinned for reproducible deploys
      resources:
        requests:
          ephemeral-storage: "2Gi"          # scheduler reserves this
          memory: "256Mi"
        limits:
          ephemeral-storage: "4Gi"          # kubelet evicts if pod exceeds
          memory: "512Mi"
      volumeMounts:
        - name: scratch
          mountPath: /var/cache/app
  volumes:
    - name: scratch
      emptyDir:
        sizeLimit: 1Gi                      # per-volume cap, evicts on overflow
```
The scheduler treats requests.ephemeral-storage like any other resource request: it sums the requests of all containers in the pod and only places the pod on a node with enough allocatable ephemeral storage left. Allocatable storage is the node's filesystem capacity minus kube-reserved, system-reserved, and the eviction threshold, per the reserve compute resources guide. On nodes where the kubelet writes to a different filesystem than the cluster expects (a separate disk for /var/lib/kubelet), allocatable accounting can be wrong, and the scheduler will overcommit.
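A quick way to see what the scheduler is actually working with on a given node is to compare capacity against allocatable. A sketch using describe output (the node name is the example node used later in this article):

```sh
# Capacity vs. allocatable ephemeral storage for one node
kubectl describe node ip-10-0-1-42.eu-west-1.compute.internal \
  | grep -A 6 -E '^(Capacity|Allocatable):' | grep ephemeral-storage
```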
The emptyDir.sizeLimit field is a separate, finer-grained cap. The kubelet evicts the pod when the volume exceeds its sizeLimit, even if the pod's overall ephemeral-storage limit has not been reached. Use this to protect against one runaway emptyDir consuming the whole pod budget.
A LimitRange in the namespace can default ephemeral-storage for pods that omit it, and a ResourceQuota can cap the namespace total. Without a quota and without per-pod limits, ephemeral storage is best-effort and the first pod to fill the node takes everyone with it.
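A sketch of both objects side by side; the names and values are illustrative, not recommendations:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ephemeral-storage-defaults
  namespace: production
spec:
  limits:
    - type: Container
      default:                      # applied when a container omits limits
        ephemeral-storage: "2Gi"
      defaultRequest:               # applied when a container omits requests
        ephemeral-storage: "512Mi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ephemeral-storage-quota
  namespace: production
spec:
  hard:
    requests.ephemeral-storage: "50Gi"
    limits.ephemeral-storage: "100Gi"
```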
Monitoring ephemeral-storage usage per pod
kubectl top does not report ephemeral storage. The data lives in the kubelet's stats summary endpoint, reachable via kubectl get --raw:
```sh
NODE=ip-10-0-1-42.eu-west-1.compute.internal
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/stats/summary" | jq '.pods[] | {
  pod: .podRef.name,
  namespace: .podRef.namespace,
  used_bytes: .["ephemeral-storage"].usedBytes,
  capacity_bytes: .["ephemeral-storage"].capacityBytes
}'
```
Each pod's ephemeral-storage.usedBytes includes the writable layer, container logs, and emptyDir volumes combined. capacityBytes reflects the limit, or the node's available capacity when no limit is set.
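During an incident it is usually more useful to rank the pods on a node by usage than to dump every record. A jq sketch against the same endpoint:

```sh
# Top 10 pods on the node by ephemeral-storage usage, highest first
kubectl get --raw "/api/v1/nodes/${NODE}/proxy/stats/summary" \
  | jq -r '.pods
           | sort_by(.["ephemeral-storage"].usedBytes // 0) | reverse | .[:10][]
           | [.podRef.namespace, .podRef.name, (.["ephemeral-storage"].usedBytes // 0)] | @tsv'
```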
For continuous visibility, scrape these stats with a dedicated exporter. The community-maintained k8s-ephemeral-storage-metrics project exposes pod-level ephemeral storage as Prometheus metrics, feeding into the same alerting flow as container_memory_working_set_bytes.
One detail that catches teams off-guard: the kubelet measures ephemeral usage by periodic scanning, not in real time. It samples usage every few seconds, so a pod that writes 2 GiB in one burst can cross its limit and keep writing well past it before the next scan triggers eviction. Filesystem project quotas, when enabled in the kubelet config, reduce this lag, but they require a project-quota-capable filesystem (XFS, or ext4 with project-quota support) and explicit kubelet configuration (see the sketch below).
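A sketch of the kubelet side of that, assuming the node's filesystem already has project quotas enabled; the feature gate shown is the one documented for switching measurement from directory scans to quotas:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Track ephemeral usage with filesystem project quotas instead of periodic
  # du-style scans; requires XFS or ext4 with project quota support on the node.
  LocalStorageCapacityIsolationFSQuotaMonitoring: true
```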
Common causes of ephemeral-storage exhaustion
A handful of patterns produce most disk-pressure incidents I have seen on Kubernetes nodes.
Unbounded container logs. An application logging to stdout at a high rate fills the kubelet's log files faster than rotation can prune them. With the default containerLogMaxSize: 10Mi and containerLogMaxFiles: 5, each container is capped at ~50 MiB on disk, but a node running 50 containers (DaemonSets plus stacked sidecars) reaches 2.5 GiB from logs alone. A misconfigured logger writing JSON-per-line at 1 MB/s produces 86 GB per day per container; rotation runs, but the log directory still holds several rotated files at any moment.
Verbose log files written by the application directly. Logs that bypass stdout and land in /var/log/myapp/ inside the container go into the writable layer, not the kubelet-managed log directory. The kubelet does not rotate them. Without logrotate inside the container or a sidecar that ships and truncates them, the writable layer grows monotonically.
emptyDir volumes used as caches without cleanup. Build pods, image-processing services, and CI runners that use emptyDir as scratch space rarely clean up. The volume only gets reset when the pod terminates. Long-running pods with churning emptyDir contents fill the volume to its sizeLimit (or to the pod's overall limit if no sizeLimit is set).
Image pulls on small disks. When imagefs is on the same disk as nodefs (the typical single-filesystem case), every image pull eats from the same pool that pods use for logs and writable layers. A node with many distinct images cycled through it accumulates layers until image GC kicks in at imageGCHighThresholdPercent (85% by default). On small node disks, this threshold and the pod's free space race each other.
Crash loops generating dump files. Containers that crash repeatedly often write core dumps, heap dumps, or thread dumps to local paths. Each crash leaves behind a few hundred MB. A pod in CrashLoopBackOff with dump-on-exit enabled fills its writable layer within hours.
Prevention: log rotation, image cleanup, sizeLimit, node reservations
The defenses, in order of how often they actually matter:
- Set ephemeral-storage limits on every production container. Without limits, a pod is BestEffort for ephemeral storage and is the first thing evicted under node pressure.
- Cap emptyDir volumes with sizeLimit. Even when the pod-level limit covers the same ground, sizeLimit localizes the failure to the volume that misbehaves rather than evicting the whole pod when a single emptyDir runs away.
- Tune kubelet log rotation for high-rate loggers. containerLogMaxSize: 100Mi with containerLogMaxFiles: 3 gives 300 MiB of retained logs per container, which is more than enough for most workloads. Lowering containerLogMaxFiles to 2 or 3 helps on small nodes where 50 MiB times the container count adds up. (A combined kubelet configuration sketch follows this list.)
- Reserve ephemeral storage for the system and the kubelet. Set kubeReserved.ephemeral-storage and systemReserved.ephemeral-storage so the kubelet's allocatable accounting reflects what is actually available. Without this, the scheduler thinks the whole disk is for pods and overcommits.
- Tighten image GC thresholds on small disks. imageGCHighThresholdPercent: 75 and imageGCLowThresholdPercent: 70 start cleanup earlier than the 85/80 defaults, which gives more headroom before an image pull fails.
- Ship application-written logs off the node. Logs that bypass stdout need to leave the writable layer. A logging sidecar that tails the file and truncates it, or an application-side rolling appender with a small backlog, prevents the writable layer from growing forever.
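A sketch that combines the kubelet-side items from this list into one KubeletConfiguration; the numbers are illustrative starting points, not recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 100Mi          # per log file before rotation
containerLogMaxFiles: 3             # rotated files kept per container
kubeReserved:
  ephemeral-storage: "1Gi"          # disk held back for the kubelet itself
systemReserved:
  ephemeral-storage: "1Gi"          # disk held back for the OS and system daemons
imageGCHighThresholdPercent: 75     # start image GC earlier than the 85% default
imageGCLowThresholdPercent: 70      # GC until usage drops below this
```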
emptyDir with Memory medium: different accounting
The most common surprise with emptyDir is the medium swap. An emptyDir with medium: Memory is a tmpfs mount, not disk. Anything written to it lives in RAM, and the bytes count toward the container's memory limit, not its ephemeral-storage limit.
```yaml
volumes:
  - name: shared-tmpfs
    emptyDir:
      medium: Memory
      sizeLimit: 256Mi   # capped, but capped against memory
```
Since Kubernetes 1.22, the SizeMemoryBackedVolumes feature gate is on by default. With it on, sizeLimit is enforced for memory-backed emptyDir; without it, the volume could grow to whatever the node's tmpfs allowed.
Two practical implications:
- A pod with a 512 MiB memory limit and a 256 MiB tmpfs emptyDir has only 256 MiB of memory for the application. Filling the volume causes an OOM kill, not a disk-pressure eviction. For the difference between OOMKill and node-pressure eviction, see OOMKilled: Kubernetes out of memory errors explained.
- A tmpfs emptyDir does not protect against disk pressure on the node, but it also does not contribute to it. If the only reason a workload uses emptyDir is to avoid disk I/O, switching to medium: Memory removes the workload from the disk-pressure equation entirely.
The choice is a memory-vs-disk trade. Memory is faster and bounded by RAM. Disk is larger and bounded by the node's filesystem. There is no reason to pick medium: Memory for general scratch space; pick it when the application benefits from tmpfs latency or when keeping data off disk matters.
What ephemeral storage is NOT
Ephemeral storage is not the same as df -h on the node. Running df -h on a node shows the node's root filesystem, including system files, the container runtime's directories, and storage held by pods that have already terminated. It does not show what any individual container's writable layer is consuming. For per-pod numbers, use the kubelet's stats summary endpoint.
Ephemeral storage is not protected by the memory limit. Memory pressure and disk pressure are independent. A pod with a generous memory limit can still be evicted for filling the node's disk. Raising the memory limit does nothing for a pod that is filling the disk; the kubelet either evicts it under node pressure or, when an ephemeral-storage limit is set, evicts it sooner.
Stdout and stderr logs do count. It is tempting to assume that logs sent to standard streams are routed to a centralized logger and do not touch the node's disk. They do. The container runtime writes them to files in /var/log/pods/, the kubelet rotates them, and they are charged against the pod's ephemeral-storage usage until the pod is removed.
CSI ephemeral volumes do not count. A pod that mounts a CSI ephemeral volume can store gigabytes there without affecting its ephemeral-storage budget, because the kubelet does not measure it. The capacity comes from the CSI driver and the underlying storage. This is by design, but it surprises operators who set tight ephemeral-storage limits and then watch a pod use far more disk than expected.
An emptyDir is not persistent across pod replacements. Pod restart and container restart are different events. Container restart preserves the emptyDir; pod replacement (eviction, deletion, rescheduling to another node) does not. For storage that must survive pod replacement, use a PersistentVolumeClaim.
Setting a limit does not make the limit enforced instantly. Because kubelet measurement is periodic, a pod can briefly exceed its limit between scans without being evicted. Conversely, a pod just under its limit can still be evicted by node pressure if the cluster is full. Limits shape behavior; they do not act as realtime hard caps in the way memory cgroups do.