Requests vs. limits: scheduling vs. enforcement
Requests and limits control two different things at two different times. Mistaking one for the other is the root cause of most resource-related incidents.
Requests are a scheduling input. The kube-scheduler reads them to decide which node has enough room for a pod. A pod requesting 500m CPU and 256Mi memory will only land on a node that has at least that much unallocated. Once placed, the container can consume more than its request if the node has slack.
Limits are a runtime enforcement boundary. The kubelet and the Linux kernel enforce them after the pod is running. If a container tries to exceed its CPU limit, the kernel throttles it. If it exceeds its memory limit, the kernel kills it.
| Aspect | Requests | Limits |
|---|---|---|
| Used by | kube-scheduler (placement) | kubelet + kernel (runtime) |
| Timing | Before the pod starts | While the pod runs |
| Can be exceeded? | Yes, if node has slack | CPU: no (throttled). Memory: no (killed) |
| If omitted | No scheduling guarantee | No hard cap on usage |
| Affects QoS class | Yes | Yes |
One defaulting rule catches people off guard: if you set a limit but omit the request, Kubernetes sets the request equal to the limit. Do that for both CPU and memory in every container, and the pod becomes Guaranteed QoS, which is often unintentional and over-constrains scheduling.
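A minimal sketch of that limits-only case (values illustrative):

```yaml
# Only limits are declared; the API server defaults the missing requests to
# the same values, so this container effectively requests 500m / 256Mi. If
# every container in the pod looks like this, the pod lands in Guaranteed QoS.
resources:
  limits:
    cpu: "500m"
    memory: "256Mi"
```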
CPU requests and limits
CPU is a compressible resource. When a container hits its CPU limit, the kernel slows it down. The process survives; it just gets fewer cycles.
How CPU requests work
Under contention, the Linux Completely Fair Scheduler (CFS) distributes CPU time proportionally to requests. If Pod A requests 200m and Pod B requests 600m on the same node, A gets 25% and B gets 75% of available CPU when both compete for it.
Units: 1 CPU = 1000 millicores. 250m means 25% of one core's scheduling time per period.
How CPU limits work
Limits are enforced via CFS quota. On cgroup v1, this is cpu.cfs_quota_us and cpu.cfs_period_us. On cgroup v2, cpu.max. The default period is 100ms. A limit of 500m means the container may run for 50ms out of every 100ms window. If it exhausts that budget, the kernel throttles it until the next period starts, even if the node has 100% idle CPU.
That last point matters. CPU throttling is not a signal that the node is overloaded. It is a signal that the container's own limit is too low for its burst pattern.
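As a rough sketch of the quota arithmetic, assuming the default 100ms period:

```yaml
# A 500m CPU limit with the default 100ms (100000us) period:
#   quota = 0.5 core x 100000us = 50000us
# so the container's cgroup v2 cpu.max ends up holding "50000 100000".
resources:
  limits:
    cpu: "500m"
```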
The "should I set CPU limits?" debate
The Kubernetes community is split. The official Kubernetes blog argues limits are valuable for capacity planning predictability, especially in multi-tenant clusters. Others, including Sysdig and learnk8s, argue that CPU limits mostly hurt latency-sensitive services because CFS proportional sharing already protects against starvation through requests alone.
My position: set CPU requests always. Set CPU limits only when you have a specific reason (multi-tenant isolation, cost attribution, or batch workloads you want to bound). For latency-sensitive services, I have seen too many cases where CPU throttling caused latency spikes on nodes that were barely loaded.
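One way that stance can look in a manifest for a latency-sensitive service; the numbers are illustrative, not a recommendation:

```yaml
# CPU: request only, no limit -> CFS shares still protect neighbors under
# contention, and the pod can burst into idle CPU without being throttled.
# Memory: always capped, because memory has no graceful degradation.
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    memory: "768Mi"
```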
Go runtime gotcha
Before Go 1.25, the Go runtime set GOMAXPROCS to the node's total CPU count, not the container's CPU limit. On a 32-core node with a 1-core CPU limit, Go runs up to 32 OS threads in parallel. They exhaust the 100ms CFS quota almost instantly, causing severe throttling and amplified GC pauses. Fix: use uber-go/automaxprocs for Go < 1.25; Go 1.25+ handles this natively.
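A manifest-level mitigation some teams use alongside (or instead of) automaxprocs is to expose the CPU limit to the runtime through the Downward API. The env var name is the standard Go setting; the rounding to whole cores comes from the "1" divisor.

```yaml
# Sets GOMAXPROCS from the container's own CPU limit instead of the node's
# core count. With divisor "1", fractional limits are rounded up to a whole
# number of cores.
env:
- name: GOMAXPROCS
  valueFrom:
    resourceFieldRef:
      resource: limits.cpu
      divisor: "1"
```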
Memory requests and limits
Memory is an incompressible resource. When a container exceeds its memory limit, the kernel does not slow it down. It kills the process. No graceful degradation, no warning. The pod restarts with exit code 137 and reason OOMKilled.
How memory requests work
Requests reserve capacity for scheduling: the node must have at least this much memory available. On cgroup v2 clusters with the MemoryQoS feature gate enabled (alpha, requires cgroup v2), the request maps to memory.min, guaranteeing that amount can never be reclaimed by the kernel, even under node memory pressure.
Units: bytes with binary suffixes. Use Mi (mebibytes) and Gi (gibibytes), not M and G. The difference is roughly 5% at the Mi scale and over 7% at Gi, and Kubernetes interprets whichever suffix you write literally.
How memory limits work
Limits set the hard ceiling via memory.limit_in_bytes (cgroup v1) or memory.max (cgroup v2). Exceed it by a single byte, and the kernel OOM killer terminates the container. The kubelet detects the kill, logs the OOMKilled event, and restarts the container per its restart policy.
Two distinct OOM scenarios
- Container OOM. The container exceeds its own cgroup memory limit. The kernel kills that container. The pod restarts. This is the common case and appears as OOMKilled in kubectl describe pod (see the status sketch below).
- Node OOM. The node runs out of memory before the kubelet's eviction manager can act. The kernel OOM killer picks victim processes across the entire node. QoS class determines which pods die first. This is rarer but far more severe.
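For the container-level case, the kill is visible in the pod's status; a trimmed sketch of what kubectl get pod -o yaml reports (container name illustrative):

```yaml
status:
  containerStatuses:
  - name: app
    restartCount: 3
    lastState:
      terminated:
        exitCode: 137      # 128 + SIGKILL(9)
        reason: OOMKilled
```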
Why you should (almost) always set memory limits
Unlike CPU, where proportional sharing under contention provides natural protection, memory has no fallback. A container without a memory limit can grow until it triggers node-level OOM, taking down neighbors. Set memory limits for every production workload.
QoS classes and eviction priority
Kubernetes assigns a QoS class to every pod based on how requests and limits are configured. The QoS class determines eviction priority when the node runs out of resources.
Guaranteed
Every container in the pod has CPU and memory requests equal to their limits, and both are set. This is the most protected class. Guaranteed pods are evicted only after all Burstable and BestEffort pods have been removed. They are also eligible for exclusive CPU core assignment when the CPU Manager static policy is enabled.
The tradeoff: no burst capacity. The pod always pays the full scheduling cost of its resources, even when it uses a fraction.
```yaml
# Guaranteed QoS example
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"      # equals request
    memory: "256Mi"  # equals request
```
Burstable
At least one container has a request or limit, but the pod does not meet Guaranteed criteria. This is the most common QoS class in production. Burstable pods can consume beyond their requests when the node has slack, and they are evicted after BestEffort pods but before Guaranteed.
```yaml
# Burstable QoS example
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1000m"     # higher than request
    memory: "512Mi"  # higher than request
```
BestEffort
No container in the pod has any request or limit for CPU or memory. First to be evicted under any node pressure. Acceptable for non-critical batch jobs or dev environments. Not for production services.
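For completeness, BestEffort is simply the absence of any resources stanza (name and image illustrative):

```yaml
# No requests, no limits, for every container in the pod -> BestEffort QoS,
# first in line for eviction under node pressure.
containers:
- name: scratch-job
  image: busybox
  command: ["sleep", "3600"]
  # no resources: block at all
```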
Eviction order within the same QoS class
When the kubelet needs to evict pods and multiple pods share the same QoS class, it ranks them by (current_usage - request) / request. The pod consuming the most relative to its request is evicted first: a pod requesting 100Mi but using 400Mi (ratio 3.0) goes before one requesting 1Gi and using 1.2Gi (ratio ~0.2). Setting accurate requests matters: a pod with low requests and high actual usage becomes the top eviction candidate in its class.
What requests and limits are not
This section exists to prevent the most common misunderstandings I see.
Limits do not affect scheduling. The kube-scheduler does not read limits. A pod with request: 100m and limit: 8000m will be scheduled to a node with just 100m available, then potentially burst to 8 cores and starve neighbors. The official documentation is explicit: "the kube-scheduler uses [requests] to decide which node to place the Pod on."
CPU limits are not what protects the node from starvation. Under contention, CFS distributes CPU proportionally to requests, so a pod without limits cannot hog the node when neighbors are actually competing for CPU. Limits cap individual containers whether or not there is contention; they add a ceiling, not protection that requests don't already provide.
"OOMKilled" is not the same as "evicted." Eviction is a kubelet-initiated process that respects pod disruption budgets (soft eviction) or fires immediately (hard eviction). OOMKill is a kernel signal that terminates a process because it exceeded its cgroup memory limit. They are separate mechanisms with different causes and different remediation paths.
Memory is not "throttled." Only CPU is throttled (slowed down). Memory is killed. If someone says "the pod was throttled for memory," they probably mean it was OOMKilled, which is a fundamentally different failure mode.
Setting initial values
Starting from scratch with no production data? Three methods, in order of preference.
VPA recommendation mode
Deploy the Vertical Pod Autoscaler with updateMode: "Off" alongside your workload. Let it collect usage data for 7 to 14 days. Read the recommendations with kubectl describe vpa <name>. Validate the numbers against actual load patterns before applying.
Set minAllowed and maxAllowed bounds to prevent runaway recommendations. And do not use VPA and HPA on the same resource dimension (both targeting CPU causes conflicting scaling decisions).
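A recommendation-only VPA might look like the sketch below; the Deployment name and the bounds are illustrative, and the VPA CRDs must already be installed in the cluster.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Off"        # recommend only; never evict or patch pods
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "50m"
        memory: "64Mi"
      maxAllowed:
        cpu: "2"
        memory: "2Gi"
```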
Load testing with profiling
Deploy with generous initial limits. Run representative load tests (k6, Gatling, Locust) against the pod. Monitor container_memory_working_set_bytes and container_cpu_usage_seconds_total in Prometheus. Take P95 memory usage under peak load as the memory request. Add a 20% buffer for the memory limit. Take P95 CPU usage as the CPU request. Decide on a CPU limit based on the workload type (latency-sensitive: no limit or high headroom; batch: tight limit).
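The end result of that method, assuming hypothetical P95 readings of roughly 400m CPU and a 400Mi working set under peak load for a latency-sensitive service, would be a stanza along these lines:

```yaml
resources:
  requests:
    cpu: "400m"        # ~P95 CPU usage from the load test
    memory: "400Mi"    # ~P95 working set from the load test
  limits:
    memory: "480Mi"    # request plus ~20% buffer
    # no CPU limit: latency-sensitive workload, generous headroom preferred
```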
kubectl top as a starting point
For running workloads with no observability stack yet, kubectl top pods gives real-time usage. Compare against current requests and limits. A rough starting point: request = 1.1 to 1.2x average observed usage.
| Workload type | CPU request | CPU limit | Memory request | Memory limit |
|---|---|---|---|---|
| Stateless web API | 100–250m | None or 2–4x request | 128–512Mi | 1.5–2x request |
| Background worker | 100–500m | 2x request | 256Mi–1Gi | 1.2–1.5x request |
| Database sidecar | 50–100m | 2x request | 64–128Mi | 2x request |
| Critical stateful | Expected usage | = request (Guaranteed) | Expected usage | = request (Guaranteed) |
| Batch job | 100–500m | 2x request | 256Mi–2Gi | = request |
These are starting points. Replace them with observed data as soon as you have it.
Overcommit strategy
Overcommit occurs when the sum of container limits across a node exceeds physical node capacity. The sum of requests can never exceed node capacity, because the scheduler prevents it. But limits can.
Why overcommit works
Applications exhibit bursty usage patterns. Web services typically consume 10 to 30% of allocated resources during normal operation, with spikes to 60 to 80%. If every pod ran at its limit simultaneously, the node would be overloaded. In practice, that rarely happens, and overcommit improves cluster density.
CPU vs. memory overcommit
CPU overcommit is low risk. Under contention, CFS shares protect pods proportionally to their requests. A node with 200% CPU overcommit (limits sum to 2x node CPU) handles contention gracefully through throttling.
Memory overcommit is high risk. There is no graceful degradation. If pods collectively exceed available memory, the kernel starts killing processes. Conservative memory overcommit (limits = 1.0 to 1.5x requests) is the norm.
The fixed-fraction headroom pattern
The official Kubernetes blog recommends limits = requests x (1 + small_percentage). Setting limits at 1.1x to 1.2x requests gives pods a small burst budget while bounding total overcommit. This produces Burstable QoS (not Guaranteed), which is the right tradeoff for most workloads.
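Expressed as a manifest, with illustrative numbers:

```yaml
# Fixed-fraction headroom: limits roughly 1.2x requests -> Burstable QoS
# with a bounded contribution to node overcommit.
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "600m"        # 1.2x request
    memory: "614Mi"    # ~1.2x request
```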
Where to go next
This article covers the core mechanics. For specific failure modes and operational tasks built on these concepts:
- When a pod restarts with exit code 137, the OOMKilled troubleshooting guide walks through diagnosis and memory limit sizing
- When latency spikes without visible load, the CPU throttling guide covers CFS mechanics, how to read throttling metrics, and arguments for removing CPU limits
- When the cluster bill is too high but the workloads are healthy, the cost optimization guide walks through rightsizing, namespace quotas, and spot instance integration