Requests vs. limits: scheduling vs. enforcement
Requests and limits control two different things at two different times. Mistaking one for the other is the root cause of most resource-related incidents.
Requests are a scheduling input. The kube-scheduler reads them to decide which node has enough room for a pod. A pod requesting 500m CPU and 256Mi memory will only land on a node that has at least that much unallocated. Once placed, the container can consume more than its request if the node has slack.
Limits are a runtime enforcement boundary. The kubelet and the Linux kernel enforce them after the pod is running. If a container tries to exceed its CPU limit, the kernel throttles it. If it exceeds its memory limit, the kernel kills it.
| Aspect | Requests | Limits |
|---|---|---|
| Used by | kube-scheduler (placement) | kubelet + kernel (runtime) |
| Timing | Before the pod starts | While the pod runs |
| Can be exceeded? | Yes, if node has slack | CPU: no (throttled). Memory: no (killed) |
| If omitted | No scheduling guarantee | No hard cap on usage |
| Affects QoS class | Yes | Yes |
One defaulting rule catches people off guard: if you set a limit but omit the request, Kubernetes sets the request equal to the limit. Do that for both CPU and memory in every container, and the pod becomes Guaranteed QoS, which is often unintentional and over-constrains scheduling.
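A minimal sketch of that limits-only case (values illustrative):

```yaml
# Only limits are declared; the API server defaults the missing requests to
# the same values, so this container effectively requests 500m / 256Mi. If
# every container in the pod looks like this, the pod lands in Guaranteed QoS.
resources:
  limits:
    cpu: "500m"
    memory: "256Mi"
```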
CPU requests and limits
CPU is a compressible resource. When a container hits its CPU limit, the kernel slows it down. The process survives; it just gets fewer cycles.
How CPU requests work
Under contention, the Linux Completely Fair Scheduler (CFS) distributes CPU time proportionally to requests. If Pod A requests 200m and Pod B requests 600m on the same node, A gets 25% and B gets 75% of available CPU when both compete for it.
Units: 1 CPU = 1000 millicores. 250m means 25% of one core's scheduling time per period.
How CPU limits work
Limits are enforced via CFS quota. On cgroup v1, this is cpu.cfs_quota_us and cpu.cfs_period_us. On cgroup v2, cpu.max. The default period is 100ms. A limit of 500m means the container may run for 50ms out of every 100ms window. If it exhausts that budget, the kernel throttles it until the next period starts, even if the node has 100% idle CPU.
That last point matters. CPU throttling is not a signal that the node is overloaded. It is a signal that the container's own limit is too low for its burst pattern.
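As a rough sketch of the quota arithmetic, assuming the default 100ms period:

```yaml
# A 500m CPU limit with the default 100ms (100000us) period:
#   quota = 0.5 core x 100000us = 50000us
# so the container's cgroup v2 cpu.max ends up holding "50000 100000".
resources:
  limits:
    cpu: "500m"
```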
The "should I set CPU limits?" debate
The Kubernetes community is split. The official Kubernetes blog argues limits are valuable for capacity planning predictability, especially in multi-tenant clusters. Others, including Sysdig and learnk8s, argue that CPU limits mostly hurt latency-sensitive services because CFS proportional sharing already protects against starvation through requests alone.
My position: set CPU requests always. Set CPU limits only when you have a specific reason (multi-tenant isolation, cost attribution, or batch workloads you want to bound). For latency-sensitive services, I have seen too many cases where CPU throttling caused latency spikes on nodes that were barely loaded.
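One way that stance can look in a manifest for a latency-sensitive service; the numbers are illustrative, not a recommendation:

```yaml
# CPU: request only, no limit -> CFS shares still protect neighbors under
# contention, and the pod can burst into idle CPU without being throttled.
# Memory: always capped, because memory has no graceful degradation.
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    memory: "768Mi"
```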
Go runtime gotcha
Before Go 1.25, the Go runtime set GOMAXPROCS to the node's total CPU count, not the container's CPU limit. On a 32-core node with a 1-core CPU limit, Go runs up to 32 OS threads in parallel. They exhaust the 100ms CFS quota almost instantly, causing severe throttling and amplified GC pauses. Fix: use uber-go/automaxprocs for Go < 1.25; Go 1.25+ handles this natively.
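A manifest-level mitigation some teams use alongside (or instead of) automaxprocs is to expose the CPU limit to the runtime through the Downward API. The env var name is the standard Go setting; the rounding to whole cores comes from the "1" divisor.

```yaml
# Sets GOMAXPROCS from the container's own CPU limit instead of the node's
# core count. With divisor "1", fractional limits are rounded up to a whole
# number of cores.
env:
- name: GOMAXPROCS
  valueFrom:
    resourceFieldRef:
      resource: limits.cpu
      divisor: "1"
```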
Memory requests and limits
Memory is an incompressible resource. When a container exceeds its memory limit, the kernel does not slow it down. It kills the process. No graceful degradation, no warning. The pod restarts with exit code 137 and reason OOMKilled.
How memory requests work
Requests reserve capacity for scheduling: the node must have at least this much memory available. On cgroup v2 clusters with the MemoryQoS feature gate enabled (alpha, requires cgroup v2), the request maps to memory.min, guaranteeing that amount can never be reclaimed by the kernel, even under node memory pressure.
Units: bytes with binary suffixes. Use Mi (mebibytes) and Gi (gibibytes), not M and G. The difference is roughly 5% at the Mi scale and over 7% at Gi, and Kubernetes interprets whichever suffix you write literally.
How memory limits work
Limits set the hard ceiling via memory.limit_in_bytes (cgroup v1) or memory.max (cgroup v2). Exceed it by a single byte, and the kernel OOM killer terminates the container. The kubelet detects the kill, logs the OOMKilled event, and restarts the container per its restart policy.
Two distinct OOM scenarios
- Container OOM. The container exceeds its own cgroup memory limit. The kernel kills that container. The pod restarts. This is the common case and appears as OOMKilled in kubectl describe pod (see the status sketch below).
- Node OOM. The node runs out of memory before the kubelet's eviction manager can act. The kernel OOM killer picks victim processes across the entire node. QoS class determines which pods die first. This is rarer but far more severe.
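For the container-level case, the kill is visible in the pod's status; a trimmed sketch of what kubectl get pod -o yaml reports (container name illustrative):

```yaml
status:
  containerStatuses:
  - name: app
    restartCount: 3
    lastState:
      terminated:
        exitCode: 137      # 128 + SIGKILL(9)
        reason: OOMKilled
```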
Why you should (almost) always set memory limits
Unlike CPU, where proportional sharing under contention provides natural protection, memory has no fallback. A container without a memory limit can grow until it triggers node-level OOM, taking down neighbors. Set memory limits for every production workload.
QoS classes and eviction priority
Kubernetes assigns a QoS class to every pod based on how requests and limits are configured. The QoS class determines eviction priority when the node runs out of resources.
Guaranteed
Every container in the pod has CPU and memory requests equal to their limits, and both are set. This is the most protected class. Guaranteed pods are evicted only after all Burstable and BestEffort pods have been removed. They are also eligible for exclusive CPU core assignment when the CPU Manager static policy is enabled.
The tradeoff: no burst capacity. The pod always pays the full scheduling cost of its resources, even when it uses a fraction.
```yaml
# Guaranteed QoS example
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "500m"      # equals request
    memory: "256Mi"  # equals request
```
Burstable
At least one container has a request or limit, but the pod does not meet Guaranteed criteria. This is the most common QoS class in production. Burstable pods can consume beyond their requests when the node has slack, and they are evicted after BestEffort pods but before Guaranteed.
```yaml
# Burstable QoS example
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1000m"     # higher than request
    memory: "512Mi"  # higher than request
```
BestEffort
No container in the pod has any request or limit for CPU or memory. First to be evicted under any node pressure. Acceptable for non-critical batch jobs or dev environments. Not for production services.
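For completeness, BestEffort is simply the absence of any resources stanza (name and image illustrative):

```yaml
# No requests, no limits, for every container in the pod -> BestEffort QoS,
# first in line for eviction under node pressure.
containers:
- name: scratch-job
  image: busybox
  command: ["sleep", "3600"]
  # no resources: block at all
```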
Eviction order within the same QoS class
When the kubelet needs to evict pods and multiple pods share the same QoS class, it ranks them by (current_usage - request) / request. The pod consuming the most relative to its request is evicted first: a pod requesting 100Mi but using 400Mi (ratio 3.0) goes before one requesting 1Gi and using 1.2Gi (ratio ~0.2). Setting accurate requests matters: a pod with low requests and high actual usage becomes the top eviction candidate in its class.
What requests and limits are not
This section exists to prevent the most common misunderstandings I see.
Limits do not affect scheduling. The kube-scheduler does not read limits. A pod with request: 100m and limit: 8000m will be scheduled to a node with just 100m available, then potentially burst to 8 cores and starve neighbors. The official documentation is explicit: "the kube-scheduler uses [requests] to decide which node to place the Pod on."
CPU limits are not what protects the node from starvation. Under contention, CFS distributes CPU proportionally to requests, so a pod without limits cannot hog the node when neighbors are actually competing for CPU. Limits cap individual containers whether or not there is contention; they add a ceiling, not protection that requests don't already provide.
"OOMKilled" is not the same as "evicted." Eviction is a kubelet-initiated process that respects pod disruption budgets (soft eviction) or fires immediately (hard eviction). OOMKill is a kernel signal that terminates a process because it exceeded its cgroup memory limit. They are separate mechanisms with different causes and different remediation paths.
Memory is not "throttled." Only CPU is throttled (slowed down). Memory is killed. If someone says "the pod was throttled for memory," they probably mean it was OOMKilled, which is a fundamentally different failure mode.
Setting initial values
Starting from scratch with no production data? Three methods, in order of preference.
VPA recommendation mode
Deploy the Vertical Pod Autoscaler with updateMode: "Off" alongside your workload. Let it collect usage data for 7 to 14 days. Read the recommendations with kubectl describe vpa <name>. Validate the numbers against actual load patterns before applying.
Set minAllowed and maxAllowed bounds to prevent runaway recommendations. And do not use VPA and HPA on the same resource dimension (both targeting CPU causes conflicting scaling decisions).
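A recommendation-only VPA might look like the sketch below; the Deployment name and the bounds are illustrative, and the VPA CRDs must already be installed in the cluster.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Off"        # recommend only; never evict or patch pods
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: "50m"
        memory: "64Mi"
      maxAllowed:
        cpu: "2"
        memory: "2Gi"
```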
Load testing with profiling
Deploy with generous initial limits. Run representative load tests (k6, Gatling, Locust) against the pod. Monitor container_memory_working_set_bytes and container_cpu_usage_seconds_total in Prometheus. Take P95 memory usage under peak load as the memory request. Add a 20% buffer for the memory limit. Take P95 CPU usage as the CPU request. Decide on a CPU limit based on the workload type (latency-sensitive: no limit or high headroom; batch: tight limit).
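The end result of that method, assuming hypothetical P95 readings of roughly 400m CPU and a 400Mi working set under peak load for a latency-sensitive service, would be a stanza along these lines:

```yaml
resources:
  requests:
    cpu: "400m"        # ~P95 CPU usage from the load test
    memory: "400Mi"    # ~P95 working set from the load test
  limits:
    memory: "480Mi"    # request plus ~20% buffer
    # no CPU limit: latency-sensitive workload, generous headroom preferred
```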
kubectl top as a starting point
For running workloads with no observability stack yet, kubectl top pods gives real-time usage. Compare against current requests and limits. A rough starting point: request = 1.1 to 1.2x average observed usage.
| Workload type | CPU request | CPU limit | Memory request | Memory limit |
|---|---|---|---|---|
| Stateless web API | 100–250m | None or 2–4x request | 128–512Mi | 1.5–2x request |
| Background worker | 100–500m | 2x request | 256Mi–1Gi | 1.2–1.5x request |
| Database sidecar | 50–100m | 2x request | 64–128Mi | 2x request |
| Critical stateful | Expected usage | = request (Guaranteed) | Expected usage | = request (Guaranteed) |
| Batch job | 100–500m | 2x request | 256Mi–2Gi | = request |
These are starting points. Replace them with observed data as soon as you have it.
Overcommit strategy
Overcommit occurs when the sum of container limits across a node exceeds physical node capacity. The sum of requests can never exceed node capacity, because the scheduler prevents it. But limits can.
Why overcommit works
Applications exhibit bursty usage patterns. Web services typically consume 10 to 30% of allocated resources during normal operation, with spikes to 60 to 80%. If every pod ran at its limit simultaneously, the node would be overloaded. In practice, that rarely happens, and overcommit improves cluster density.
CPU vs. memory overcommit
CPU overcommit is low risk. Under contention, CFS shares protect pods proportionally to their requests. A node with 200% CPU overcommit (limits sum to 2x node CPU) handles contention gracefully through throttling.
Memory overcommit is high risk. There is no graceful degradation. If pods collectively exceed available memory, the kernel starts killing processes. Conservative memory overcommit (limits = 1.0 to 1.5x requests) is the norm.
The fixed-fraction headroom pattern
The official Kubernetes blog recommends limits = requests x (1 + small_percentage). Setting limits at 1.1x to 1.2x requests gives pods a small burst budget while bounding total overcommit. This produces Burstable QoS (not Guaranteed), which is the right tradeoff for most workloads.
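Expressed as a manifest, with illustrative numbers:

```yaml
# Fixed-fraction headroom: limits roughly 1.2x requests -> Burstable QoS
# with a bounded contribution to node overcommit.
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "600m"        # 1.2x request
    memory: "614Mi"    # ~1.2x request
```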
Where to go next
This article covers the core mechanics. For specific failure modes and operational tasks built on these concepts:
- When a pod restarts with exit code 137, the OOMKilled troubleshooting guide walks through diagnosis and memory limit sizing
- When latency spikes without visible load, the CPU throttling guide covers CFS mechanics, how to read throttling metrics, and arguments for removing CPU limits
- When the cluster bill is too high but the workloads are healthy, the cost optimization guide walks through rightsizing, namespace quotas, and spot instance integration