How CFS bandwidth control works
The Linux Completely Fair Scheduler (CFS) enforces CPU limits through two cgroup parameters: a period and a quota. Kubernetes sets the period to 100,000 microseconds (100 ms) by default. The quota is derived from the pod spec's CPU limit.
A CPU limit of 500m translates to a quota of 50 ms per 100 ms period. The container may consume up to 50 ms of CPU time, summed across all its threads, within each period window. Once the quota is exhausted, the kernel halts every thread until the next period boundary. Unused quota does not roll over. There is no borrowing from future periods.
On cgroups v1, these values live in cpu.cfs_quota_us and cpu.cfs_period_us. On cgroups v2, both are combined into a single file cpu.max (e.g. 50000 100000). The enforcement model is identical in both versions.
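The conversion is simple arithmetic, and it helps to see it spelled out. The sketch below (illustrative Go, with helper names of my own) turns a millicore limit into the quota the kubelet would configure for the default 100 ms period, and prints how the pair appears under cgroups v1 and v2.

```go
package main

import "fmt"

const periodUs = 100_000 // default CFS period: 100 ms, expressed in microseconds

// quotaUs converts a CPU limit in millicores into the CFS quota in microseconds
// for one enforcement period: 1000m (one full CPU) maps to one whole period.
func quotaUs(limitMillicores int64) int64 {
	return limitMillicores * periodUs / 1000
}

func main() {
	for _, m := range []int64{500, 2000} { // the 500m and 2-CPU examples from the text
		q := quotaUs(m)
		fmt.Printf("limit %5dm -> quota %6d us per %d us period\n", m, q, periodUs)
		fmt.Printf("  cgroups v1: cpu.cfs_quota_us=%d, cpu.cfs_period_us=%d\n", q, periodUs)
		fmt.Printf("  cgroups v2: cpu.max contains %q\n", fmt.Sprintf("%d %d", q, periodUs))
	}
}
```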
Why low average CPU does not mean no throttling
This is the central misconception. Three mechanisms explain the gap between dashboard averages and real throttling.
Burstiness within the 100 ms window. Monitoring tools (Grafana, kubectl top, cloud consoles) typically display 1-minute or longer averages. A workload that needs 60 ms of CPU within a single 100 ms period burns through its 50 ms quota partway in, is frozen until the period ends, and finishes the leftover work only in the next period. Across the minute, average CPU looks like 5%. The throttle ratio, which counts only the periods in which the container actually had runnable threads, reports more than half of them as throttled.
Multithreaded workloads amplify quota exhaustion. Quota is shared across all threads. Ten threads, each needing 50 ms on separate cores, with a 2 CPU limit (200 ms quota), collectively exhaust that quota in 20 ms of wall time. They are then frozen for the remaining 80 ms. Reported average CPU: exactly 2.0. Completion time: more than tripled.
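The multithreaded effect is easy to reproduce. The following Go sketch spawns ten CPU-bound goroutines doing a fixed chunk of work each; run it once unconstrained and once in a container with a 2-CPU limit and compare the wall times. The iteration count is an assumption and should be calibrated so one chunk costs roughly 50 ms of CPU on your hardware.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// burn performs a fixed amount of CPU-bound work (a cheap integer recurrence),
// so the total CPU time demanded is the same however the scheduler slices it.
func burn(iters int) uint64 {
	x := uint64(1)
	for i := 0; i < iters; i++ {
		x = x*6364136223846793005 + 1442695040888963407
	}
	return x
}

func main() {
	const workers = 10
	const iters = 50_000_000 // assumption: roughly 50 ms of CPU per worker; machine-dependent

	results := make([]uint64, workers) // keep results so the work is not optimized away
	start := time.Now()

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = burn(iters)
		}(i)
	}
	wg.Wait()

	var sum uint64
	for _, r := range results {
		sum += r
	}
	fmt.Printf("checksum %d: %d workers finished in %v of wall time\n", sum, workers, time.Since(start))
}
```

Under the limit, the dashboard will report usage pinned at 2.0 cores while the wall time stretches out, which is exactly the gap described above.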
The scrape interval smooths spikes into nothing. A standard Prometheus rate() over [5m] aggregates 300 seconds of data. A workload throttled for 80 ms out of every 500 ms can still sit below the alerting threshold on a throttle-percentage graph, but its p99 latency is dominated by those pauses. Average metrics lie; percentile latency tells the truth.
Which workloads get hit hardest
Any workload with short, intense CPU bursts is a candidate. The most common offenders:
- JVM applications. Garbage collection, especially G1 GC, creates 10 to 50 ms CPU spikes. Under a tight CPU limit, GC threads themselves are throttled, extending stop-the-world pauses, building up Old Gen pressure, and cascading into service timeouts.
- Go applications. GOMAXPROCS defaults to the node's core count rather than the container's limit, and the runtime targets roughly 25% of GOMAXPROCS for GC parallelism. Under a tight CPU limit those extra runtime threads burn quota quickly, and the resulting throttling shows up as GC pressure and inflated tail latency (see the sketch after this list).
- HTTP APIs during request spikes. Each burst of incoming requests concentrates CPU use into narrow windows. The 100 ms enforcement period creates artificial queuing.
- JIT-compiled runtimes (Node.js V8, JVM JIT). Startup and warmup phases are CPU-intensive bursts that look nothing like steady-state averages.
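For the Go case specifically, aligning GOMAXPROCS with the container's quota removes most of the runtime-induced bursts. The sketch below assumes cgroups v2 and a limit that is actually set; in production the go.uber.org/automaxprocs package does the same thing via a blank import.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

func main() {
	// cpu.max holds "<quota_us> <period_us>", or "max <period_us>" when no limit is set.
	// Inside a container with its own cgroup namespace this path points at the
	// container's cgroup; adjust it if you mount the host cgroup tree elsewhere.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err == nil {
		fields := strings.Fields(string(data))
		if len(fields) == 2 && fields[0] != "max" {
			quota, qErr := strconv.Atoi(fields[0])
			period, pErr := strconv.Atoi(fields[1])
			if qErr == nil && pErr == nil && period > 0 {
				if procs := quota / period; procs >= 1 {
					// Cap scheduler and GC parallelism at the whole CPUs the quota allows.
					runtime.GOMAXPROCS(procs)
				}
			}
		}
	}
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```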
How to detect throttling
Kubernetes does not surface throttling in kubectl describe pod or kubectl top. The silence is the problem: many teams run throttled pods for weeks, diagnosing the symptoms as slow network or database contention.
Detection requires Prometheus (see the Prometheus monitoring tutorial). The key metrics, exposed by cAdvisor:
| Metric | What it measures |
|---|---|
| container_cpu_cfs_throttled_periods_total | Number of periods in which the container was throttled |
| container_cpu_cfs_periods_total | Total elapsed enforcement periods |
| container_cpu_cfs_throttled_seconds_total | Cumulative CPU time lost to throttling |
The throttled-period ratio is the standard detection query:
100 * (
sum by (namespace, pod, container) (
rate(container_cpu_cfs_throttled_periods_total{container!="", container!="POD"}[5m])
)
/
sum by (namespace, pod, container) (
rate(container_cpu_cfs_periods_total{container!="", container!="POD"}[5m])
)
)
The kubernetes-mixin CPUThrottlingHigh alert fires at 25% throttled periods. In practice, anything above 25% sustained for 15 minutes warrants investigation. Above 50% is almost certainly impacting latency.
To confirm, compare the throttle ratio with p99 latency for the same service. If both spike together, throttling is almost certainly the cause.
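The same ratio can be pulled outside Grafana, for example from an on-call script or a CI smoke test, by running the query against the Prometheus HTTP API. A minimal sketch in Go; the Prometheus URL is a placeholder for your own endpoint.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	const promURL = "http://prometheus.example.internal:9090" // placeholder: your Prometheus endpoint

	// The throttled-period ratio from above, sent as an instant query.
	query := `100 * (
  sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total{container!="", container!="POD"}[5m]))
/
  sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total{container!="", container!="POD"}[5m]))
)`

	resp, err := http.Get(promURL + "/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON vector: one sample per container with its throttle percentage
}
```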
Cgroups v2: what changed and what did not
Kubernetes 1.25 made cgroups v2 GA (August 2022). Most modern distributions (Ubuntu 21.10+, Debian 11+, RHEL 9+, Fedora 31+) default to v2. The requirements: Linux kernel 5.8+, containerd 1.4+ or CRI-O 1.20+, and the systemd cgroup driver.
What changed: the filesystem paths moved from /sys/fs/cgroup/cpu,cpuacct/kubepods/ to /sys/fs/cgroup/kubepods/. Quota and period merged into cpu.max. The cpu.stat file switched from nanoseconds to microseconds. Pressure Stall Information (PSI) became available.
What did not change: the 100 ms default period, the fundamental throttling behavior, the Prometheus metric names (cAdvisor abstracts the v1/v2 difference), and the user-space CPU limit syntax in pod specs. If your tooling reads cgroup files directly, update it. If you rely on Prometheus, nothing changes.
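For a quick spot check against the raw counters, whether from a node debug session or inside the container, the v2 cpu.stat file can be read directly. A minimal sketch; these are the same nr_periods, nr_throttled, and throttled_usec counters that cAdvisor re-exports as the metrics above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// On cgroups v2 the container's own stats live here; on v1 the equivalent
	// counters sit in the cpu,cpuacct controller's cpu.stat, with time in nanoseconds.
	f, err := os.Open("/sys/fs/cgroup/cpu.stat")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	stats := map[string]uint64{}
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 {
			if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
				stats[fields[0]] = v
			}
		}
	}

	periods, throttled := stats["nr_periods"], stats["nr_throttled"]
	fmt.Printf("nr_periods=%d nr_throttled=%d throttled_usec=%d\n", periods, throttled, stats["throttled_usec"])
	if periods > 0 {
		fmt.Printf("throttled-period ratio: %.1f%%\n", 100*float64(throttled)/float64(periods))
	}
}
```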
Remediation options and tradeoffs
There is no single right answer. Each option trades something different.
Raise CPU limits. The simplest fix. If the throttle percentage is consistently above 25%, increase the limit to 2 to 4x the request. Monitor the throttle ratio afterward; it should drop sharply. The risk: in multi-tenant clusters, higher limits compete with other pods for node CPU headroom. But limits do not affect scheduling, only enforcement. See resource requests and limits for the full mechanics.
Remove CPU limits entirely. The most aggressive option. With requests still set but no limit, the pod runs as Burstable QoS and can consume any idle CPU on the node. For latency-sensitive services on dedicated or lightly shared nodes, this is often the right call. On shared multi-tenant clusters, a runaway pod can starve neighbors, and dropping from Guaranteed to Burstable changes the pod's eviction priority under node pressure. Note that the HPA is largely unaffected: its CPU utilization target is computed against requests, not limits.
Adjust the CFS period. The kubelet's cpuCFSQuotaPeriod setting (default 100ms) can be increased to 200 ms or 500 ms. A wider window lets bursty workloads accumulate more CPU time before hitting the ceiling. The tradeoff: it is a node-wide setting (not per-pod), it reduces scheduling fairness between competing processes, and some managed Kubernetes services do not expose it.
CPU burst (experimental). Linux 5.14 introduced cpu.cfs_burst_us (v1) / cpu.max.burst (v2), a credit system that accumulates unused quota from quiet periods. Kubernetes does not expose this as a native pod spec field yet (tracking issue). It can be set via a DaemonSet patching cgroup files, but this is not yet suitable for general production use without custom infrastructure.
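For completeness, what such a cgroup-patching DaemonSet boils down to is one file write per container, as in the hedged sketch below. The cgroup path is a hypothetical placeholder; real paths include the QoS class, pod UID, and container ID, and the file only exists on kernels with CFS burst support (5.14+).

```go
package main

import "os"

func main() {
	// Hypothetical container cgroup directory; resolve the real one from the pod UID and container ID.
	cgroupDir := "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/<pod>/<container>"

	// Allow up to 20 ms of unused quota to accumulate as burst credit (value in microseconds,
	// and it must not exceed the container's quota).
	if err := os.WriteFile(cgroupDir+"/cpu.max.burst", []byte("20000"), 0o644); err != nil {
		panic(err)
	}
}
```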
Use VPA for right-sizing. The Vertical Pod Autoscaler analyzes historical CPU usage and adjusts requests automatically. In Off mode it provides recommendations without enforcement. VPA modifies requests by default; limits require explicit RequestsAndLimits mode. It is a good starting point for workloads where you have no historical data.
What CPU throttling is NOT
Not a signal of node overload. A pod can be throttled on a node with 90% idle CPU. Throttling is per-container, enforced by the kernel against that container's own cgroup limit.
Not visible in standard Kubernetes tooling. kubectl describe pod does not show throttling events. kubectl top shows usage, not throttle rate. Without Prometheus, throttling is invisible.
Not the same as CPU contention. Under contention (multiple pods competing for CPU), the CFS distributes time proportionally to requests. That is sharing, not throttling. Throttling happens when a single container exceeds its own quota, regardless of what other pods are doing.
Not fixable by adding more nodes. Scaling horizontally does not help if the individual pod's CPU limit is too low for its burst pattern. The limit is per-container, not per-node.