I spent an embarrassing amount of time last year convinced that CPU limits in Kubernetes
worked like memory limits — that setting limits.cpu: 500m meant a container
would be killed if it tried to use more than half a core. It doesn't work that way at all.
What actually happens is CPU throttling via the kernel's CFS (Completely Fair Scheduler) bandwidth control. The container gets a fixed CPU quota for each 100ms scheduling period, and if it exhausts that quota before the period ends, it gets throttled (descheduled) until the next period starts. Nobody kills your container. It just silently gets slower.
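If you want to see the raw numbers, the quota and period are just files in the container's cgroup. A minimal check, assuming a cgroup v1 node and a hypothetical pod named api:

kubectl exec api -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us   # 50000 (us) for a 500m limit
kubectl exec api -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us  # 100000 (us), the 100ms period
# on a cgroup v2 node both values live in one file:
kubectl exec api -- cat /sys/fs/cgroup/cpu.max                # "50000 100000"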
Why this matters for latency
The practical consequence: a container with limits.cpu: 500m can fully use
one CPU core for 50ms out of every 100ms period. If a request arrives at the wrong moment —
when the container has already burned through its quota — it waits up to 50ms before any
work starts. Not 50ms of actual compute, 50ms of doing nothing.
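Sketching the worst case on a timeline, using the same 500m numbers:

# t=0ms      period starts, 50ms of quota available
# t=0-50ms   container runs flat out and exhausts the quota
# t=50ms     a request arrives; no quota left, so nothing runs
# t=100ms    next period starts, quota refills, the request finally gets CPU
# worst case: ~50ms of added latency before the request does any work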
This shows up in p99 latency in a very specific pattern: random spikes in rough multiples of the per-period stall. With a 500m limit the stall is up to 50ms per period, so the spikes cluster around 50ms, 100ms, 150ms. If you see this pattern in your tail latencies and your CPU utilization looks fine, you're probably being CPU throttled. Check with:
kubectl top pods --containers
# also check throttling specifically, from inside the container (cgroup v1 path):
# cat /sys/fs/cgroup/cpu/cpu.stat | grep throttled
The cpu.stat file shows nr_throttled and throttled_time. If throttled_time is non-zero and growing, you found it. (On cgroup v2 the file is /sys/fs/cgroup/cpu.stat and the field is throttled_usec, but the idea is identical.)
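If Prometheus is scraping the kubelet/cAdvisor metrics you don't have to exec into pods at all. A query along these lines (the metric names are the standard cAdvisor ones; adjust labels to your setup) gives the fraction of CFS periods in which each container was throttled:

sum by (namespace, pod, container) (rate(container_cpu_cfs_throttled_periods_total[5m]))
  /
sum by (namespace, pod, container) (rate(container_cpu_cfs_periods_total[5m]))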
The fix I actually use
For latency-sensitive services: remove CPU limits entirely and rely on CPU requests for scheduling. This means the container can burst above its request allocation when the node has spare capacity, which is almost always the case. The risk is noisy-neighbor starvation, but in practice I've found this less of a problem than the throttling latency spikes.
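As a sketch, the resources stanza then looks something like this (names and values are placeholders, not a real service):

containers:
- name: api
  resources:
    requests:
      cpu: 500m        # used for scheduling and for the CFS share/weight
      memory: 512Mi
    limits:
      memory: 512Mi    # memory limit stays; only the CPU limit is dropped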
For batch/background workloads: keep limits, but set them higher than you think necessary. I usually set limits at 3-4x the request value, which gives headroom for bursts without the starvation risk. The 100ms window is short enough that brief bursts don't affect overall scheduling fairness much.
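And a batch worker, again with placeholder values, following the 3-4x rule of thumb:

containers:
- name: batch-worker
  resources:
    requests:
      cpu: 250m
    limits:
      cpu: "1"         # 4x the request: headroom for bursts, still capped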
What I got wrong about memory limits too
While I was down this rabbit hole I also revisited memory limits. Memory limits do kill containers, via the OOM killer, but the limit applies to everything charged to the container's memory cgroup, including page cache, not just your application's own allocations. If you have a container with a 512Mi limit and 400Mi of working set, a burst of file reads can fill the remaining 112Mi with page cache; the kernel will normally reclaim that cache under pressure, but if it can't reclaim fast enough you can still hit the limit even though your application "only" used 400Mi.
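You can see the breakdown in the container's memory cgroup (cgroup v1 paths, hypothetical pod name again):

kubectl exec api -- cat /sys/fs/cgroup/memory/memory.usage_in_bytes
kubectl exec api -- sh -c 'grep -E "^(cache|rss|inactive_file) " /sys/fs/cgroup/memory/memory.stat'
# working set is roughly usage minus inactive_file; that's what
# container_memory_working_set_bytes reports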
The container_memory_working_set_bytes metric is what Kubernetes actually
watches (not container_memory_usage_bytes). Make sure your alerts use the
right one.
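A matching alert expression, assuming kube-state-metrics (v2 metric naming) is available for the limit values, might look like:

  max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
/ max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
> 0.9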
Tooling note
kube-capacity is a kubectl plugin that shows requests and limits across nodes in a readable format. I use it almost daily. VPA (Vertical Pod Autoscaler) in recommendation mode is also worth running — not to act on all its suggestions, but to spot containers where the request/limit ratio is way off.
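kube-capacity installs as a krew plugin named resource-capacity; a typical invocation (assuming krew is set up) looks like:

kubectl krew install resource-capacity
kubectl resource-capacity --pods --util   # requests, limits, and live utilization per pod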
For the throttling specifically, I wrote a small Prometheus recording rule that tracks throttle ratio per container. If you want it, email me and I'll paste the YAML.