Report #42410
[gotcha] Kubernetes HPA scale-down stabilization window causing premature pod termination and flapping
Explicitly set `behavior.scaleDown.stabilizationWindowSeconds` in the HPA manifest to a value longer than the typical lull between your application's load spikes (e.g., 600s when lulls last a few minutes; the field accepts at most 3600s). Also add a `policies` entry with `type: Percent`, a low `value` (e.g., 10), and `periodSeconds: 60` to cap the scale-down rate at 10% of replicas per minute, so that a sudden drop does not overwhelm the remaining pods.
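A minimal sketch of a manifest that applies this fix, assuming the `autoscaling/v2` API is available in the cluster. The Deployment name `batch-worker`, the replica bounds, and the CPU target are hypothetical placeholders, not values from this report; only the `behavior.scaleDown` block reflects the recommendation itself.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: batch-worker              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: batch-worker            # hypothetical target Deployment
  minReplicas: 2                  # placeholder bounds, tune per workload
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # placeholder target
  behavior:
    scaleDown:
      # Act on the highest recommendation seen in the last 10 minutes,
      # so a short lull between spikes does not trigger a scale-down.
      stabilizationWindowSeconds: 600
      policies:
      # Remove at most 10% of current replicas per 60-second period.
      - type: Percent
        value: 10
        periodSeconds: 60
```

After applying, `kubectl describe hpa batch-worker` should show the configured behavior and recent scaling events, which helps confirm whether flapping has stopped. Treat the numbers as a starting point and adjust them to the observed spike pattern.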
Journey Context:
The Kubernetes Horizontal Pod Autoscaler (HPA) applies different stabilization windows to scale-up and scale-down to prevent thrashing. By default, scale-up has no stabilization window (it reacts to metrics immediately), while scale-down uses a 300-second (5-minute) window: the HPA acts on the highest replica recommendation computed during the last 5 minutes, so pods are removed only after the desired count has stayed below the current count for the whole window. The gotcha emerges with cyclical or spiky workloads: if a load spike lasts 10 minutes and is followed by a 6-minute lull, the default 300s window expires during the lull and triggers scale-down just before the next spike arrives. This causes 'flapping', a rapid scale-down followed by an immediate scale-up, which disrupts service (new pods need time to warm up) and wastes compute on repeated scheduling and image pulling. Developers often assume the HPA 'just works' and do not realize that the 300s default is a generic value that is often too short for real-world batch patterns. The fix is to configure the `behavior` field explicitly (introduced in Kubernetes 1.18 with `autoscaling/v2beta2`, and part of the stable `autoscaling/v2` API since 1.23) and set a custom `stabilizationWindowSeconds` for scale-down (e.g., 600s, up to the 3600s maximum, chosen to outlast the lull between spikes). Crucially, also add a `policies` constraint that limits the percentage of pods that can be removed per unit of time (e.g., at most 10% per minute). By default, scale-down may remove all surplus pods at once, so without such a cap 90% of pods can be terminated simultaneously, leaving the remaining 10% overwhelmed with traffic and causing cascading failures.
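For comparison, the default behavior described in the context above corresponds roughly to the following `behavior` block, as given in the upstream HPA documentation. This is a reference sketch, assuming the cluster has not changed the scale-down default through the `--horizontal-pod-autoscaler-downscale-stabilization` controller-manager flag; verify against the documentation for your Kubernetes version.

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # default 5-minute scale-down window
    policies:
    - type: Percent
      value: 100                      # all surplus pods may go in one step
      periodSeconds: 15
  scaleUp:
    stabilizationWindowSeconds: 0     # no window: react immediately
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
    - type: Pods
      value: 4
      periodSeconds: 15
    selectPolicy: Max                 # use whichever policy allows more pods
```

The `value: 100` scale-down policy is what makes the rate cap matter: lengthening the window only delays the scale-down, while the policy limits how much is removed once it starts.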
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:39:27.252503+00:00— report_created — created