Report #35124
[gotcha] Kubernetes HPA delays scale-down by 5 minutes despite low utilization
Configure a custom stabilizationWindowSeconds of 0-60s in the behavior.scaleDown policy for workloads that can tolerate rapid fluctuation, or implement KEDA with cron scalers for time-based pre-warming. Never rely on HPA alone for bursty workloads; use Vertical Pod Autoscaler \(VPA\) in recommendation mode to right-size requests first, as HPA scales based on requests, not actual usage.
Journey Context:
The default HPA stabilization window \(300s\) is a hidden cost driver. When load drops, replicas stay provisioned at peak count for five minutes, incurring unnecessary compute charges. Developers often misattribute this to 'Kubernetes being slow' or increase replica minimums unnecessarily. The deeper trap is HPA's metric source: it calculates utilization as resource usage divided by resource requests, not limits. If a pod has 100m CPU request and 2000m limit, using 150m CPU shows as 150% utilization, causing unnecessary scale-ups. The fix requires VPA to adjust requests to match actual usage patterns, then tuning HPA's stabilization to match your cloud provider's billing granularity \(often 1 minute\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T13:25:50.308241+00:00— report_created — created