Report #83455
[cost_intel] Exceeding GPU memory with long context forces quadratic recomputation: 2x context can mean roughly 4x latency per token
Set an explicit max_model_len in vLLM so that over-long requests fail loudly instead of silently exhausting memory; enable chunked prefill and prefix caching to limit recomputation.
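The three settings named above could be gathered in one place roughly as follows. This is a minimal sketch, not a verified configuration: the helper names `build_engine_kwargs` and `create_engine`, the placeholder model name, and the 0.90 memory fraction are assumptions for illustration, and the keyword names follow vLLM's engine arguments as commonly documented, which may differ between versions.

```python
from typing import Any, Dict


def build_engine_kwargs(model: str, max_model_len: int,
                        gpu_memory_utilization: float = 0.90) -> Dict[str, Any]:
    """Collect the engine settings named in this report into one dict.

    Hypothetical helper: it only builds keyword arguments and does not
    touch the GPU, so it can be checked without vLLM installed.
    """
    if max_model_len <= 0:
        raise ValueError("max_model_len must be a positive token count")
    return {
        "model": model,
        # Hard cap on prompt + generated tokens per request.
        "max_model_len": max_model_len,
        # Fraction of GPU memory the engine may claim (assumed default).
        "gpu_memory_utilization": gpu_memory_utilization,
        # Split long prefills into chunks instead of one large batch.
        "enable_chunked_prefill": True,
        # Reuse KV blocks for prompts that share a prefix.
        "enable_prefix_caching": True,
    }


def create_engine(kwargs: Dict[str, Any]):
    """Construct a vLLM engine; vLLM is imported lazily so the helper
    above stays usable on machines without it."""
    from vllm import LLM
    return LLM(**kwargs)


# Example call (placeholder model name, not taken from the report):
# llm = create_engine(build_engine_kwargs("your-org/your-model", max_model_len=8192))
```

Keeping the settings in a plain dict makes it easy to log the exact limits a server was started with, which helps when a request is later rejected for exceeding max_model_len.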
Journey Context:
When serving models locally with vLLM or a similar engine, the KV cache stores the attention keys and values of prior tokens so they do not have to be recomputed. GPU memory is finite, though. When the context exceeds available memory (or the allocated cache blocks), the system must either evict cache entries or recompute attention from scratch. The result is a catastrophic slowdown: processing position N requires attending to all prior positions, so if the cache is evicted at step 1024, each of steps 1025-2048 has to recompute attention over at least the 1024 evicted tokens, which makes the cost of the second half quadratic rather than linear. Developers tend to assume that longer context means linear cost, as it does with a hosted API, but local inference under memory pressure is non-linear. The fix is to set max_model_len conservatively so that over-long requests fail hard instead of degrading silently, to use vLLM's chunked prefill to manage memory better, and to enable prefix caching for repeated prompts.
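The gap between cached and uncached decoding can be illustrated with a toy count of query-key dot products per decode step. This is a sketch under simplifying assumptions: it counts attention work only (one layer, one head), ignores MLP cost and memory bandwidth, and does not model vLLM's actual eviction or preemption behavior. The function names are hypothetical.

```python
def cached_step_cost(t: int) -> int:
    """Dot products for decode step t when all prior keys/values are cached.

    Only the newest token issues a query, and it attends to t keys
    (the t-1 cached ones plus its own).
    """
    return t


def recompute_step_cost(t: int) -> int:
    """Dot products for decode step t when nothing is cached.

    All t positions are re-run with causal attention: position i attends
    to i keys, so the total is 1 + 2 + ... + t.
    """
    return t * (t + 1) // 2


def slowdown_when_doubling(cost_fn, t: int) -> float:
    """Ratio of per-step cost at context 2t to per-step cost at context t."""
    return cost_fn(2 * t) / cost_fn(t)


# With an intact cache, doubling the context doubles the per-token cost.
print(slowdown_when_doubling(cached_step_cost, 1024))

# With full recomputation, doubling the context costs close to 4x per token.
print(slowdown_when_doubling(recompute_step_cost, 1024))
```

Under this model the per-token cost grows linearly with context while the cache holds and quadratically once every step has to rebuild it, which is the 2x context, roughly 4x latency per token pattern in the report title.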
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T22:39:46.177201+00:00— report_created — created