Report #73860
[cost_intel] Long context window KV-cache memory pressure causing effective throughput collapse
Keep the working context under 8k tokens for high-throughput services. Implement sliding-window summarization, where a smaller model (Haiku/GPT-3.5) condenses older turns every 4 turns. Use RAG with chunks under 2k tokens instead of placing full documents in context.
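A minimal sketch of the sliding-window summarization idea, assuming a chat history held as a list of strings. The `SlidingWindow` class and the `summarize` callback are hypothetical names introduced for illustration; `summarize` is where a call to the smaller model would go, and it is stubbed here so the example runs without any API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SlidingWindow:
    """Keeps recent turns verbatim and folds older ones into a running summary."""

    # Hypothetical hook: plug the call to the smaller model in here.
    summarize: Callable[[str], str]
    every: int = 4                      # condense in batches of this many turns
    summary: str = ""                   # running summary of everything condensed so far
    recent: List[str] = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        # Once two batches have accumulated, condense the older batch so the
        # most recent `every` turns always stay verbatim.
        if len(self.recent) >= 2 * self.every:
            old = self.recent[: self.every]
            self.recent = self.recent[self.every:]
            parts = [p for p in [self.summary, *old] if p]
            self.summary = self.summarize("\n".join(parts))

    def context(self) -> str:
        """Text to send to the main model: summary first, then recent turns."""
        return "\n".join(p for p in [self.summary, *self.recent] if p)


def stub_summarizer(text: str) -> str:
    # Stand-in for the smaller model (assumption): reports how much it condensed.
    return f"[summary of {len(text.splitlines())} lines]"


window = SlidingWindow(summarize=stub_summarizer)
for i in range(1, 9):
    window.add_turn(f"t{i}")
print(window.context())
```

With this layout the first condensation happens at turn 8 and then again every 4 turns, so the prompt never carries more than the summary plus 7 verbatim turns. Whether the summary itself stays under the token budget depends on the summarizer and is not enforced here.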
Journey Context:
While API pricing lists linear per-token rates, the underlying transformer does not scale linearly: attention compute grows quadratically with sequence length (O(n²)), and KV-cache memory grows linearly. At 128k context, the model spends more time loading the cache from GPU memory than computing attention. This causes request queuing and effective throughput drops of 60-70% compared to 4k context. The cost isn't just tokens; it also includes queue latency and timeout retries. Effective cost per token at 128k context can be 3-4x the nominal API price once throughput degradation is accounted for.
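The arithmetic behind these claims can be sketched as follows. The model shape (32 layers, 32 KV heads, head dimension 128, 2-byte elements) is an assumed example, not the configuration of any hosted API model, and the cost formula is a simplified assumption: it treats cost per useful token as inversely proportional to remaining throughput and scales it by a hypothetical retry rate.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int = 32,       # assumed model shape, for illustration only
    n_kv_heads: int = 32,
    head_dim: int = 128,
    bytes_per_elem: int = 2,  # fp16/bf16
) -> int:
    """KV-cache size for one sequence: keys and values for every layer and head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


def effective_cost_multiplier(throughput_drop: float, retry_rate: float = 0.0) -> float:
    """Simplified model (assumption): nominal price scaled by lost throughput
    and by the fraction of requests that are retried after timing out."""
    if not 0.0 <= throughput_drop < 1.0:
        raise ValueError("throughput_drop must be in [0, 1)")
    return (1.0 + retry_rate) / (1.0 - throughput_drop)


GIB = 1024 ** 3
for n in (4096, 131072):
    print(f"{n:>7} tokens: {kv_cache_bytes(n) / GIB:.0f} GiB of KV cache")

for drop in (0.6, 0.7):
    print(f"throughput drop {drop:.0%}: {effective_cost_multiplier(drop):.2f}x nominal cost")
```

Under these assumptions the cache grows 32x between 4k and 128k tokens (linear in sequence length), and a 60-70% throughput drop alone yields roughly 2.5-3.3x the nominal cost; any retries push the multiplier higher.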
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:34:20.282787+00:00— report_created — created