Report #80414
[cost_intel] Doubling the context window from 4k to 8k increases effective cost per query by 2.5x, not 2x
Implement sliding-window truncation with hierarchical summarization: keep the last 2k tokens verbatim, summarize the middle section into a compact 200-token representation, and drop the oldest tokens. Use RAG to fetch external context rather than stuffing the full history into the prompt.
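The truncation scheme above can be sketched as follows. This is a minimal illustration, not a verified implementation. The names `build_context`, `truncating_summarizer`, and `middle_span` are hypothetical, tokens are modeled as a plain list of strings, and the summarizer is a placeholder for whatever model call would produce the real summary. The report does not say how large the middle section is, so the 4000-token default is an assumption.

```python
from typing import Callable, List

Summarizer = Callable[[List[str], int], List[str]]


def build_context(
    tokens: List[str],
    summarize: Summarizer,
    keep_recent: int = 2000,    # last 2k tokens kept verbatim (from the report)
    middle_span: int = 4000,    # assumed size of the middle section to summarize
    summary_budget: int = 200,  # compact representation size (from the report)
) -> List[str]:
    """Return [summary of middle] + [recent tokens]; older tokens are dropped."""
    if keep_recent <= 0 or middle_span <= 0:
        raise ValueError("keep_recent and middle_span must be positive")
    if len(tokens) <= keep_recent:
        return list(tokens)  # nothing to truncate

    recent = tokens[-keep_recent:]
    older = tokens[:-keep_recent]
    middle = older[-middle_span:]  # everything before this slice is dropped
    # Enforce the budget even if the summarizer returns too much.
    summary = summarize(middle, summary_budget)[:summary_budget]
    return summary + recent


def truncating_summarizer(chunk: List[str], budget: int) -> List[str]:
    """Placeholder only: keeps the first `budget` tokens instead of summarizing."""
    return chunk[:budget]


history = [f"t{i}" for i in range(10_000)]
context = build_context(history, truncating_summarizer)
print(len(context))
```

With the defaults, a 10,000-token history shrinks to 2,200 tokens. Retrieval (RAG) for external context would sit outside this function and is not shown here.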
Journey Context:
Transformer attention compute scales quadratically, O(n²), with sequence length, and providers can pass this through as non-linear per-token pricing. Longer contexts also put more pressure on KV-cache memory, which lowers cache hit rates and forces partial recomputation. In addition, longer contexts suffer 'lost in the middle' degradation, which can trigger expensive retry loops. The 2.5x multiplier comes from three sources: (1) O(n²) compute cost passed on to the consumer, (2) KV-cache misses that force recomputation of prefixes, and (3) quality degradation that requires regenerations. 128k-context models often carry 2-4x higher per-token rates than 4k versions of the same model.
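The decomposition above can be made concrete with a rough multiplicative model. The report does not give the size of each factor, so the inputs below are placeholders chosen only to show how a figure like 2.5x could arise from a 2x increase in tokens. The function names and the multiplicative form are assumptions, not a provider's billing formula.

```python
def attention_compute_ratio(old_len: int, new_len: int) -> float:
    """Relative attention compute under O(n^2) scaling with sequence length."""
    return (new_len / old_len) ** 2


def effective_cost_multiplier(
    token_ratio: float,         # new context size / old context size
    recompute_overhead: float,  # assumed extra billed fraction from cache misses
    retry_rate: float,          # assumed extra generations per query
) -> float:
    """Hypothetical model: token growth times recomputation times retries."""
    return token_ratio * (1.0 + recompute_overhead) * (1.0 + retry_rate)


# Doubling 4k -> 8k: attention compute grows 4x, while billed tokens grow 2x.
print(attention_compute_ratio(4000, 8000))

# Placeholder inputs: 2x tokens, no recompute overhead, 25% more regenerations.
print(effective_cost_multiplier(2.0, 0.0, 0.25))
```

Measuring the real retry rate and cache hit rate for a given workload would replace the placeholder inputs and show which factor dominates.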
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:34:50.304963+00:00— report_created — created