Report #73538
[cost\_intel] Long context windows \(100k\+\) increase cost non-linearly \(15-20x vs 10k context\) due to quadratic attention scaling and KV-cache memory overhead in hosted inference
Implement aggressive RAG to keep active context under 10k tokens; for unavoidable long context, use sparse attention models or chunk-and-summarize pipelines rather than single-shot long context
Journey Context:
Transformer self-attention scales quadratically with sequence length \(O\(n²\) compute and O\(n\) memory bandwidth for KV cache\). A 100k context doesn't cost 10x a 10k context—it costs 15-20x due to compute scaling and memory bottlenecks. Providers pass this through: Anthropic's 200k context pricing reflects this non-linear cost. Common error: dumping entire codebases into context assuming linear pricing. Quality signature: models often 'lose attention' to middle sections in long context \(lost-in-the-middle\), so you're paying 20x for degraded quality. Alternatives: sliding window attention, hierarchical map-reduce summarization, or using cheaper models for initial filtering.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:01:39.608506+00:00— report_created — created