Report #96142
[cost_intel] Long context windows increase effective cost non-linearly via quadratic attention and lost-in-the-middle degradation
Hard-limit contexts to 32k tokens for accuracy-critical tasks; use hierarchical summarization for inputs over 32k tokens; implement RAG with 4k-token chunk windows; monitor 'middle accuracy' on benchmark passages.
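A minimal sketch of how the recommendations above might be wired together. The function names, the whitespace-style token list, and the 25%-75% "middle" band are illustrative assumptions, not part of the report; only the 32k limit and the 4k chunk window come from it.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

MAX_DIRECT_TOKENS = 32_000  # hard limit from the report
CHUNK_TOKENS = 4_000        # RAG chunk window from the report


def choose_strategy(n_tokens: int) -> str:
    """Send inputs within the limit directly; route longer ones to summarization."""
    return "direct" if n_tokens <= MAX_DIRECT_TOKENS else "hierarchical"


def chunk(tokens: Sequence[str], size: int = CHUNK_TOKENS) -> List[List[str]]:
    """Split a token sequence into consecutive windows of at most `size` tokens."""
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def middle_accuracy(
    results: Iterable[Tuple[float, bool]],
    lo: float = 0.25,  # assumed lower bound of the "middle" band
    hi: float = 0.75,  # assumed upper bound of the "middle" band
) -> Optional[float]:
    """Accuracy over benchmark facts whose relative position falls in [lo, hi].

    Each result is (relative_position, answered_correctly), where the position
    is 0.0 at the start of the context and 1.0 at the end.
    """
    middle = [ok for pos, ok in results if lo <= pos <= hi]
    return sum(middle) / len(middle) if middle else None
```

In practice the token list would come from the provider's own tokenizer, and the band limits would be tuned against whatever benchmark passages are in use.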
Journey Context:
While APIs charge linear per-token rates, transformer attention scales quadratically, O(n²), with sequence length. Providers subsidize short contexts, but contexts above 32k tokens have higher compute intensity and lower cache hit rates. More critically, 'lost in the middle' degradation causes accuracy on information in the middle of a long context to drop to ~60% at 100k tokens, versus >90% at 8k tokens. This forces expensive re-queries or 'stitching' patterns. The effective cost: under linear pricing, a 100k-token request costs the same in API dollars as ten 10k-token requests, but it yields lower accuracy and often requires 2-3 retries to extract middle-context facts, making it 2-3x more expensive in practice than chunked processing.
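The arithmetic above can be sketched as follows. The price of $3 per million tokens is a hypothetical placeholder, and treating the report's 2-3x multiplier as an expected attempt count of 2.5 is an assumption made only for illustration.

```python
def attention_pairs(n_tokens: int) -> int:
    """Token-pair count for full self-attention over one window (n squared)."""
    return n_tokens * n_tokens


def billed_cost(
    tokens_per_request: int,
    n_requests: int,
    usd_per_million_tokens: float,
    expected_attempts: float = 1.0,
) -> float:
    """Linear per-token billing, scaled by how many attempts each request needs."""
    total_tokens = tokens_per_request * n_requests * expected_attempts
    return total_tokens / 1_000_000 * usd_per_million_tokens


PRICE = 3.0  # hypothetical USD per million tokens, not a real provider rate

one_long = billed_cost(100_000, 1, PRICE)    # single 100k-token request
ten_short = billed_cost(10_000, 10, PRICE)   # ten 10k-token requests
long_with_retries = billed_cost(100_000, 1, PRICE, expected_attempts=2.5)

# Same billed tokens on the first attempt, but the long request does more
# attention work: one 100k window versus ten independent 10k windows.
compute_ratio = attention_pairs(100_000) // (10 * attention_pairs(10_000))
effective_ratio = long_with_retries / ten_short
```

Under these assumptions the first-attempt bills are identical, the long request performs about ten times the attention work, and the retry multiplier alone accounts for the effective cost gap.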
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T19:57:26.617904+00:00— report_created — created