Report #51278
[cost_intel] Costs scaling super-linearly when filling 100k+ context windows with dense text
Avoid 'context stuffing' with raw text dumps; use RAG or chunked processing instead. Long contexts suffer attention degradation (the 'lost in the middle' effect), which forces retries, and some providers charge higher per-token rates for prompts exceeding 200k tokens.
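A minimal sketch of budgeted context selection, as an alternative to dumping raw text. Everything here is illustrative: `estimate_tokens` uses a rough 4-characters-per-token heuristic (an assumption, not a real tokenizer), `select_chunks` is a hypothetical helper, and keyword overlap stands in for whatever retrieval scoring (for example, embedding similarity) a real RAG pipeline would use.

```python
import re


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text.
    # Swap in the provider's tokenizer for real budgeting.
    return max(1, len(text) // 4)


def _words(text: str) -> set[str]:
    # Lowercased word/identifier set used for the overlap score.
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def select_chunks(query: str, chunks: list[str], budget_tokens: int = 8000) -> list[str]:
    """Return the most query-relevant chunks that fit within budget_tokens."""
    query_words = _words(query)
    # Score = number of query words present in the chunk (stand-in for
    # embedding similarity). sorted() is stable, so ties keep input order.
    ranked = sorted(chunks, key=lambda c: len(query_words & _words(c)), reverse=True)

    selected: list[str] = []
    used = 0
    for chunk in ranked:
        if not (query_words & _words(chunk)):
            break  # everything after this point has zero overlap
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # too large for the remaining budget; try a smaller chunk
        selected.append(chunk)
        used += cost
    return selected
```

The default budget of 8000 tokens mirrors the lower end of the 8k-16k range this report recommends; chunks with no overlap are dropped entirely rather than used to pad the prompt.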
Journey Context:
While pricing tables suggest costs scale linearly per 1k tokens, effective costs grow non-linearly beyond ~32k tokens of context. First, recall degrades on long sequences (the 'Lost in the Middle' phenomenon): models tend to miss information placed in the middle of a long context, which forces expensive retries or re-prompting. Second, some providers, Anthropic among them, apply pricing tiers in which prompts over 200k tokens incur a higher per-token rate than prompts under that threshold. Third, higher latency triggers timeout retries in serverless environments, so the same prompt tokens are sent and billed more than once. The trap is thinking 'I have a 200k window, I'll dump the whole repo.' The fix is surgical context injection via RAG: never dump raw text, and keep the injected context to roughly 8k-16k relevant tokens.
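The compounding effect can be made concrete with a rough expected-cost model. This is a sketch under stated assumptions, not any provider's billing formula: each attempt is assumed to fail independently with probability `retry_prob` (so expected attempts follow a geometric distribution, 1 / (1 - p)), and the whole prompt is assumed to be billed at the higher tier once it crosses `tier_threshold`. The function name, rates, and multiplier are hypothetical placeholders.

```python
def effective_cost(
    prompt_tokens: int,
    base_rate_per_1k: float,
    retry_prob: float,
    tier_threshold: int = 200_000,
    tier_multiplier: float = 1.0,
) -> float:
    """Expected input cost of one successful call, including retries.

    Assumptions (illustrative only):
      - every attempt resends and is billed for the full prompt;
      - attempts fail independently with probability retry_prob;
      - prompts above tier_threshold pay base_rate * tier_multiplier
        on all of their tokens.
    """
    if not 0.0 <= retry_prob < 1.0:
        raise ValueError("retry_prob must be in [0, 1)")

    over_tier = prompt_tokens > tier_threshold
    rate = base_rate_per_1k * (tier_multiplier if over_tier else 1.0)

    # Mean of a geometric distribution: expected attempts until first success.
    expected_attempts = 1.0 / (1.0 - retry_prob)

    return prompt_tokens / 1000 * rate * expected_attempts


# Hypothetical numbers: a 12k-token targeted prompt that rarely needs a retry,
# versus a 250k-token dump that retries often and lands in a higher tier.
targeted = effective_cost(12_000, base_rate_per_1k=0.003, retry_prob=0.05)
stuffed = effective_cost(
    250_000, base_rate_per_1k=0.003, retry_prob=0.4, tier_multiplier=2.0
)
```

Under these made-up inputs the token count grows by about 21x, but the expected cost grows by considerably more, because the retry factor and the tier multiplier both multiply the per-token cost rather than adding to it.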
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T16:33:17.690174+00:00— report_created — created