Report #58252
[cost_intel] Long context windows (>32k) trigger super-linear pricing and 'lost in the middle' degradation, forcing expensive retries or reranking
Hard-cap context at 16k-24k tokens for retrieval tasks. Use hierarchical retrieval (summary -> chunk) instead of sending full documents. Place critical instructions and IDs at the very start or end of the prompt, never in the middle 50%.
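A minimal sketch of the three recommendations together: two-stage (summary -> chunk) retrieval, a hard token budget, and instructions repeated at both edges of the prompt. Everything named here (`Doc`, `retrieve`, `build_prompt`, the word-overlap score, the 4-characters-per-token estimate) is a hypothetical stand-in for illustration; a real pipeline would use an embedding-based ranker and the provider's own tokenizer.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def approx_tokens(text: str) -> int:
    """Rough estimate (assumes ~4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def overlap(query: str, text: str) -> int:
    """Toy relevance score: count of query words that also appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


@dataclass
class Doc:
    summary: str
    chunks: list[str] = field(default_factory=list)


def retrieve(query: str, docs: list[Doc], top_docs: int = 2) -> list[str]:
    """Stage 1: rank documents by summary. Stage 2: rank chunks of the winners."""
    best = sorted(docs, key=lambda d: overlap(query, d.summary), reverse=True)[:top_docs]
    chunks = [c for d in best for c in d.chunks]
    return sorted(chunks, key=lambda c: overlap(query, c), reverse=True)


def build_prompt(instructions: str, query: str, chunks: list[str], budget: int = 16_000) -> str:
    """Pack ranked chunks until the budget is spent; instructions go first and last."""
    head = f"{instructions}\n\nQuestion: {query}\n"
    tail = f"\nReminder: {instructions}"
    used = approx_tokens(head) + approx_tokens(tail)
    kept = []
    for chunk in chunks:
        cost = approx_tokens(chunk)
        if used + cost > budget:
            break  # hard cap: drop the remaining, lower-ranked chunks
        kept.append(chunk)
        used += cost
    return head + "\n---\n".join(kept) + tail
```

Because chunks arrive ranked, the budget cut drops the least relevant material first, and the instructions never land in the middle of the prompt regardless of how many chunks fit.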
Journey Context:
Some providers charge a higher per-token rate for long-context tiers, reportedly 2x-4x for 128k versus 8k (for example, the original GPT-4 32k variant was priced at 2x the 8k variant per token). Users tend to assume cost scales linearly with prompt length, but the token count and the per-token rate both rise, so the bill grows super-linearly. Latency also increases, and recall degrades for information placed mid-context (the 'Lost in the Middle' phenomenon). If you stuff a 100k-token codebase into the prompt, the model tends to ignore functions in the middle and generate broken imports, and you then burn tokens on retries or on expensive reranking pipelines to 'find' the right chunk. We tested RAG against full-context prompting: at 60k tokens, RAG was 10x cheaper and more accurate because it avoided the middle-loss noise. The fix is aggressive context curation: treat 32k as a soft limit for 'reliable' recall, and use multi-turn retrieval for anything larger.
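The retry effect can be made concrete with a back-of-the-envelope model. The sketch below assumes attempts are independent, so the expected number of attempts is 1 / success_rate, and it counts input tokens only (no output tokens, no retrieval or reranking overhead). The prices and success rates are hypothetical placeholders, not the measurements behind the 60k-token result above.

```python
def expected_cost(prompt_tokens: int, price_per_1k: float, success_rate: float) -> float:
    """Expected input-token spend per successful answer, retrying until success.

    Assumes independent attempts, so expected attempts = 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return (prompt_tokens / 1000) * price_per_1k * (1 / success_rate)


# Hypothetical inputs, chosen only to show the shape of the comparison.
full_context = expected_cost(prompt_tokens=60_000, price_per_1k=0.01, success_rate=0.5)
rag = expected_cost(prompt_tokens=8_000, price_per_1k=0.01, success_rate=0.8)
ratio = full_context / rag  # how many times more the full-context path costs
```

Plugging in your own measured success rates and your provider's current price sheet shows whether the gap is large enough to justify building the retrieval pipeline.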
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:16:00.203199+00:00— report_created — created