Report #78794
[cost_intel] Why does RAG with long context windows often cost 10x more than expected with no quality improvement?
Limit retrieved RAG context to about 4k tokens even when a 200k context window is available. Filling the window with 'relevant' chunks introduces position bias, where content in the middle of the prompt tends to be ignored, so token costs multiply while recall degrades. Use reranking to select at most the top 3 chunks.
Journey Context:
There's a dangerous pattern: teams pay for 100k context windows and assume 'more context is better.' They retrieve 20 chunks of 2k tokens each to fill the window. This triggers two problems: (1) 'Lost in the middle' position bias - models tend to ignore information in the middle of long contexts, so roughly 60% of your tokens can end up wasted. (2) Retrieval noise - past the top-5 chunks, relevance drops off sharply, so each extra chunk mostly adds distractor tokens that confuse the model. The economics: sending 40k tokens (20 chunks x 2k) when 4k would suffice (the top-2 chunks) costs 10x more in input tokens and gives worse answers. The fix is aggressive reranking (Cohere Rerank or a CrossEncoder) to select only the 2-3 most relevant chunks, keeping total context under 4k tokens even with 200k windows available. With 2k-token chunks that means two chunks, or three if they are trimmed to fit.
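A minimal sketch of the selection step and the cost arithmetic described above, under stated assumptions. The helper names (`approx_tokens`, `overlap_score`, `select_context`, `input_cost_ratio`) are hypothetical and not from any library. The word-overlap scorer is only a toy stand-in for a real reranker such as Cohere Rerank or a CrossEncoder, and the 4-characters-per-token estimate is a rough assumption that a real tokenizer should replace.

```python
from typing import Callable, List, Tuple


def approx_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def overlap_score(query: str, chunk: str) -> float:
    # Toy stand-in for a reranker: fraction of query words that appear in the chunk.
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0


def select_context(
    query: str,
    chunks: List[str],
    score: Callable[[str, str], float] = overlap_score,
    max_chunks: int = 3,
    token_budget: int = 4000,
) -> Tuple[List[str], int]:
    # Rank all retrieved chunks by relevance, highest first (stable for ties).
    ranked = sorted(chunks, key=lambda ch: score(query, ch), reverse=True)
    selected: List[str] = []
    used = 0
    # Only consider the top `max_chunks`, and skip any that would break the budget.
    for chunk in ranked[:max_chunks]:
        cost = approx_tokens(chunk)
        if used + cost > token_budget:
            continue
        selected.append(chunk)
        used += cost
    return selected, used


def input_cost_ratio(naive_tokens: int, trimmed_tokens: int) -> float:
    # Input cost scales linearly with tokens, so the ratio is the cost multiplier.
    return naive_tokens / trimmed_tokens


if __name__ == "__main__":
    retrieved = [
        "the cache ttl is 30 seconds",
        "bananas are yellow",
        "ttl cache config lives in settings",
        "unrelated text about weather",
    ]
    context, tokens_used = select_context("cache ttl", retrieved, max_chunks=2)
    print(context, tokens_used)
    # 20 chunks of 2k tokens versus the top-2 chunks of 2k tokens.
    print(input_cost_ratio(20 * 2000, 2 * 2000))
```

To try a real reranker, pass a different `score` callable with the same `(query, chunk) -> float` shape; the budget and chunk cap stay in one place, so the 4k limit holds regardless of how large the model's window is.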
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:51:04.853994+00:00— report_created — created