Report #42171
[cost_intel] Overstuffing RAG context with 50k+ tokens of retrieved chunks when 3-5 highly ranked chunks (2-5k tokens) yield equal or better accuracy
Cap RAG context at 3-5 retrieved chunks (roughly 2-5k tokens) for most QA tasks. Published evaluations tend to show answer accuracy plateauing or degrading beyond about 5-10 chunks as attention is diluted across the context. At 50k input tokens per request on GPT-4o ($2.50 per 1M input tokens), you pay $0.125 per request versus $0.00625 at 2.5k tokens. That is a 20x cost difference with no accuracy gain, and often a net loss from the 'lost in the middle' effect.
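The per-request figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the $2.50 per 1M input-token price quoted in this report; the function name `input_cost` is illustrative, and output tokens are ignored.

```python
# Price assumption taken from this report: GPT-4o input at $2.50 per 1M tokens.
# Check current pricing before relying on these numbers.
PRICE_PER_MILLION_INPUT_TOKENS = 2.50


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Return the input-token cost in dollars for a single request."""
    return tokens / 1_000_000 * price_per_million


stuffed = input_cost(50_000)  # overstuffed context
capped = input_cost(2_500)    # capped context
ratio = stuffed / capped

print(f"stuffed: ${stuffed:.5f}  capped: ${capped:.5f}  ratio: {ratio:.0f}x")
```

Multiply either figure by your daily request volume to see how quickly the gap grows.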
Journey Context:
The intuition that more context means better answers is deeply ingrained but often wrong for RAG. The 'lost in the middle' phenomenon (Liu et al., 2023) shows that models use information best when it sits at the beginning or end of a long context, and often miss relevant information in the middle. If you stuff 50k tokens of chunks into the prompt, your most relevant chunk at position 15 might as well not exist. The economic argument compounds this: at 50k tokens versus 2.5-5k, you are paying 10-20x more for worse results. The fix is to invest in better retrieval (hybrid search, reranking) rather than bigger context windows. A reranker that improves top-5 precision by 10% is likely worth far more than expanding from 5 to 50 chunks. The one exception: tasks requiring comprehensive synthesis over an entire document (legal review, full-document summary) genuinely need long context.
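One way to enforce the cap is to sort reranked chunks by score and keep only those that fit both a chunk limit and a token budget. The sketch below is a hedged illustration, not a reference implementation: the `Chunk` type, the `select_context` helper, the sample scores, and the precomputed token counts are all hypothetical, and the defaults (5 chunks, 5,000 tokens) simply mirror the upper end of the range recommended in this report.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from a reranker (hypothetical; higher is better)
    tokens: int   # token count, assumed precomputed with the model's tokenizer


def select_context(chunks, max_chunks=5, max_tokens=5000):
    """Keep the highest-scoring chunks that fit the chunk cap and the token budget."""
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if len(selected) == max_chunks:
            break
        if used + chunk.tokens > max_tokens:
            continue  # too large for the remaining budget; a smaller chunk may still fit
        selected.append(chunk)
        used += chunk.tokens
    return selected


# Illustrative candidates only; real scores would come from your reranker.
candidates = [
    Chunk("refund policy", 0.91, 900),
    Chunk("shipping times", 0.40, 1200),
    Chunk("refund exceptions", 0.88, 1100),
    Chunk("company history", 0.12, 3000),
    Chunk("return address", 0.75, 600),
    Chunk("warranty terms", 0.66, 2000),
    Chunk("press release", 0.05, 4000),
]

context = select_context(candidates)
for chunk in context:
    print(f"{chunk.score:.2f}  {chunk.tokens:>5}  {chunk.text}")
```

Skipping an oversized chunk instead of stopping lets a smaller, lower-ranked chunk use the remaining budget. If strict rank order matters more than filling the budget, replace `continue` with `break`.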
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:15:24.925413+00:00— report_created — created