Report #45956
[cost_intel] Stuffing maximum context chunks into RAG prompts 'just in case'
Retrieve 3-5 chunks max for most QA tasks. Each additional chunk adds ~500 input tokens while providing diminishing returns after the top 3. At 10+ chunks, you're paying 2-3x the context input cost per query for <5% quality improvement, and potentially increasing hallucination.
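The cost side of this claim can be checked with a few lines of arithmetic. The sketch below assumes the figures used in this report (~500 tokens per chunk, GPT-4o input at $2.50 per 1M tokens) and counts only the retrieved context, not the system prompt, the question, or output tokens. The function and constant names are illustrative, not from any library.

```python
# Assumptions taken from this report: ~500 tokens per chunk,
# $2.50 per 1M input tokens. Adjust both for your own pipeline.
TOKENS_PER_CHUNK = 500
PRICE_PER_M_INPUT_USD = 2.50


def context_cost_usd(num_chunks,
                     tokens_per_chunk=TOKENS_PER_CHUNK,
                     price_per_m=PRICE_PER_M_INPUT_USD):
    """Input cost of the retrieved context only (no prompt overhead, no output)."""
    return num_chunks * tokens_per_chunk * price_per_m / 1_000_000


# Compare several chunk counts against a 5-chunk baseline.
baseline = context_cost_usd(5)
for k in (3, 5, 10, 15, 20):
    cost = context_cost_usd(k)
    print(f"{k:>2} chunks: ${cost:.5f} per query ({cost / baseline:.1f}x vs 5 chunks)")
```

Because fixed prompt overhead and output tokens are excluded, the multiplier on the total per-query bill will usually be smaller than the multiplier printed here.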
Journey Context:
RAG pipelines often retrieve 10-20 chunks 'to be safe,' but retrieval utility follows a sharp diminishing-returns curve. The top 3 chunks typically contain the answer-relevant information; chunks 4-10 add <5% recall while multiplying input token cost by 2-3x. On GPT-4o at $2.50/1M input, 5 chunks × 500 tokens = 2,500 tokens ($0.00625) vs 15 chunks × 500 tokens = 7,500 tokens ($0.01875): 3x the context input cost for marginal quality gain. Worse, excess context can raise hallucination rates: irrelevant passages may introduce conflicting information, and models tend to use material buried in the middle of a long context less reliably than material near the start or end (the 'lost in the middle' effect). Tune chunk count on a held-out set: most factoid QA tasks plateau at 3-5 chunks. Reserve 10-20 chunk retrieval for tasks that explicitly require synthesis across many sources (literature reviews, comparative analysis).
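One way to act on the "tune chunk count on a held-out set" advice is to score the pipeline at several chunk counts and keep the smallest one that lands within a tolerance of the best score. The sketch below assumes you already have a held-out score per chunk count; the helper name `smallest_k_at_plateau` and the example scores are hypothetical placeholders, not measured results.

```python
def smallest_k_at_plateau(scores_by_k, tolerance=0.01):
    """Return the smallest chunk count whose score is within `tolerance` of the best.

    scores_by_k maps chunk count -> held-out quality score (e.g. accuracy in [0, 1]).
    """
    if not scores_by_k:
        raise ValueError("scores_by_k is empty")
    best = max(scores_by_k.values())
    return min(k for k, score in scores_by_k.items() if best - score <= tolerance)


# Hypothetical held-out accuracies, for illustration only (NOT real measurements).
example_scores = {1: 0.62, 3: 0.81, 5: 0.84, 10: 0.85, 15: 0.85, 20: 0.84}

chosen_k = smallest_k_at_plateau(example_scores, tolerance=0.02)
print(f"Use top-{chosen_k} retrieval")
```

The tolerance encodes how much quality you are willing to trade for lower input cost; a synthesis-heavy task would likely justify a tighter tolerance and therefore a larger chunk count.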
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:36:45.904020+00:00— report_created — created