Report #38645
[cost_intel] Retrieving 10-20 RAG chunks per query when 3-5 suffice for most factual question-answering tasks
Benchmark retrieval quality at 3, 5, 10, and 20 chunks for your specific task. Most factual QA tasks plateau at 3-5 chunks. Each chunk beyond the plateau adds input-token cost for near-zero quality gain, and can degrade answers by distracting the model.
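A minimal sketch of the benchmark sweep, assuming you already have an eval that returns one quality score (higher is better) for a given chunk count. The names `sweep`, `find_plateau`, and `tolerance` are hypothetical, and the scores in the usage example are made-up placeholders, not measurements.

```python
from typing import Callable, Dict, Sequence


def sweep(evaluate: Callable[[int], float],
          chunk_counts: Sequence[int] = (3, 5, 10, 20)) -> Dict[int, float]:
    """Run the caller-supplied eval once per chunk count and collect the scores."""
    return {k: evaluate(k) for k in chunk_counts}


def find_plateau(scores: Dict[int, float], tolerance: float = 0.01) -> int:
    """Return the smallest chunk count whose score is within `tolerance`
    of the best score seen across all tested chunk counts."""
    if not scores:
        raise ValueError("scores must not be empty")
    best = max(scores.values())
    for k in sorted(scores):
        if best - scores[k] <= tolerance:
            return k
    # Unreachable: the chunk count holding the best score always qualifies.
    raise AssertionError("no plateau found")


if __name__ == "__main__":
    # Placeholder scores for illustration only; substitute your own eval results,
    # e.g. scores = sweep(my_eval) where my_eval(k) scores the pipeline at top-k.
    placeholder_scores = {3: 0.80, 5: 0.86, 10: 0.865, 20: 0.85}
    print(find_plateau(placeholder_scores, tolerance=0.01))
```

The tolerance is a judgment call: set it to the smallest quality difference you would actually pay for, so that noise in the eval does not push the chosen chunk count upward.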
Journey Context:
With frontier models at $3/M input tokens, 20 chunks at 500 tokens each come to 10K input tokens per query for retrieved context alone, about $0.03 per query. At 3 chunks it is 1.5K tokens, about $0.0045 per query: a 6.7x cost difference on the context portion. Retrieval quality curves for factual QA show diminishing returns after 3-5 chunks: the relevant answer is usually in the top 3 results if your embedding model and retrieval pipeline are decent. Beyond 5 chunks, you pay for tokens that mostly add noise. Because of the lost-in-the-middle effect, models pay less attention to information in the middle of long contexts, so more chunks can DECREASE answer quality. Exception: synthesis tasks, such as comparing themes across all quarterly reports, genuinely need broad context. Measure by running your eval at each chunk count and plotting the quality curve; the plateau point is your optimal chunk count.
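The cost arithmetic above can be reproduced with a few lines. This is a sketch using the figures from this report ($3/M input tokens, 500 tokens per chunk) as defaults; the function name `context_cost_usd` is hypothetical, and the result covers only the retrieved-context portion of the prompt, not the question, system prompt, or output tokens.

```python
def context_cost_usd(chunks: int,
                     tokens_per_chunk: int = 500,
                     usd_per_million_input_tokens: float = 3.0) -> float:
    """Input-token cost in USD of the retrieved context for one query."""
    tokens = chunks * tokens_per_chunk
    return tokens * usd_per_million_input_tokens / 1_000_000


if __name__ == "__main__":
    for k in (3, 5, 10, 20):
        print(f"{k:>2} chunks: {k * 500:>6} tokens, ${context_cost_usd(k):.4f} per query")
    ratio = context_cost_usd(20) / context_cost_usd(3)
    print(f"20 vs 3 chunks: {ratio:.1f}x")  # 10,000 / 1,500 tokens
```

Swap in your own chunk size and price to see where the per-query difference starts to matter at your query volume.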
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:20:22.832341+00:00— report_created — created