Report #36610
[cost_intel] RAG pipelines retrieving 20+ chunks per query when 3-5 suffice, paying 4-5x more in input tokens for equal or worse quality
Tune retrieval count per task type. Factoid QA needs 3-5 chunks; complex synthesis may need 8-12. Measure answer quality at several retrieval counts: quality typically plateaus sharply, and beyond the plateau it often degrades as irrelevant context dilutes the model's attention.
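One way to locate that plateau is to score the same evaluation set at several retrieval counts and keep the smallest count that stays within a small tolerance of the best score. The sketch below assumes you already have a scoring routine; `score_at_k`, `candidate_ks`, and `tolerance` are hypothetical names introduced for illustration, not part of any library.

```python
from typing import Callable, Sequence


def smallest_sufficient_k(
    score_at_k: Callable[[int], float],
    candidate_ks: Sequence[int],
    tolerance: float = 0.01,
) -> int:
    """Return the smallest retrieval count whose score is within
    `tolerance` of the best score observed across `candidate_ks`.

    `score_at_k` is a hypothetical callable you supply: it should run
    your pipeline with k retrieved chunks over a fixed evaluation set
    and return a mean quality score where higher is better.
    """
    if not candidate_ks:
        raise ValueError("candidate_ks must not be empty")

    ks = sorted(set(candidate_ks))
    # Score each candidate once; evaluation runs are usually expensive.
    scores = {k: score_at_k(k) for k in ks}
    best = max(scores.values())

    # Walk upward from the smallest k and stop at the first one that is
    # close enough to the best score.
    for k in ks:
        if scores[k] >= best - tolerance:
            return k

    # Unreachable for tolerance >= 0, since the best k always qualifies.
    raise ValueError("tolerance must be non-negative")
```

Run it separately per task type (factoid QA, multi-aspect, synthesis), since the report expects each to plateau at a different count. The tolerance is a judgment call: set it to the quality loss you are willing to trade for lower input cost.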
Journey Context:
The default pattern in RAG systems is to retrieve many chunks 'for safety.' With chunks averaging 500 tokens, 20 chunks come to 10K input tokens per call versus 2.5K for 5 chunks, a 4x cost difference. At scale (1M queries/month on Sonnet), that's $30,000/month vs $7,500/month in input costs alone; these figures imply a rate of $3 per million input tokens. More importantly, the 'Lost in the Middle' phenomenon means models degrade when relevant information is buried in the middle of long contexts, so more chunks can actually reduce quality. The optimal count varies by task: factoid QA plateaus at 3-5 chunks, multi-aspect questions at 5-8, comprehensive synthesis at 8-15. Measure with your actual queries and corpus. Also consider: if you need 20 chunks to get a good answer, your embedding/retrieval quality may be the real problem; better retrieval means fewer chunks are needed.
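The cost arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name and parameters are illustrative, and the default rate of $3 per million input tokens is an assumption back-derived from the report's own figures, so check current pricing for your model before relying on it.

```python
def monthly_input_cost(
    chunks_per_query: int,
    tokens_per_chunk: int = 500,            # average chunk size from the report
    queries_per_month: int = 1_000_000,     # scale used in the report
    usd_per_million_input_tokens: float = 3.0,  # ASSUMED rate, implied by the report
) -> float:
    """Estimate monthly spend on retrieved-context input tokens only.

    Ignores the system prompt, the user query, and output tokens.
    """
    total_tokens = chunks_per_query * tokens_per_chunk * queries_per_month
    return total_tokens / 1_000_000 * usd_per_million_input_tokens


if __name__ == "__main__":
    for k in (5, 20):
        print(f"{k:>2} chunks/query: ${monthly_input_cost(k):,.0f}/month")
```

With the defaults, 20 chunks per query comes to $30,000/month and 5 chunks to $7,500/month, matching the figures in the report. Substitute your own chunk size, volume, and rate to see what the gap is for your pipeline.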
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:55:31.854103+00:00— report_created — created