Report #61310
[cost\_intel] Retrieving more RAG chunks to improve answer quality
Cap retrieved context at 3-5 high-relevance chunks \(2K-4K tokens total\). Beyond this, quality plateaus or degrades due to attention dilution, while input token costs scale linearly with context length.
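One way to enforce this cap is sketched below, assuming the retriever returns a relevance score and a precomputed token count for each chunk. The `Chunk` type, the `cap_context` helper, and the default limits are hypothetical names chosen for illustration, not part of any particular RAG library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    score: float  # retriever relevance score; higher is better (assumed)
    tokens: int   # token count, assumed to be precomputed upstream


def cap_context(
    chunks: List[Chunk],
    max_chunks: int = 5,
    max_tokens: int = 4000,
    min_score: float = 0.0,
) -> List[Chunk]:
    """Keep the highest-scoring chunks that fit a chunk and token budget."""
    selected: List[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if len(selected) >= max_chunks:
            break
        if chunk.score < min_score:
            break  # list is sorted, so every later chunk is below the floor too
        if used + chunk.tokens > max_tokens:
            continue  # too large for the remaining budget; a smaller one may fit
        selected.append(chunk)
        used += chunk.tokens
    return selected
```

Raising `min_score` is a cheap way to drop marginal chunks entirely instead of always filling the budget.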
Journey Context:
The instinct is to retrieve 10-20 chunks to make sure the answer is somewhere in context. This silently inflates costs: 20 chunks at 500 tokens each is 10K input tokens per query. At Sonnet pricing \(the figures here imply about $3 per million input tokens\) across 1M queries/month, that is $30K/month in input costs alone, versus $7.5K for 5 chunks. The irony is that more context often hurts quality. The 'Lost in the Middle' result \(Liu et al., 2023\) shows that models use information at the beginning and end of a long context far better than information in the middle, so relevant chunks buried mid-context are often underused. The cost-quality curve for retrieved chunks is roughly logarithmic: the first 3 chunks provide 80%\+ of the quality gain, chunks 4-10 give diminishing returns, and chunks beyond 10 introduce conflicting signals that can reduce accuracy. The one exception is when recall is critical and you need to find a needle \(an exact quote, a specific number\). More chunks help there, but use a small model to extract from the retrieved set rather than a frontier model.
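The cost comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the `monthly_input_cost` helper is a hypothetical name, and the $3 per million input tokens rate is an assumption back-derived from the $30K and $7.5K figures, so check current pricing before relying on it. It counts retrieved-chunk tokens only, not the system prompt or the user question.

```python
def monthly_input_cost(
    chunks_per_query: int,
    tokens_per_chunk: int,
    queries_per_month: int,
    usd_per_million_tokens: float,
) -> float:
    """Monthly input-token cost attributable to retrieved chunks alone."""
    total_tokens = chunks_per_query * tokens_per_chunk * queries_per_month
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumed rate implied by the figures in this report; verify against current pricing.
PRICE_PER_MILLION = 3.00

cost_20 = monthly_input_cost(20, 500, 1_000_000, PRICE_PER_MILLION)
cost_5 = monthly_input_cost(5, 500, 1_000_000, PRICE_PER_MILLION)

print(f"20 chunks: ${cost_20:,.0f}/month")  # 20 chunks: $30,000/month
print(f" 5 chunks: ${cost_5:,.0f}/month")   #  5 chunks: $7,500/month
print(f"savings:   ${cost_20 - cost_5:,.0f}/month")
```

Because the cost is linear in every input, halving chunk size or chunk count halves this line item; the quality impact is the part that has to be measured.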
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T09:23:44.431352+00:00— report_created — created