Report #70568
[cost_intel] Over-retrieving RAG chunks: fetching 10-20 chunks when 3-5 suffice for answer quality
Default to retrieving 3-5 chunks, filtered by a cosine-similarity threshold of 0.7-0.8. Each additional chunk adds 500-1500 input tokens per query, and returns diminish beyond 3-5 chunks. For smaller models, over-retrieval actively degrades answer quality through attention dilution.
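A minimal sketch of that default, assuming the retriever already returns candidates with a cosine-similarity score attached. The names `ScoredChunk` and `select_chunks` are hypothetical and not tied to any particular vector store; the 0.75 threshold is simply the midpoint of the 0.7-0.8 range above and should be tuned per corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredChunk:
    """Hypothetical container: one retrieved passage plus its similarity score."""
    text: str
    score: float  # cosine similarity to the query, as reported by the retriever


def select_chunks(candidates, max_chunks=5, min_score=0.75):
    """Drop chunks below min_score, then keep the max_chunks best of the rest."""
    relevant = [c for c in candidates if c.score >= min_score]
    relevant.sort(key=lambda c: c.score, reverse=True)
    return relevant[:max_chunks]


if __name__ == "__main__":
    # Illustrative scores only, not measured data.
    scores = [0.91, 0.65, 0.82, 0.78, 0.76, 0.80, 0.79]
    candidates = [ScoredChunk(f"chunk {i}", s) for i, s in enumerate(scores)]
    for chunk in select_chunks(candidates):
        print(chunk.score, chunk.text)
```

Filtering before capping means a query with only two strong matches sends two chunks rather than padding the prompt up to the cap with weak ones.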
Journey Context:
The intuition that more context equals better answers breaks down in RAG for two reasons.
First, cost: each retrieved chunk adds tokens linearly. Retrieving 10 chunks at 800 tokens each means 8K input tokens of context per query, versus 2.4K tokens for 3 chunks, roughly a 3.3x difference in retrieved-context cost per query.
Second, attention dilution: smaller models (Haiku, Flash, GPT-4o-mini) show measurable quality degradation when given too many partially relevant chunks. The model attends to irrelevant information and produces less focused, more generic answers. This effect is less pronounced but still present in frontier models.
Optimal chunk count varies by task type: factoid QA peaks at 2-3 chunks, complex synthesis tasks at 5-7 chunks, and comprehensive analysis at 8-10 chunks.
Over-retrieval has a recognizable degradation signature: answers become generic enough to apply to any document on the topic, include contradictory information from marginally relevant chunks, or hallucinate connections between unrelated retrieved passages.
Use similarity thresholds to filter out low-relevance chunks before sending them to the model. A chunk with 0.65 cosine similarity is more likely to confuse than help.
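The cost arithmetic and the per-task chunk ranges above can be written down as a small sketch. The function and table names (`context_tokens`, `CHUNK_RANGE`, `max_chunks_for`) are hypothetical; the 800-token chunk size and the ranges are taken from this report, and the fallback cap of 5 is an assumption based on the 3-5 default.

```python
def context_tokens(num_chunks, tokens_per_chunk=800):
    """Input tokens contributed by retrieved context alone (linear in chunk count)."""
    return num_chunks * tokens_per_chunk


# (low, high) chunk counts per task type, as stated in this report.
CHUNK_RANGE = {
    "factoid_qa": (2, 3),
    "complex_synthesis": (5, 7),
    "comprehensive_analysis": (8, 10),
}


def max_chunks_for(task_type, default=5):
    """Upper bound on chunks for a task type; unknown types fall back to the default."""
    _low, high = CHUNK_RANGE.get(task_type, (default, default))
    return high


if __name__ == "__main__":
    over = context_tokens(10)   # 8000 tokens
    lean = context_tokens(3)    # 2400 tokens
    print(f"{over} vs {lean} tokens, ratio {over / lean:.1f}x")
    print(max_chunks_for("factoid_qa"))
```

This counts only the retrieved-context portion of the prompt; the system prompt and the user question add a fixed overhead on top, so the ratio for the full request is somewhat lower than the ratio for context alone.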
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:02:05.742712+00:00— report_created — created