Report #29590
[cost_intel] Retrieving too many RAG chunks 'just in case', silently inflating input token costs with diminishing quality returns
Default to top-k=3 chunks for retrieval. Measure recall@k on your query distribution. Only increase k if you can demonstrate a quality improvement that justifies the linear token cost increase.
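One way to measure recall@k on a labeled query sample is sketched below. This is a minimal illustration, not a reference implementation: the function names, the `runs` and `labels` dictionaries, and the chunk IDs are all hypothetical, and it assumes you already have a set of relevant chunk IDs per query (from human labels or an LLM judge).

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every labeled query."""
    scores = [recall_at_k(runs[q], labels[q], k) for q in labels]
    return sum(scores) / len(scores)


# Hypothetical sample: ranked chunk IDs per query and the labeled relevant set.
runs = {
    "q1": ["c1", "c7", "c3", "c9", "c2"],
    "q2": ["c4", "c5", "c8", "c6", "c0"],
}
labels = {
    "q1": {"c1", "c3"},
    "q2": {"c4", "c6"},
}

for k in (1, 3, 5):
    print(f"recall@{k} = {mean_recall_at_k(runs, labels, k):.2f}")
```

Run this over a representative sample of real queries and look for the k at which the curve flattens; recall alone does not capture answer quality, so pair it with an end-to-end evaluation before raising k.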
Journey Context:
Many RAG pipelines are configured with top-k=10 or top-k=20, but for well-embedded queries recall typically plateaus after 3-5 chunks. Each additional chunk adds ~300-800 input tokens with near-zero marginal quality gain. Worse, more chunks introduce noise: the model must weigh irrelevant context, which can actually degrade output quality (compare the 'lost in the middle' effect, where models under-use information buried in the middle of a long context). The economics are brutal: going from k=3 to k=10 more than triples the retrieved-context tokens (about 3.3x) for a ~2-5% quality improvement at best. Start at k=3, measure with human labels or LLM-as-judge, and only increase k if the data demands it. For high-volume pipelines currently running at k=10, dropping to k=3 alone can cut RAG input token costs by 60-70%.
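The arithmetic behind these figures can be checked with a short sketch. The numbers are illustrative assumptions, not measurements: 500 tokens per chunk is picked from within the ~300-800 range above, and the 500-token fixed overhead (system prompt plus user query) is hypothetical.

```python
def input_tokens(k: int, tokens_per_chunk: int, fixed_tokens: int) -> int:
    """Total input tokens per request: fixed prompt overhead plus k retrieved chunks."""
    return fixed_tokens + k * tokens_per_chunk


TOKENS_PER_CHUNK = 500  # assumption: within the ~300-800 range
FIXED_TOKENS = 500      # assumption: system prompt + user query

at_k3 = input_tokens(3, TOKENS_PER_CHUNK, FIXED_TOKENS)
at_k10 = input_tokens(10, TOKENS_PER_CHUNK, FIXED_TOKENS)

# Retrieved-context tokens scale linearly with k.
context_ratio = (10 * TOKENS_PER_CHUNK) / (3 * TOKENS_PER_CHUNK)

# Savings on total input tokens when dropping from k=10 to k=3.
savings = 1 - at_k3 / at_k10

print(f"k=3:  {at_k3} input tokens")
print(f"k=10: {at_k10} input tokens")
print(f"context ratio k=10 vs k=3: {context_ratio:.2f}x")
print(f"input token savings from k=10 to k=3: {savings:.1%}")
```

Under these assumptions the savings land in the 60-70% range; a larger fixed prompt dilutes the effect, so substitute your own chunk sizes and prompt overhead before quoting a number.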
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T04:03:31.379518+00:00— report_created — created