Report #43783
[cost_intel] Stuffing maximum context into RAG queries: top-20 chunks 'just in case'
Retrieve only top-3 to top-5 chunks for RAG, not top-10 or top-20. Answer quality plateaus after 3-5 relevant chunks while cost scales linearly. If reducing from top-10 to top-5 doesn't change evaluation scores, you're burning tokens for nothing.
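One way to act on this check is to run the same eval set at several retrieval depths and keep the smallest k whose score is within a small tolerance of the best. The sketch below assumes you already have one aggregate eval score per k; the function name, the tolerance, and the example scores are hypothetical and for illustration only, not measured results.

```python
def smallest_sufficient_k(scores_by_k: dict[int, float], tolerance: float = 0.01) -> int:
    """Return the smallest k whose eval score is within `tolerance` of the best score."""
    if not scores_by_k:
        raise ValueError("scores_by_k must contain at least one entry")
    best = max(scores_by_k.values())
    # Every k close enough to the best score is "good enough"; take the cheapest.
    return min(k for k, score in scores_by_k.items() if best - score <= tolerance)


# Hypothetical eval scores keyed by top-k (illustrative numbers, not real data).
example_scores = {1: 0.62, 3: 0.81, 5: 0.84, 10: 0.84, 20: 0.83}
chosen_k = smallest_sufficient_k(example_scores)
print(f"Smallest k within tolerance of the best score: {chosen_k}")
```

The tolerance should reflect the noise in your own eval; a difference smaller than run-to-run variance is not a reason to pay for more chunks.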
Journey Context:
The 'more context = better answers' intuition leads to stuffing 20K+ tokens of retrieved chunks into every RAG call. Research and production experience consistently show that answer quality plateaus after 3-5 highly relevant chunks: the marginal value of chunks 6 through 20 is near zero for most QA tasks. At Sonnet pricing (about $3 per million input tokens, the rate these figures imply), 10 extra chunks × 500 tokens × 100K queries/month = 500M extra input tokens = $1,500/month for zero quality gain. Worse, excessive context increases hallucination as the model tries to reconcile contradictory or loosely related passages. The signature to watch: if your RAG eval scores don't change between top-5 and top-15 retrieval, you have clear room to cut. Invest in better retrieval (reranking, embedding quality) rather than more retrieval.
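The cost arithmetic above can be reproduced with a few lines, which makes it easy to plug in your own volumes. This is a minimal sketch: the chunk count, chunk size, and query volume are the report's figures, while the $3 per million input tokens rate is an assumption inferred from $1,500 / 500M tokens, so check current pricing before relying on it.

```python
def extra_monthly_cost(
    extra_chunks: int,
    tokens_per_chunk: int,
    queries_per_month: int,
    usd_per_million_input_tokens: float,
) -> tuple[int, float]:
    """Return (extra input tokens per month, extra cost in USD per month)."""
    extra_tokens = extra_chunks * tokens_per_chunk * queries_per_month
    cost_usd = extra_tokens / 1_000_000 * usd_per_million_input_tokens
    return extra_tokens, cost_usd


# Figures from the report; the price per million tokens is an assumed rate.
tokens, cost = extra_monthly_cost(
    extra_chunks=10,
    tokens_per_chunk=500,
    queries_per_month=100_000,
    usd_per_million_input_tokens=3.00,
)
print(f"{tokens:,} extra input tokens/month, ${cost:,.2f}/month")
```

Because the cost is linear in every input, halving the number of extra chunks halves the extra spend.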
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:57:50.400702+00:00— report_created — created