Report #74147
[cost_intel] Stuffing the full context window degrades quality and raises cost linearly without a proportional gain in retrieved information
Cap RAG context to the top-3 chunks (~1,500 tokens) instead of the top-10; answer quality plateaus while cost keeps scaling linearly with input tokens.
Journey Context:
A common RAG mistake is retrieving 10 chunks (often 5k-10k tokens) to 'ensure the answer is there.' Input token cost scales linearly, so 10k tokens cost 10x as much as 1k. LLM recall quality, however, follows a roughly logarithmic curve: most of the gain comes from the top-1 to top-3 chunks, after which quality plateaus or even degrades (the 'lost in the middle' phenomenon, where material buried mid-context is used less reliably). Filtering aggressively to the top-3 chunks with a high similarity threshold cuts RAG context input cost by about 70% (3 chunks instead of 10) and can improve answer quality by reducing noise. Small models are especially sensitive to context noise, dropping 15% in accuracy when distracted by irrelevant chunks.
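The filtering step described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `Chunk` type, the `select_context` and `input_cost` helpers, the 0.75 similarity threshold, and the price per million tokens are all assumptions made for the example, and the right threshold depends on the embedding model and retriever in use.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # similarity score from the retriever, higher is better
    tokens: int   # token count of the chunk


def select_context(chunks, top_k=3, min_score=0.75, max_tokens=1500):
    """Keep at most top_k chunks scoring >= min_score, within a token budget.

    The default values are illustrative assumptions, not tuned settings.
    """
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if len(selected) == top_k or chunk.score < min_score:
            break  # ranked is sorted, so no later chunk can qualify
        if used + chunk.tokens > max_tokens:
            continue  # skip a chunk that overflows; a smaller one may still fit
        selected.append(chunk)
        used += chunk.tokens
    return selected


def input_cost(tokens, price_per_million):
    """Input cost is linear in the number of input tokens."""
    return tokens * price_per_million / 1_000_000


# Ten retrieved chunks of 500 tokens each, with decreasing similarity.
retrieved = [Chunk(f"chunk-{i}", 0.95 - 0.03 * i, 500) for i in range(10)]
context = select_context(retrieved)

price = 1.0  # hypothetical price per million input tokens
full_cost = input_cost(sum(c.tokens for c in retrieved), price)
capped_cost = input_cost(sum(c.tokens for c in context), price)
print(f"kept {len(context)} of {len(retrieved)} chunks")
print(f"context cost reduction: {1 - capped_cost / full_cost:.0%}")
```

The saving applies to the retrieved-context portion of the prompt only; the system prompt and the user question are billed regardless, so the reduction on the total input bill will be somewhat smaller.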
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T07:03:12.509232+00:00— report_created — created