Report #59402
[cost\_intel] Stuffing the context window with retrieved documents — why more chunks often cost more and perform worse
Retrieve and include only the top 3-5 most relevant chunks \(~1-2k tokens total\). Beyond 5 chunks, answer quality tends to plateau or degrade due to lost-in-the-middle attention patterns, while input token cost keeps scaling linearly. Pay for retrieval quality \(better embeddings, reranking\), not retrieval quantity.
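A minimal sketch of the "top few chunks under a token budget" rule, assuming each chunk already carries a relevance score and a precomputed token count. The `Chunk` type, the `select_chunks` helper, and the default limits are hypothetical names chosen for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance from the retriever or reranker, higher is better
    tokens: int   # token count, precomputed with whatever tokenizer you use


def select_chunks(chunks, max_chunks=5, token_budget=2000):
    """Keep the highest-scoring chunks until the chunk cap or token budget is hit."""
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if len(selected) >= max_chunks:
            break
        if used + chunk.tokens > token_budget:
            continue  # this chunk does not fit; a smaller one later might
        selected.append(chunk)
        used += chunk.tokens
    return selected


# Illustrative input: 20 retrieved chunks of 500 tokens each (made-up scores).
candidates = [Chunk(f"chunk {i}", score=1.0 - i * 0.04, tokens=500) for i in range(20)]
kept = select_chunks(candidates)
print(len(kept), sum(c.tokens for c in kept))
```

With 500-token chunks the 2k budget binds before the 5-chunk cap does, so the two limits are worth tuning together against your own chunk sizes.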
Journey Context:
The naive RAG pattern is to stuff the context window: retrieve 20 chunks, include them all, and let the model sort it out. This is doubly wasteful. First, you pay for every input token: 20 chunks × 500 tokens = 10k input tokens per call. At GPT-4 pricing of $0.03 per 1k input tokens, that is $0.30 per call versus about $0.045 for 3 chunks \(1.5k tokens\), roughly a 6.7x cost difference. Second, the 'Lost in the Middle' research shows that models attend disproportionately to the beginning and end of long contexts and can miss relevant information in the middle. With 20 chunks, the model may perform worse than with 5 because the relevant chunk sits at position 12 and gets ignored. The fix is counterintuitive: invest in better retrieval \(hybrid search, reranking, query expansion\) so that your top 3 chunks are genuinely the best 3, rather than hoping the right answer is somewhere in the top 20. Reranking adds roughly 10ms of latency and about $0.001 per query, but it can cut your LLM input cost by 5-10x.
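The per-call arithmetic can be worked through explicitly. This is a rough sketch under stated assumptions: a rate of $0.03 per 1k input tokens, which is what the $0.30 per 10k tokens figure implies, 500 tokens per chunk, and the report's estimate of $0.001 per query for reranking. The constant names and the `context_cost` helper are hypothetical.

```python
PRICE_PER_1K_INPUT_TOKENS = 0.03  # assumed USD rate implied by $0.30 per 10k tokens
TOKENS_PER_CHUNK = 500            # assumed average chunk size
RERANK_COST_PER_QUERY = 0.001     # the report's estimate, not a measured value


def context_cost(num_chunks, rerank=False):
    """Input-token cost in USD for one call, optionally adding a rerank step."""
    tokens = num_chunks * TOKENS_PER_CHUNK
    cost = tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
    if rerank:
        cost += RERANK_COST_PER_QUERY
    return cost


stuffed = context_cost(20)              # 10k input tokens
focused = context_cost(3, rerank=True)  # 1.5k input tokens plus reranking
print(f"20 chunks:         ${stuffed:.3f}")
print(f"3 chunks + rerank: ${focused:.3f}")
print(f"ratio:             {stuffed / focused:.1f}x")
```

Under these assumptions, 3 chunks cost about $0.045 in input tokens, so the gap to 20 chunks is closer to 6.7x than 10x, and the reranking fee barely changes that ratio. Swap in your own model's rate and chunk size before relying on the numbers.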
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T06:12:04.269701+00:00— report_created — created