Report #96239
[cost_intel] RAG pipelines: sending too many retrieved chunks degrades quality AND increases cost simultaneously
Retrieve 3-5 highly relevant chunks instead of 10-20 loosely relevant ones. Use two-stage retrieval: broad recall with embedding search, then rerank the candidates with a cross-encoder and keep only the top 3-5. This cuts context token usage by roughly 60-80% and often improves answer quality, because there is less distractor context to drown out the relevant passages.
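The two-stage flow described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `two_stage_retrieve`, `toy_recall_scorer`, and `toy_rerank_scorer` are hypothetical names, and the two toy scorers are word-overlap stand-ins for a real embedding similarity and a real cross-encoder, which are assumed to be plugged in through the same callable interface.

```python
from typing import Callable, List

# A scorer takes (query, chunk) and returns a relevance score; higher is better.
Scorer = Callable[[str, str], float]


def two_stage_retrieve(
    query: str,
    corpus: List[str],
    recall_scorer: Scorer,
    rerank_scorer: Scorer,
    recall_k: int = 20,
    final_k: int = 3,
) -> List[str]:
    """Broad recall with a cheap scorer, then rerank and keep only final_k chunks."""
    # Stage 1: cheap, recall-oriented pass over the whole corpus.
    candidates = sorted(
        corpus, key=lambda chunk: recall_scorer(query, chunk), reverse=True
    )[:recall_k]
    # Stage 2: more precise scorer, applied only to the shortlisted candidates.
    reranked = sorted(
        candidates, key=lambda chunk: rerank_scorer(query, chunk), reverse=True
    )
    return reranked[:final_k]


def _tokens(text: str) -> set:
    return set(text.lower().split())


def toy_recall_scorer(query: str, chunk: str) -> float:
    # Stand-in for embedding similarity (assumption): count of shared words.
    return float(len(_tokens(query) & _tokens(chunk)))


def toy_rerank_scorer(query: str, chunk: str) -> float:
    # Stand-in for a cross-encoder (assumption): Jaccard overlap of word sets.
    q, c = _tokens(query), _tokens(chunk)
    union = q | c
    return len(q & c) / len(union) if union else 0.0


corpus = [
    "reranking with a cross-encoder improves precision",
    "embedding search gives broad recall",
    "the weather is nice today",
    "cross-encoder reranking selects chunks",
    "token cost grows with every chunk sent",
]

top_chunks = two_stage_retrieve(
    "cross-encoder reranking",
    corpus,
    toy_recall_scorer,
    toy_rerank_scorer,
    recall_k=4,
    final_k=2,
)
print(top_chunks)
```

Keeping the scorers as plain callables means the recall stage and the rerank stage can be swapped independently, and `final_k` is the single knob that controls how many chunks (and therefore how many tokens) reach the prompt.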
Journey Context:
The common pattern in RAG is to retrieve many chunks "just in case" and stuff them all into the context. This has two costs: (1) direct token cost: 10 chunks at 500 tokens each is 5K tokens per query, versus 1.5K tokens for 3 chunks of the same size, a 3.3x difference; (2) quality cost: more chunks means more distractor information, and models (especially smaller ones) degrade when relevant information is buried in irrelevant context. A related effect is the "lost in the middle" phenomenon: models tend to use information at the beginning and end of the context more reliably than content in the middle. At 100K queries/month with 10 chunks of 500 tokens each, that is 500M context tokens; at $3 per million tokens, that costs $1,500/month. Trimmed to 3 chunks, it is 150M tokens, or $450/month. Reranking adds its own cost, but it is negligible per query because cross-encoders are fast and cheap. Net savings: $1,050/month before reranking cost, so approximately $1,000/month, with better quality. The fundamental mistake is optimizing retrieval recall at the expense of precision; RAG is a precision game. More chunks do not mean better answers; they often mean worse answers at higher cost.
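The cost arithmetic above can be reproduced with a small helper. This is a sketch using the report's own figures (100K queries/month, 500-token chunks, $3 per million tokens); `monthly_context_cost` is a hypothetical name, and it counts only retrieved-context tokens, not the rest of the prompt, the completion, or the reranker.

```python
def monthly_context_cost(
    queries_per_month: int,
    chunks_per_query: int,
    tokens_per_chunk: int,
    usd_per_million_tokens: float,
) -> float:
    """Monthly cost in USD of the retrieved-context tokens alone."""
    total_tokens = queries_per_month * chunks_per_query * tokens_per_chunk
    return total_tokens * usd_per_million_tokens / 1_000_000


# Figures taken from the report: 10 chunks versus 3 chunks per query.
cost_10_chunks = monthly_context_cost(100_000, 10, 500, 3.0)
cost_3_chunks = monthly_context_cost(100_000, 3, 500, 3.0)
savings = cost_10_chunks - cost_3_chunks

print(f"10 chunks: ${cost_10_chunks:,.0f}/month")
print(f" 3 chunks: ${cost_3_chunks:,.0f}/month")
print(f"  savings: ${savings:,.0f}/month (before reranking cost)")
```

Substituting your own query volume, chunk size, and per-token price shows quickly whether trimming chunks is worth the added reranking step for a given workload.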
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:07:25.728013+00:00— report_created — created