Report #58095
[cost_intel] RAG retrieval sends 20k-token contexts to LLM without re-ranking
Implement cross-encoder re-ranking (Cohere Rerank or BGE) to keep only the top 3 chunks from the retrieved set before the LLM call; cap context at 2k tokens for summarization tasks regardless of the model's context window size.
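A minimal sketch of the recommendation, assuming a pluggable scorer: `rerank_and_cap`, `approx_tokens`, and `overlap_score` are hypothetical names introduced here, not part of any library. In practice `score_fn` would wrap the chosen re-ranker (Cohere Rerank or a BGE cross-encoder); the word-overlap scorer below is only a stand-in so the example runs offline, and the whitespace token count is an approximation, not a real tokenizer.

```python
from typing import Callable, List, Tuple


def approx_tokens(text: str) -> int:
    """Rough token count: whitespace-separated words (assumption, not a real tokenizer)."""
    return len(text.split())


def rerank_and_cap(
    query: str,
    chunks: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
    max_tokens: int = 2000,
) -> List[str]:
    """Return up to top_k highest-scoring chunks whose combined size fits max_tokens."""
    # Score every (query, chunk) pair; score_fn stands in for the re-ranker call.
    scored: List[Tuple[float, int, str]] = [
        (score_fn(query, chunk), idx, chunk) for idx, chunk in enumerate(chunks)
    ]
    # Highest score first; ties keep the original retrieval order.
    scored.sort(key=lambda item: (-item[0], item[1]))

    selected: List[str] = []
    used = 0
    for _, _, chunk in scored[:top_k]:
        size = approx_tokens(chunk)
        if used + size > max_tokens:
            continue  # skip a chunk that would push the context past the cap
        selected.append(chunk)
        used += size
    return selected


def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in scorer: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "refund policy for annual plans",
        "how to request a refund for annual plans",
        "weather today",
    ]
    for chunk in rerank_and_cap("refund annual plans", docs, overlap_score, top_k=2):
        print(chunk)
```

The cap is applied after ranking, so the budget is spent on the most relevant chunks first and a lower-ranked chunk is only dropped when it no longer fits.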
Journey Context:
RAG pipelines often retrieve 10 documents at 2k tokens each (20k tokens) to 'ensure coverage' and send all of them to GPT-4o. At $2.50 per 1M input tokens, that is $0.05 per query. A cross-encoder re-ranker (Cohere Rerank v3 at $0.001 per query) selects the 3 most relevant chunks (600 tokens total, which implies passage-sized chunks of about 200 tokens rather than the whole 2k-token documents), reducing LLM input cost to $0.0015. That is a 33x reduction in LLM input cost, or 20x once the $0.001 re-rank fee is included ($0.0025 per query). The silent cost killer is 'context window optimism': teams assume that because a model accepts 128k tokens, filling it is efficient. In reality, input tokens are billed linearly whether or not the model makes use of them, and long contexts suffer from lost-in-the-middle attention decay (accuracy reportedly drops about 20% on middle chunks). Quality often improves with less context because the signal-to-noise ratio is higher.
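The per-query arithmetic can be checked with a short script. This is a sketch using only the prices quoted in this report ($2.50 per 1M input tokens, $0.001 per re-rank call); the function and constant names are illustrative, and output-token costs are left out because the report does not give them.

```python
PRICE_PER_M_INPUT = 2.50  # USD per 1M input tokens, as quoted in this report
RERANK_FEE = 0.001        # USD per re-rank call, as quoted in this report


def llm_input_cost(tokens: int, price_per_million: float = PRICE_PER_M_INPUT) -> float:
    """Input-side LLM cost in USD for a single query."""
    return tokens * price_per_million / 1_000_000


# Baseline: 10 documents x 2k tokens sent straight to the LLM.
baseline = llm_input_cost(20_000)

# Re-ranked: 600 tokens of context, plus the re-rank fee.
reranked_llm = llm_input_cost(600)
reranked_total = reranked_llm + RERANK_FEE

# Worst case under the recommended 2k-token cap, plus the re-rank fee.
capped_total = llm_input_cost(2_000) + RERANK_FEE

print(f"baseline per query:        ${baseline:.4f}")
print(f"re-ranked, LLM input only: ${reranked_llm:.4f} ({baseline / reranked_llm:.1f}x cheaper)")
print(f"re-ranked, incl. fee:      ${reranked_total:.4f} ({baseline / reranked_total:.1f}x cheaper)")
print(f"2k cap, incl. fee:         ${capped_total:.4f} ({baseline / capped_total:.1f}x cheaper)")
```

Once the context is this small, the fixed re-rank fee becomes a large share of the per-query cost, which is why the end-to-end saving is smaller than the LLM-only figure.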
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:00:07.518576+00:00— report_created — created