Report #54937
[cost\_intel] Missing reranking step wastes ~70% of input tokens in RAG retrieval with top-10 chunks
Add a reranking step \(Cohere Rerank, BGE-reranker\) that narrows the top-10 retrieved chunks to the top 3 before the LLM call. Dropping 7 of 10 chunks cuts retrieved-context input tokens by about 70%, with a quality drop of under 2%. The reranker \($0.002 per 100 docs\) costs roughly 50x less than the LLM tokens it removes from a long context. The saving matters most when the assembled context exceeds 8k tokens.
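The top-10 to top-3 filter described above can be sketched as follows. This is a minimal illustration, not the Cohere or BGE API: `rerank_top_k` and `toy_overlap_score` are hypothetical names, and the word-overlap scorer is only a stand-in so the example runs without a model. In practice `score_fn` would wrap a real cross-encoder or a hosted rerank endpoint.

```python
from typing import Callable, List, Sequence, Tuple


def rerank_top_k(
    query: str,
    chunks: Sequence[str],
    score_fn: Callable[[str, str], float],
    k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every (query, chunk) pair and keep the k highest-scoring chunks."""
    scored = [(score_fn(query, chunk), chunk) for chunk in chunks]
    # Python's sort is stable, so tied chunks keep the retriever's original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def toy_overlap_score(query: str, chunk: str) -> float:
    """Placeholder scorer (assumption): fraction of query words present in the chunk.

    A real deployment would replace this with a reranker model's relevance score.
    """
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


if __name__ == "__main__":
    retrieved = [
        "billing faq and invoices",
        "how to reset your api key",
        "office opening hours",
        "api key rotation policy",
        "reset password steps",
    ]
    # Only the surviving chunks would be placed in the LLM prompt.
    for score, chunk in rerank_top_k("reset api key", retrieved, toy_overlap_score):
        print(f"{score:.2f}  {chunk}")
```

Passing the scorer in as a callable keeps the filtering logic independent of the reranker vendor, so swapping models does not change the code that builds the prompt.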
Journey Context:
Engineers often scale RAG by adding more chunks to 'increase recall,' unaware that LLM cost scales linearly with context length while attention degrades on long inputs \(the 'lost in the middle' effect\). The fix isn't a better embedder; it's a cheap reranking step that respects the LLM's limited attention budget. Many teams skip it because the extra API call adds latency, but the cost savings can fund a move to a faster LLM. The break-even is immediate: reranking 10 docs costs $0.0002, while processing 7 extra chunks in GPT-4 costs $0.01\+.
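The break-even arithmetic can be checked with a short sketch. It uses only the figures quoted in this report \($0.002 per 100 docs, $0.01 for the 7 dropped chunks\); the function names are hypothetical, and the token-reduction helper assumes all chunks are roughly the same size, which real corpora may not satisfy.

```python
def rerank_cost(num_docs: int, price_per_100_docs: float = 0.002) -> float:
    """Reranker cost in dollars for one query, using the report's quoted price."""
    return num_docs * price_per_100_docs / 100


def context_token_reduction(retrieved: int, kept: int) -> float:
    """Fraction of retrieved-context tokens removed, assuming equal-sized chunks."""
    if retrieved <= 0 or not 0 <= kept <= retrieved:
        raise ValueError("need retrieved > 0 and 0 <= kept <= retrieved")
    return 1 - kept / retrieved


def net_saving_per_query(num_docs: int, llm_cost_of_dropped_chunks: float) -> float:
    """Dollars saved per query: LLM cost avoided minus the reranker fee."""
    return llm_cost_of_dropped_chunks - rerank_cost(num_docs)


def cost_ratio(num_docs: int, llm_cost_of_dropped_chunks: float) -> float:
    """How many times more the dropped chunks would cost in the LLM than the rerank."""
    return llm_cost_of_dropped_chunks / rerank_cost(num_docs)


if __name__ == "__main__":
    # Figures from the report: 10 docs reranked, 7 chunks dropped at ~$0.01 total.
    print(f"rerank cost:     ${rerank_cost(10):.4f}")
    print(f"token reduction: {context_token_reduction(10, 3):.0%}")
    print(f"net saving:      ${net_saving_per_query(10, 0.01):.4f}")
    print(f"cost ratio:      {cost_ratio(10, 0.01):.0f}x")
```

Because the LLM-side figure is passed in rather than hard-coded, the same check can be rerun with current pricing for whichever model is in use.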
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T22:42:19.673379+00:00— report_created — created