Report #92046
[cost_intel] Should I use LLM reranking in RAG pipelines?
Add a reranker (Cohere Rerank or a cross-encoder) only when your embedding retrieval's top-20 accuracy is below 70%. Otherwise, raise the embedding top-k from 5 to 20 chunks and feed them directly to the LLM. LLM reranking adds 10-50x cost per query compared with embedding retrieval alone.
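The 70% rule can be checked against a small labeled evaluation set before committing to either path. The sketch below is one possible reading, not a definitive method: it assumes "top-20 accuracy" means the fraction of queries for which at least one known-relevant chunk appears in the top 20 retrieved results. The function names `hit_rate_at_k` and `choose_strategy` and the chunk-ID data layout are hypothetical.

```python
from typing import Sequence, Set


def hit_rate_at_k(
    retrieved: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int = 20,
) -> float:
    """Fraction of queries with at least one relevant chunk ID in the top-k.

    Assumption: this is what the report means by "top-20 accuracy".
    retrieved[i] is the ranked list of chunk IDs for query i;
    relevant[i] is the set of chunk IDs labeled relevant for query i.
    """
    if len(retrieved) != len(relevant):
        raise ValueError("retrieved and relevant must have the same length")
    if not retrieved:
        raise ValueError("need at least one query")
    hits = sum(
        1 for ids, gold in zip(retrieved, relevant) if gold.intersection(ids[:k])
    )
    return hits / len(retrieved)


def choose_strategy(top20_hit_rate: float, threshold: float = 0.70) -> str:
    """Apply the report's rule: rerank only when the hit rate is below the threshold."""
    return "add_reranker" if top20_hit_rate < threshold else "widen_top_k"
```

Run `hit_rate_at_k` over the ranked chunk IDs your retriever returns for each evaluation query, then pass the result to `choose_strategy`. A value of exactly 0.70 is treated as "not below", so it selects `widen_top_k`.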
Journey Context:
Reranking adds a heavy inference layer that often eliminates the cost advantage of cheap embedding retrieval. For most document Q&A, retrieving more chunks via embeddings (cheap) and letting the generation LLM filter them is more cost-effective than a separate reranking step. Reranking pays off only in high-noise environments (legal documents with near-identical passages, keyword-heavy spam) where embedding precision is genuinely poor.
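The trade-off described here comes down to per-query arithmetic: the extra prompt tokens from widening top-k versus the price of a separate rerank call. The sketch below is a rough cost model under stated assumptions, not measured data. All prices and token counts are inputs you supply from your own provider's pricing; the function names and the flat per-token pricing model are hypothetical simplifications.

```python
def widen_top_k_cost(
    extra_chunks: int,
    tokens_per_chunk: int,
    llm_price_per_1k_input_tokens: float,
) -> float:
    """Extra generation cost per query from sending more chunks to the LLM.

    Assumes input tokens are billed at a flat rate per 1,000 tokens.
    """
    extra_tokens = extra_chunks * tokens_per_chunk
    return extra_tokens * llm_price_per_1k_input_tokens / 1000


def cheaper_option(widen_cost: float, rerank_cost: float) -> str:
    """Return which option costs less per query; ties favor the simpler pipeline."""
    return "widen_top_k" if widen_cost <= rerank_cost else "add_reranker"


# Placeholder numbers for illustration only; substitute real pricing.
widen = widen_top_k_cost(
    extra_chunks=15,  # going from top-5 to top-20
    tokens_per_chunk=200,  # hypothetical average chunk size
    llm_price_per_1k_input_tokens=0.001,  # hypothetical price
)
print(cheaper_option(widen, rerank_cost=0.01))  # 0.01 is also hypothetical
```

This compares cost only. It says nothing about answer quality, which is why the retrieval accuracy check still decides whether a reranker is needed in noisy corpora.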
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T13:05:22.689944+00:00— report_created — created