Report #71219
[cost_intel] Using GPT-4o to rerank retrieval results instead of dedicated rerankers
For retrieval augmentation with more than 50 candidates per query, use a dedicated reranker (Cohere Rerank v3 at $0.002 per document) instead of LLM-based ranking (GPT-4o at ~$0.005 per 1k tokens). Reranking 100 documents with Cohere costs about $0.20 with sub-100ms latency. The same job with GPT-4o costs $2.00 or more, a figure that implies roughly 4,000 tokens per document at the quoted rate, and comes with higher latency. Quality (NDCG@10) is comparable on standard RAG benchmarks, and dedicated rerankers often outperform zero-shot LLM ranking.
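The cost comparison above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the prices quoted in this report. The function names and the 4,000 tokens-per-document figure are assumptions introduced for illustration; the latter is simply the value at which the GPT-4o estimate reaches $2.00. Check current vendor pricing before relying on any of these numbers.

```python
# Prices as quoted in this report; treat them as assumptions, not current list prices.
RERANKER_PRICE_PER_DOC = 0.002      # USD per document (dedicated reranker)
LLM_PRICE_PER_1K_TOKENS = 0.005     # USD per 1k tokens (GPT-4o, approximate)


def reranker_cost(num_docs: int, price_per_doc: float = RERANKER_PRICE_PER_DOC) -> float:
    """Cost in USD of scoring num_docs candidates with a per-document-priced reranker."""
    return num_docs * price_per_doc


def llm_rerank_cost(
    num_docs: int,
    tokens_per_doc: int,
    price_per_1k_tokens: float = LLM_PRICE_PER_1K_TOKENS,
) -> float:
    """Cost in USD of sending num_docs candidates to an LLM, counting document tokens only."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1000 * price_per_1k_tokens


if __name__ == "__main__":
    docs = 100
    tokens_per_doc = 4000  # hypothetical average document length
    dedicated = reranker_cost(docs)
    llm = llm_rerank_cost(docs, tokens_per_doc)
    print(f"dedicated reranker: ${dedicated:.2f}")
    print(f"LLM reranking:      ${llm:.2f}")
    print(f"ratio:              {llm / dedicated:.0f}x")
```

Shorter documents narrow the gap and longer ones widen it, so the ratio is worth recomputing with your own average chunk length.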
Journey Context:
Teams often build RAG as 'retrieve 100, then ask GPT-4 to pick the best', which inflates both cost and latency. Dedicated rerankers (Cohere, BGE, Triton-hosted cross-encoders) are optimized for this specific task and roughly 10x cheaper at the prices quoted above. Any quality drop is minimal for most RAG tasks: embedding retrieval does the heavy lifting, and the reranker mainly needs to push obvious false positives out of the top results. The error is using a generalist model for a specialist task.
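The retrieve-then-rerank pattern described here can be sketched with a pluggable scoring function. In the sketch below, `rerank` and `overlap_score` are hypothetical names, and `overlap_score` is a toy lexical stand-in so the example runs without any model or API key. In practice the scoring function would wrap a dedicated reranker such as a cross-encoder or a hosted rerank endpoint.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 10,
) -> List[Tuple[int, float]]:
    """Score every candidate against the query and return the top_n
    (original_index, score) pairs, highest score first."""
    scored = [(i, score_fn(query, doc)) for i, doc in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words that appear in the document.
    A stand-in only; a real deployment would call a dedicated reranker here."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


if __name__ == "__main__":
    # Pretend these came back from the embedding retrieval stage.
    retrieved = [
        "how to reset a password",
        "pricing for enterprise plans",
        "reset your account password by email",
    ]
    for index, score in rerank("reset password email", retrieved, overlap_score, top_n=2):
        print(f"{score:.2f}  {retrieved[index]}")
```

Keeping the scorer behind a plain function signature makes it cheap to swap a hosted reranker for a self-hosted cross-encoder and compare cost and quality on the same candidate sets.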
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:07:19.321378+00:00— report_created — created