Report #49055
[cost\_intel] When should I use embedding retrieval vs LLM re-ranking for cost-efficient RAG?
Use embedding retrieval \(cosine similarity\) for recall-oriented candidate selection \($0.10/1M tokens via text-embedding-3-small\). Reserve LLM re-ranking, which like a cross-encoder scores each query-chunk pair jointly, for cases where precision@5 is critical and budget allows \($3/1M tokens for 4o-mini\). Hybrid approach: embeddings narrow the corpus to the top-20, then a lightweight LLM \(Haiku/Flash\) re-ranks those 20 down to the top-5. This 2-stage pipeline costs $0.50/1M vs $15/1M for pure LLM ranking, with a recall drop of <5%.
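A minimal sketch of the two-stage hybrid described above, assuming chunk embeddings are already computed and held in memory. The names `two_stage_retrieve` and `rerank_score` are hypothetical: `rerank_score` stands in for whatever lightweight LLM call returns a relevance score for a chunk, and is not a real library API.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_retrieve(
    query_vec: Sequence[float],
    chunks: Sequence[tuple[str, Sequence[float]]],
    rerank_score: Callable[[str], float],  # hypothetical stand-in for an LLM call
    k_embed: int = 20,
    k_final: int = 5,
) -> list[str]:
    # Stage 1 (cheap, recall-oriented): rank every chunk by cosine similarity
    # and keep the k_embed best candidates.
    by_similarity = sorted(
        chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True
    )
    candidates = [text for text, _ in by_similarity[:k_embed]]

    # Stage 2 (costlier, precision-oriented): only the stage-1 survivors
    # are scored by the re-ranker, then cut down to k_final.
    return sorted(candidates, key=rerank_score, reverse=True)[:k_final]
```

Because only `k_embed` chunks ever reach the re-ranker, LLM token spend per query is bounded by `k_embed` rather than by corpus size.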
Journey Context:
Teams implement 'corrective RAG' or 'self-correction' patterns in which an LLM re-ranks 100 chunks per query. At 100 chunks \* 500 tokens \* 100k queries/day, that is 5B tokens/day; at $3/1M \(4o-mini\), that's $15k/day. Applying the $0.10/1M embedding rate to the same volume gives $500/day, which is an upper bound because chunks are embedded once at index time and only the query is embedded per request. Embedding-only retrieval, however, suffers from lexical and synonym mismatches. The cost-quality Pareto frontier is a cascade: the embedding index returns the top-50 \(recall-oriented\), a cheap LLM \(Haiku, $0.25/1M\) filters to the top-10, and the expensive LLM \(Sonnet\) reads only the top-10. Sonnet then reads 5x fewer tokens than if it read all 50, so the cascade's saving over a Sonnet-only pass approaches, but stays under, 5x once the Haiku pass is added. Critical insight: don't use expensive models for recall \(finding candidates\), only for precision \(ranking finalists\).
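The arithmetic above can be checked with a short sketch. It uses the per-token rates quoted in this report and counts input tokens only. The report gives no Sonnet price, so `SONNET_USD_PER_M` is an assumed placeholder, and `daily_cost` is a hypothetical helper name.

```python
TOKENS_PER_CHUNK = 500
QUERIES_PER_DAY = 100_000


def daily_cost(chunks_per_query: int, usd_per_million_tokens: float) -> float:
    """Daily input-token spend for one model reading N chunks per query."""
    tokens = chunks_per_query * TOKENS_PER_CHUNK * QUERIES_PER_DAY
    return tokens / 1_000_000 * usd_per_million_tokens


# Rates as quoted in the report.
llm_rerank_100 = daily_cost(100, 3.00)  # LLM re-ranks 100 chunks per query
haiku_filter_50 = daily_cost(50, 0.25)  # Haiku filters the top-50

# Assumption: placeholder Sonnet input rate, not stated in the report.
SONNET_USD_PER_M = 3.00
sonnet_all_50 = daily_cost(50, SONNET_USD_PER_M)
cascade = haiku_filter_50 + daily_cost(10, SONNET_USD_PER_M)

print(f"LLM re-rank of 100 chunks: ${llm_rerank_100:,.0f}/day")
print(f"Sonnet reads all 50:       ${sonnet_all_50:,.0f}/day")
print(f"Haiku 50 -> Sonnet 10:     ${cascade:,.0f}/day")
print(f"Cascade saving:            {sonnet_all_50 / cascade:.2f}x")
```

Under these assumptions the cascade costs $2,125/day against $7,500/day for Sonnet reading all 50, a saving of about 3.5x. The ratio moves with the assumed Sonnet rate but cannot exceed 5x, since Sonnet still reads one fifth of the tokens.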
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:49:18.726576+00:00— report_created — created