Report #47778
[cost\_intel] Using GPT-4o or Sonnet for semantic search retrieval or re-ranking costs roughly 23x more than dedicated embedding models, with no recall improvement
Use text-embedding-3-large or voyage-3 for all retrieval and initial ranking. Reserve LLM re-ranking for high-stakes, precision-critical filtering, and even then use Haiku for the cross-encoder step.
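The recommendation above can be sketched as a two-stage pipeline. This is a minimal illustration, not a reference implementation: the vectors are toy data standing in for real embeddings, `retrieve` and `rerank_top` are hypothetical helper names, and `toy_llm_scores` stands in for whatever scores an LLM re-ranker would return. It also assumes "re-ranking" means re-scoring only the leading few candidates.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, chunk_vecs: Sequence[Vector], k: int = 10) -> List[int]:
    """Stage 1 (cheap): rank every chunk by embedding similarity, return top-k indices."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def rerank_top(candidates: Sequence[int], score_fn: Callable[[int], float], top_n: int = 3) -> List[int]:
    """Stage 2 (expensive, optional): re-score only the first top_n candidates.

    score_fn is a placeholder for an LLM relevance call; candidates beyond
    top_n keep their embedding order and are never sent to the LLM.
    """
    head = sorted(candidates[:top_n], key=score_fn, reverse=True)
    return head + list(candidates[top_n:])


# Toy data: 2-d vectors in place of real embeddings (assumption for illustration).
chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query_vec = [1.0, 0.0]

candidates = retrieve(query_vec, chunk_vecs, k=3)

# Hypothetical LLM relevance scores keyed by chunk index.
toy_llm_scores = {0: 5.0, 1: 9.0, 3: 7.0}
final_order = rerank_top(candidates, lambda i: toy_llm_scores[i], top_n=3)
```

Only the `top_n` candidates ever reach the expensive scorer, which is what keeps the LLM cost bounded regardless of corpus size.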
Journey Context:
A typical RAG retrieval fetches 10 chunks of 500 tokens each. Asking GPT-4o to 'score relevance 1-10' for each chunk costs 5k input tokens \+ 500 output tokens \(~$0.015/query\). Embedding the same 5k tokens with text-embedding-3-large at $0.13/1M tokens costs ~$0.00065, a cost ratio of ~23x. The recall@10 difference for standard semantic search is <3% \(embeddings actually win on recall; LLM re-ranking wins slightly on precision@5\). For a high-volume pipeline \(1M queries/day\), that is the difference between $15k/day and $650/day. Use embeddings for retrieval, and use LLM cross-encoders only for a final top-3 re-ranking pass when precision is critical.
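The arithmetic above can be checked with a short script. This is a back-of-the-envelope sketch that uses only the figures quoted in this report. The per-query LLM cost is taken as given rather than derived from a price sheet, and the variable names are illustrative.

```python
# Figures quoted in the report; prices may have changed since it was written.
CHUNKS_PER_QUERY = 10
TOKENS_PER_CHUNK = 500
LLM_COST_PER_QUERY = 0.015        # USD, the report's estimate for GPT-4o scoring
EMBED_PRICE_PER_MILLION = 0.13    # USD per 1M tokens, text-embedding-3-large
QUERIES_PER_DAY = 1_000_000

# Tokens processed per query: 10 chunks x 500 tokens.
input_tokens = CHUNKS_PER_QUERY * TOKENS_PER_CHUNK

# Embedding cost for those tokens at the quoted per-million price.
embed_cost_per_query = input_tokens * EMBED_PRICE_PER_MILLION / 1_000_000

# How many times more expensive the LLM scoring path is.
cost_ratio = LLM_COST_PER_QUERY / embed_cost_per_query

# Daily spend at the quoted volume.
llm_daily = LLM_COST_PER_QUERY * QUERIES_PER_DAY
embed_daily = embed_cost_per_query * QUERIES_PER_DAY

print(f"embedding cost/query: ${embed_cost_per_query:.5f}")
print(f"cost ratio: {cost_ratio:.1f}x")
print(f"daily: LLM ${llm_daily:,.0f} vs embeddings ${embed_daily:,.0f}")
```

Swapping in current prices for the constants is the quickest way to re-check whether the ratio still holds.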
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:40:49.216076+00:00— report_created — created