Report #42380
[cost_intel] Unoptimized vector retrieval causing unnecessary embedding model costs
Implement two-stage retrieval: use text-embedding-3-small (1536-dim) to retrieve the top-100 candidates, then rerank them with a cross-encoder (ms-marco-MiniLM). Reported result: about 95% of large-embedding recall at roughly 20% of the cost.
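The two-stage flow described here can be sketched as follows. This is a minimal illustration, not the reporter's implementation: `two_stage_retrieve`, `cosine`, and `overlap_score` are hypothetical names, the 2-dimensional vectors stand in for text-embedding-3-small output, and the word-overlap scorer stands in for a cross-encoder so the sketch runs without network access or model downloads. The defaults of 100 candidates and 20 kept results follow the figures in this report.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_retrieve(
    query: str,
    query_vec: Vector,
    chunks: Sequence[str],
    chunk_vecs: Sequence[Vector],
    score_pair: Callable[[str, str], float],
    candidates: int = 100,
    keep: int = 20,
) -> list[tuple[int, float]]:
    """Return (chunk_index, rerank_score) pairs, best first."""
    # Stage 1: cheap vector similarity over every chunk, keep a shortlist.
    sims = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    shortlist = [i for i, _ in sims[:candidates]]

    # Stage 2: score only the shortlist with the (more expensive) pair scorer.
    reranked = [(i, score_pair(query, chunks[i])) for i in shortlist]
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked[:keep]


def overlap_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


# Toy corpus; the vectors are made up for illustration only.
chunks = [
    "refund policy for annual plans",
    "how to reset a password",
    "annual plan pricing and refund window",
    "office opening hours",
]
chunk_vecs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.3], [0.0, 1.0]]
query = "refund window for annual plan"
query_vec = [1.0, 0.0]

result = two_stage_retrieve(
    query, query_vec, chunks, chunk_vecs, overlap_score, candidates=2, keep=1
)
print(result)  # [(2, 4.0)]
```

In the toy run, stage 1 shortlists chunks 0 and 2 by vector similarity, with chunk 0 ranked first; the pair scorer then promotes chunk 2. That reordering is the job the reranker does in the pattern. In a real deployment `score_pair` would wrap the cross-encoder and would normally be called in batches rather than one pair at a time.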
Journey Context:
Teams use text-embedding-3-large (3072-dim, $0.13/1M tokens) believing it is necessary for good RAG. However, large embeddings primarily help with fine-grained semantic distinction across millions of documents. For datasets under 100k chunks, small embeddings (1536-dim, $0.02/1M tokens) plus a cross-encoder reranker are reported to come within 2-3% of the accuracy of large embeddings alone, at roughly one fifth of the cost (the embedding price ratio by itself is $0.02 / $0.13, about 15%). The MiniLM reranker can run locally at no API cost or through a cheap hosted API. Retrieve the top 100 candidates with the small embedding, rerank them with the cross-encoder, and keep the top 20. Provenance: SBERT documentation on retrieve-and-rerank patterns. Pitfall: using large embeddings for the initial retrieval and then reranking wastes money; always use the cheap model for candidate generation.
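The cost comparison can be checked with a few lines of arithmetic. The per-token prices are the ones quoted in this report; the corpus size (100k chunks of about 500 tokens each) and the name `embedding_cost` are assumptions made up for illustration, and reranker compute is not included.

```python
SMALL_USD_PER_M = 0.02  # text-embedding-3-small, USD per 1M tokens (from the report)
LARGE_USD_PER_M = 0.13  # text-embedding-3-large, USD per 1M tokens (from the report)


def embedding_cost(tokens: int, usd_per_million: float) -> float:
    """Embedding API cost in USD for a given number of tokens."""
    return tokens / 1_000_000 * usd_per_million


# Hypothetical corpus: 100,000 chunks x 500 tokens each.
corpus_tokens = 100_000 * 500

small = embedding_cost(corpus_tokens, SMALL_USD_PER_M)
large = embedding_cost(corpus_tokens, LARGE_USD_PER_M)
print(f"small: ${small:.2f}, large: ${large:.2f}, ratio: {small / large:.1%}")
# small: $1.00, large: $6.50, ratio: 15.4%
```

The ratio does not depend on corpus size, so the same 15.4% applies to query-time embedding as well.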
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T01:36:26.845998+00:00— report_created — created