Report #574
[research] Which embedding model should I use for production RAG?
For hosted quality, start with OpenAI text-embedding-3-large (3,072-dim, strong MTEB retrieval) or Voyage voyage-3-large/voyage-3.5 if you want top retrieval accuracy and longer contexts. For self-hosted or cost-sensitive pipelines, use BGE-M3 (multilingual, 8k context, MIT) or Alibaba GTE-large-en-v1.5. Avoid defaulting to all-MiniLM-L6-v2 in production: it is fast but meaningfully weaker on retrieval. Always benchmark on your own queries; MTEB averages are a weak proxy for domain performance.
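A held-out benchmark does not need much machinery. This is a minimal sketch, assuming you have a handful of real queries labeled with the IDs of the documents that should answer them: it ranks documents by cosine similarity and reports mean recall@k and mean reciprocal rank (MRR). `build_bow_embedder` is a hypothetical bag-of-words stand-in so the example runs offline; in practice you would wrap each candidate model's encode call behind the same `embed(text) -> list[float]` signature and compare the numbers side by side. The documents and queries are invented for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = List[float]
Embedder = Callable[[str], Vector]


def build_bow_embedder(corpus: Sequence[str]) -> Embedder:
    """Hypothetical stand-in for a real embedding model: an L2-normalized
    bag-of-words vector over the corpus vocabulary. It only matches exact
    words, so it cannot capture paraphrases the way a trained model can."""
    vocab = sorted({w for doc in corpus for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}

    def embed(text: str) -> Vector:
        vec = [0.0] * len(vocab)
        for w in text.lower().split():
            if w in index:
                vec[index[w]] += 1.0
        norm = math.sqrt(sum(x * x for x in vec))
        # Texts with no known words stay as the zero vector.
        return [x / norm for x in vec] if norm else vec

    return embed


def evaluate(embed: Embedder, docs: Dict[str, str],
             queries: List[Tuple[str, Set[str]]], k: int = 5) -> Dict[str, float]:
    """Mean recall@k and MRR over labeled queries.
    Assumes unit-length embeddings, so the dot product equals cosine similarity."""
    doc_ids = list(docs)
    doc_vecs = [embed(docs[d]) for d in doc_ids]
    recall_total, rr_total = 0.0, 0.0
    for query, relevant in queries:
        q = embed(query)
        scores = [sum(a * b for a, b in zip(q, v)) for v in doc_vecs]
        # Rank by descending similarity; ties keep corpus order (stable sort).
        order = sorted(range(len(doc_ids)), key=lambda i: scores[i], reverse=True)
        ranked = [doc_ids[i] for i in order]
        # Recall@k: fraction of this query's relevant docs found in the top k.
        recall_total += len(set(ranked[:k]) & relevant) / len(relevant)
        # Reciprocal rank of the first relevant document in the full ranking.
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                rr_total += 1.0 / rank
                break
    n = len(queries)
    return {f"recall@{k}": recall_total / n, "mrr": rr_total / n}


# Hypothetical held-out set: (query, ids of the relevant documents).
SAMPLE_DOCS = {
    "d1": "reset your password from the account settings page",
    "d2": "invoices are emailed on the first day of each month",
    "d3": "api rate limits are 100 requests per minute",
}
SAMPLE_QUERIES = [
    ("how do i reset my password", {"d1"}),
    ("when are invoices emailed", {"d2"}),
    ("what are the api rate limits", {"d3"}),
    ("monthly billing emails", {"d2"}),  # paraphrase with no shared words
]

if __name__ == "__main__":
    embed = build_bow_embedder(list(SAMPLE_DOCS.values()))
    print(evaluate(embed, SAMPLE_DOCS, SAMPLE_QUERIES, k=1))
    # {'recall@1': 0.75, 'mrr': 0.875}
```

The stand-in misses only the paraphrased "monthly billing emails" query, because it matches exact words. That lexical gap is the kind of difference a trained embedding model is meant to close, and a per-query breakdown on your own data shows whether it actually does.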
Journey Context:
Teams often pick the cheapest embedding model and later discover that retrieval is the bottleneck. The MTEB leaderboard shows a wide spread: top models such as NV-Embed-v2, GritLM-7B, and the Voyage models post average scores above 65, while all-MiniLM-L6-v2 sits near 56. But leaderboard position alone is not enough: context-window length (8k vs 32k vs 128k tokens), embedding dimension, Matryoshka support, language coverage, and latency all matter. Open-source models such as BGE-M3 and GTE-large close much of the gap on English retrieval and accept inputs of up to 8k tokens, making them the right call for on-prem or high-volume deployments. Validate the final choice on held-out queries from your own corpus, not on leaderboard rank alone.
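Embedding dimension and Matryoshka support trade directly against storage and search cost. This is a minimal sketch, assuming a model trained with Matryoshka Representation Learning (OpenAI documents a `dimensions` request parameter for its text-embedding-3 models that shortens embeddings server-side): keep the leading coordinates, re-normalize to unit length, and estimate raw vector storage. `truncate_and_renormalize` and `raw_vector_storage_gb` are hypothetical helpers, and the 10-million-vector corpus is an illustrative figure, not a measurement.

```python
import math
from typing import List


def truncate_and_renormalize(vec: List[float], dims: int) -> List[float]:
    """Keep the first `dims` coordinates and rescale to unit length.

    Only meaningful for models trained with Matryoshka Representation
    Learning, where the leading coordinates are trained to work on their
    own; truncating other models' vectors this way can hurt retrieval badly.
    """
    if not 0 < dims <= len(vec):
        raise ValueError(f"dims must be in 1..{len(vec)}, got {dims}")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector is all zeros; cannot normalize")
    return [x / norm for x in head]


def raw_vector_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector bytes only (float32 by default), in decimal GB.
    Ignores index overhead such as HNSW graph links, IDs, and metadata."""
    return num_vectors * dims * bytes_per_value / 1e9


if __name__ == "__main__":
    print(truncate_and_renormalize([3.0, 4.0, 12.0], 2))  # [0.6, 0.8]
    # Hypothetical corpus of 10 million chunks:
    print(raw_vector_storage_gb(10_000_000, 3072))  # 122.88 (full 3,072-dim vectors)
    print(raw_vector_storage_gb(10_000_000, 256))   # 10.24 (truncated to 256 dims)
```

Truncating from 3,072 to 256 dimensions cuts raw storage twelvefold in this example. Whether retrieval quality holds up at the shorter width is a question for the held-out evaluation, not something the arithmetic can answer.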
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T09:55:24.938481+00:00— report_created — created