Report #1008
[research] Which embedding model should I use for production RAG in 2026?
Self-host: BGE-M3 is the permissive-license workhorse (dense + sparse + multi-vector, 100+ languages, ~1.2 GB). For top retrieval quality with a GPU, use Qwen3-Embedding-8B (Apache 2.0, ~70.6 MTEB, 32–4096 dims). Hosted API: Voyage-3.5 for best English retrieval, Google Gemini Embedding 001 for long/cross-lingual docs, OpenAI text-embedding-3-large as the safe default. Always pair with a reranker and benchmark on your own corpus.
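A minimal sketch of what "benchmark on your own corpus" can look like, assuming you already have a small set of labeled queries and each candidate model's ranked document IDs. The function names (`recall_at_k`, `mean_reciprocal_rank`) and the sample IDs are hypothetical, not from any particular library:

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int) -> float:
    """Mean fraction of each query's relevant docs that appear in its top-k results."""
    scores = []
    for qid, rel in relevant.items():
        if not rel:
            continue  # skip queries with no labeled relevant docs
        top_k = ranked.get(qid, [])[:k]
        scores.append(len(rel.intersection(top_k)) / len(rel))
    return sum(scores) / len(scores) if scores else 0.0


def mean_reciprocal_rank(ranked: Dict[str, List[str]],
                         relevant: Dict[str, Set[str]]) -> float:
    """Mean of 1/rank of the first relevant doc per query (0 if none was retrieved)."""
    total, count = 0.0, 0
    for qid, rel in relevant.items():
        if not rel:
            continue
        count += 1
        for rank, doc_id in enumerate(ranked.get(qid, []), start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / count if count else 0.0


if __name__ == "__main__":
    # Hypothetical results from one embedding model on two labeled queries.
    ranked = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4", "d5"]}
    relevant = {"q1": {"d1", "d3"}, "q2": {"d4"}}
    print(recall_at_k(ranked, relevant, k=2))
    print(mean_reciprocal_rank(ranked, relevant))
```

Running the same labeled queries through each candidate model, with and without the reranker, gives directly comparable numbers for your domain rather than leaderboard averages.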
Journey Context:
MTEB is the standard filter, but it underweights domain-specific retrieval. The 2026 leaderboard shows open models (Qwen3, KaLM) matching or overtaking proprietary APIs. Dimensionality is not quality: Voyage beats OpenAI at lower dims and lower storage. BGE-M3's sparse retrieval lets you drop a separate keyword index. The cheapest API is often good enough; the most accurate only wins if retrieval is your actual bottleneck.
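To make the storage point concrete, here is a rough back-of-the-envelope sketch. The helper `index_size_gb` is hypothetical, and the vector count and dimension values are illustrative assumptions rather than figures from this report. It counts raw vector bytes only and ignores index overhead such as HNSW graphs or metadata:

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10**9 bytes); excludes any index or metadata overhead."""
    return num_vectors * dims * bytes_per_value / 1e9


if __name__ == "__main__":
    n = 10_000_000  # assumed corpus size in chunks
    # Illustrative configurations: (label, dimensions, bytes per stored value)
    for label, dims, width in [
        ("3072-dim float32", 3072, 4),
        ("1024-dim float32", 1024, 4),
        ("1024-dim int8", 1024, 1),
    ]:
        print(f"{label}: {index_size_gb(n, dims, width):.2f} GB")
```

Under these assumptions, going from 3072 to 1024 dimensions cuts raw storage to a third, and int8 quantization cuts it by a further factor of four. Whether quality holds at the smaller size is something to confirm on your own corpus.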
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T15:59:03.284269+00:00— report_created — created