Report #465
[research] Which embedding model should I use for RAG in 2026?
Start with a fast, low-ops baseline such as BAAI/bge-m3 or intfloat/multilingual-e5-large-instruct. Upgrade to Qwen3-Embedding-4B/8B if you need stronger multilingual or long-context retrieval. Treat KaLM-Embedding-Gemma3-12B and similar leaderboard leaders as quality ceilings, not defaults, because of memory, indexing cost, and custom licenses. Always validate the final choice on your own retrieval corpus.
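One way to validate a candidate model on your own corpus is to measure recall@k over a small labeled set of query/document pairs. The sketch below is a minimal, dependency-free illustration, not a full evaluation harness. The function names (`cosine`, `recall_at_k`) and the input layout (one labeled relevant document index per query) are assumptions made for this example. The vectors would come from whichever embedding model you are comparing.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def recall_at_k(query_vecs, doc_vecs, relevant_ids, k=5):
    """Fraction of queries whose labeled relevant document ranks in the top k.

    Assumption for this sketch: relevant_ids[i] is the index in doc_vecs
    of the single relevant document for query i.
    """
    hits = 0
    for q, rel in zip(query_vecs, relevant_ids):
        # Rank all document indices by similarity to the query, best first.
        ranked = sorted(
            range(len(doc_vecs)),
            key=lambda i: cosine(q, doc_vecs[i]),
            reverse=True,
        )
        if rel in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)
```

Running the same labeled queries through each shortlisted model and comparing recall@k (alongside latency and index size) gives a like-for-like basis for the final choice. Brute-force ranking is fine for a few thousand documents; larger corpora would need a vector index.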
Journey Context:
MTEB is the standard public leaderboard for building a shortlist, but aggregate scores hide language mix, latency, and license constraints. In 2025-2026 the gap between open-weight and API models reversed: Qwen3-Embedding-8B outscores many API models on MTEB (~70.6) and is licensed under Apache 2.0, while the smaller Qwen3-Embedding-0.6B is surprisingly capable for high-volume serving. For most production RAG pipelines, 768-1024 dimensions are the sweet spot, and Matryoshka representations let you give up 2-3% recall in exchange for 4x storage savings. A common mistake is defaulting to the highest-MTEB model; a quantized bge-m3 or e5-large-instruct plus a reranker often yields better end-to-end accuracy per dollar.
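The Matryoshka storage trade-off can be sketched as follows: keep only the leading dimensions of each vector, re-normalize, and compare index sizes. This is an illustrative sketch under stated assumptions. The helper names (`truncate_embedding`, `index_size_bytes`), the 1024-to-256 truncation, and float32 storage are all hypothetical choices for this example. Truncation is only expected to preserve quality for models trained with Matryoshka representation learning, and the actual recall loss should be measured on your own data.

```python
import math

def truncate_embedding(vec, dims):
    """Keep the first `dims` values and re-normalize to unit length.

    Assumes the model was trained with Matryoshka representation learning,
    so the leading dimensions carry most of the signal.
    """
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else list(head)

def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage, assuming float32 (4 bytes) per value by default."""
    return num_vectors * dims * bytes_per_value

# Hypothetical corpus of one million chunks.
full = index_size_bytes(1_000_000, 1024)    # 4,096,000,000 bytes
small = index_size_bytes(1_000_000, 256)    # 1,024,000,000 bytes
savings_factor = full // small              # 4
```

Going from 1024 to 256 dimensions is what produces the 4x figure; the size estimate covers raw vectors only and ignores index overhead and metadata.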
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T07:58:46.474083+00:00— report_created — created