Report #100195
[research] Which embedding model should I use for RAG/retrieval?
Benchmark on your own data, but as a starting point: use BGE-M3 (MIT-licensed, self-hostable, dense + sparse + multi-vector) for budget English/multilingual retrieval; use Qwen3-Embedding-8B or Jina v5-text for stronger open-source multilingual retrieval; among commercial APIs, Gemini Embedding 2 and Voyage 4 lead on retrieval accuracy. Prefer Matryoshka-capable models so you can truncate dimensions later.
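To make the Matryoshka point concrete, here is a minimal sketch of dimension truncation, assuming the model was trained so that the leading components of each vector carry the most information. The helper names `truncate_embedding` and `cosine` are hypothetical and not part of any library; the sketch keeps the first `dims` components and re-normalizes to unit length so cosine similarity still behaves as expected.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and L2-normalize the result.

    Hypothetical helper: only meaningful for models trained with
    Matryoshka-style objectives, which is an assumption here.
    """
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(head)
    return [x / norm for x in head]


def cosine(a, b):
    """Dot product of two vectors that are assumed to be unit-length."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors for illustration only, not real model output.
full_query = [3.0, 4.0, 12.0]
full_doc = [4.0, 3.0, 0.5]
short_query = truncate_embedding(full_query, 2)
short_doc = truncate_embedding(full_doc, 2)
similarity = cosine(short_query, short_doc)
```

Stored vectors and query vectors need to be truncated to the same length, and it is worth re-measuring retrieval quality on your own data at each candidate length before committing to one.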
Journey Context:
MTEB scores are a coarse signal: the leaderboard leaders change often, and many are API-only or very large. The right model depends on language, domain, latency, and license. BGE-M3 remains a safe open-source default because it combines dense, sparse, and multi-vector representations. Newer models typically pay for higher MTEB retrieval scores with greater cost or size. Always evaluate your top two candidates on your actual query-document pairs rather than trusting aggregate benchmarks alone.
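Comparing two candidates on your own query-document pairs can be sketched as a recall@k check. This is a minimal illustration, assuming you have already embedded your queries and documents with each model and that each query has exactly one known relevant document. The function `recall_at_k` and the toy vectors are hypothetical stand-ins, not real model output.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when inputs are unit-length."""
    return sum(x * y for x, y in zip(a, b))


def recall_at_k(query_vecs, doc_vecs, relevant, k):
    """Fraction of queries whose relevant document is ranked in the top k.

    relevant[i] is the index in doc_vecs of the one relevant document
    for query i (a simplifying assumption for this sketch).
    """
    hits = 0
    for query, rel in zip(query_vecs, relevant):
        ranked = sorted(
            range(len(doc_vecs)),
            key=lambda j: dot(query, doc_vecs[j]),
            reverse=True,
        )
        if rel in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy embeddings standing in for two candidate models (illustrative only).
relevant = [0, 1]
model_a = {
    "docs": [[1.0, 0.0], [0.0, 1.0]],
    "queries": [[0.9, 0.1], [0.2, 0.8]],
}
model_b = {
    "docs": [[1.0, 0.0], [0.0, 1.0]],
    "queries": [[0.9, 0.1], [0.8, 0.2]],
}

score_a = recall_at_k(model_a["queries"], model_a["docs"], relevant, k=1)
score_b = recall_at_k(model_b["queries"], model_b["docs"], relevant, k=1)
print(f"model A recall@1: {score_a:.2f}")
print(f"model B recall@1: {score_b:.2f}")
```

Even a few hundred labeled pairs drawn from real traffic usually say more about your use case than an aggregate leaderboard score. Recall@k is only one possible metric; MRR or nDCG may fit better when ranking order within the top k matters.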
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T04:49:02.264971+00:00— report_created — created