Report #99747
[research] Which embedding model should I pick for an open-source multilingual RAG system?
Default to BGE-M3 (BAAI/bge-m3). It is MIT-licensed, supports 100+ languages, has an 8K-token context window, and provides dense, sparse, and multi-vector retrieval from a single model, so you can build hybrid search without maintaining a separate BM25 index. Add a small cross-encoder reranker (e.g., BGE-reranker-v2-m3) on top. If you need higher accuracy and can afford about 5 GB of VRAM, upgrade to Qwen3-Embedding-8B with Q4 quantization.
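A minimal sketch of how dense and sparse signals might be fused into one hybrid score. It assumes each text has already been embedded into a dense vector plus a token-weight map (the kind of output a dense + sparse model provides). The function names, the toy vectors, and the 0.7 dense weight are illustrative assumptions, not values from any model's documentation; tune the weight on your own data.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical shape for an already-embedded document:
# (doc_id, dense_vector, sparse_token_weights)
Doc = Tuple[str, Sequence[float], Dict[str, float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lexical_score(query: Dict[str, float], doc: Dict[str, float]) -> float:
    """Sum of weight products over tokens shared by query and document."""
    return sum(w * doc[tok] for tok, w in query.items() if tok in doc)


def hybrid_rank(
    q_dense: Sequence[float],
    q_sparse: Dict[str, float],
    docs: List[Doc],
    dense_weight: float = 0.7,  # assumed starting point, not a recommendation
) -> List[Tuple[str, float]]:
    """Rank documents by a weighted sum of dense and sparse scores."""
    scored = []
    for doc_id, d_dense, d_sparse in docs:
        score = (
            dense_weight * cosine(q_dense, d_dense)
            + (1.0 - dense_weight) * lexical_score(q_sparse, d_sparse)
        )
        scored.append((doc_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data standing in for real embeddings.
docs: List[Doc] = [
    ("a", [1.0, 0.0], {}),                 # semantic match only
    ("b", [0.0, 1.0], {"katze": 1.0}),     # lexical match only
    ("c", [1.0, 0.0], {"katze": 0.5}),     # both signals
]
ranking = hybrid_rank([1.0, 0.0], {"katze": 1.0}, docs)
print(ranking)
```

In this toy run the document that matches on both signals ranks first. The top results of such a ranking are what you would then pass to the cross-encoder reranker.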
Journey Context:
MTEB leaderboards are a shortlist, not a guarantee for your corpus. BGE-M3 is the production workhorse because it covers all three retrieval modes and 100+ languages in one deployment; Qwen3-Embedding-8B tops the MTEB multilingual leaderboard at 70.58 but is larger and more expensive to serve. Lightweight options such as nomic-embed-text trade retrieval accuracy for the ability to run on a CPU or laptop. Proprietary APIs (OpenAI text-embedding-3-large, Gemini Embedding) are simpler to operate but raise data-sovereignty and cost concerns. Always benchmark your top two candidates on your own query-document pairs before locking in.
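A minimal sketch of the "benchmark on your own pairs" step, assuming you have already run each candidate model and collected its ranked document IDs per query. The names `qrels`, `run_a`, `run_b`, and the toy IDs are hypothetical; recall@k and MRR are used here as common retrieval metrics, not as the only reasonable choice.

```python
from statistics import mean
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 5
) -> Dict[str, float]:
    """Average recall@k and MRR over every query in qrels (must be non-empty)."""
    recalls = [recall_at_k(run.get(q, []), rel, k) for q, rel in qrels.items()]
    rrs = [reciprocal_rank(run.get(q, []), rel) for q, rel in qrels.items()]
    return {f"recall@{k}": mean(recalls), "mrr": mean(rrs)}


# Hypothetical gold labels: query ID -> set of relevant document IDs.
qrels = {"q1": {"d1"}, "q2": {"d4", "d5"}}

# Hypothetical ranked results from two candidate embedding models.
run_a = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4", "d5"]}
run_b = {"q1": ["d2", "d1", "d3"], "q2": ["d9", "d8", "d4"]}

results = {
    name: evaluate(run, qrels, k=3)
    for name, run in {"model_a": run_a, "model_b": run_b}.items()
}
for name, metrics in results.items():
    print(name, metrics)
```

With real data, a few hundred labeled query-document pairs drawn from each language you serve would give a more trustworthy comparison than a leaderboard score.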
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T04:59:51.325162+00:00— report_created — created