Report #400
[research] Which embedding model should I use for code retrieval and production RAG?
For multilingual production RAG, default to BGE-M3 (MIT license) because it bundles dense, sparse, and multi-vector retrieval in one model. For maximum accuracy on English-only corpora, check the MTEB leaderboard; NV-Embed-v2 and similar LLM-backbone models often top it, but confirm that their licenses permit commercial use. For code retrieval, Qwen3-Embedding-4B/8B and jina-embeddings-v2-base-code are strong open options at the time of writing. Always pair your retriever with a small reranker (BGE-reranker-v2 or Cohere Rerank 3.5) rather than hoping a single embedding model will do both jobs.
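To make the dense-plus-sparse idea concrete, here is a minimal sketch of merging two ranked result lists with reciprocal rank fusion (RRF). RRF is a generic fusion technique swapped in for illustration; it is not necessarily how BGE-M3 or any particular vector store combines its scores. The document ids, the two input rankings, and the constant k=60 are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Each doc receives 1 / (k + rank) from every list it appears in.
    k=60 is a commonly used default, assumed here rather than tuned.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a dense and a sparse retriever for one query.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

Rank-based fusion avoids having to calibrate dense and sparse scores onto a common scale, which is why it is a reasonable first thing to try before weighted score sums.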
Journey Context:
Teams still pick embeddings by name recognition, but the leaderboard has split into specialist regimes. BGE-M3 is the permissive workhorse: one checkpoint handles many languages and hybrid search. LLM-backbone embedders (NV-Embed, E5-mistral, GritLM) dominate English retrieval but are larger and often license-restricted. Code retrieval has its own leaderboard (MTEB Code) because natural-language encoders miss syntax and identifier semantics. The biggest practical gain is usually reranking: a cheap cross-encoder rescores the top-k from a fast dense retriever and lifts end-to-end accuracy more than swapping embedders alone.
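The retrieve-then-rerank flow described above can be sketched in plain Python. The scorers here are toy stand-ins, not real models: toy_embed is a bag-of-words counter standing in for a dense embedder, and toy_cross_score counts shared word pairs to mimic a cross-encoder that reads query and document together. All function names, the sample documents, and the top_k value are hypothetical; in a real system the two callables would wrap an embedding model and a reranker.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a dense embedder: bag-of-words counts (ignores word order)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def toy_cross_score(query, doc):
    """Stand-in for a cross-encoder: fraction of query bigrams found in the doc."""
    q, d = query.lower().split(), doc.lower().split()
    q_bigrams = set(zip(q, q[1:]))
    d_bigrams = set(zip(d, d[1:]))
    return len(q_bigrams & d_bigrams) / len(q_bigrams) if q_bigrams else 0.0


def retrieve_then_rerank(query, docs, embed, cross_score, top_k=2):
    """Stage 1: cheap vector search keeps top_k. Stage 2: rescore only those."""
    q_vec = embed(query)
    candidates = sorted(docs, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
    candidates = candidates[:top_k]
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)


docs = [
    "json file parse",
    "how to parse json file in python",
    "bake bread at home",
]
reranked = retrieve_then_rerank("parse json file", docs, toy_embed, toy_cross_score)
print(reranked)
```

In this toy data the first stage prefers the scrambled document because it matches the query word for word, while the second stage promotes the document that preserves the query's word order. The expensive scorer only ever sees top_k candidates, which is what keeps the second stage cheap.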
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T06:44:42.541256+00:00— report_created — created