Report #2539
[research] Which embedding model should I use for code/RAG in 2025?
For multilingual retrieval at scale, use Gemini Embedding or the open Qwen3-Embedding-8B (Apache 2.0). For small, self-hosted retrieval, use BGE-M3 (dense + sparse + multi-vector, 100+ languages) or Jina v3/v4. For English-only budget cases, Nomic Embed v1.5 or all-MiniLM-L6-v2 are still reasonable. Always benchmark on your own data; MTEB rankings do not guarantee domain performance.
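One way to act on the "benchmark on your own data" advice is to compute recall@k over a small labeled set of queries from your own domain. The sketch below is a minimal illustration, not a full evaluation harness: the function names (`cosine`, `recall_at_k`) and the toy 2-dimensional vectors are made up for this example, and in practice the vectors would come from whichever candidate model you are comparing.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(query_vecs, doc_vecs, relevant, k):
    """Mean fraction of each query's relevant docs that appear in its top-k.

    relevant[i] is the set of doc indices judged relevant for query i.
    Queries with no relevant docs are skipped.
    """
    scores = []
    for qi, q in enumerate(query_vecs):
        # Rank all documents by similarity to the query, best first.
        ranked = sorted(
            range(len(doc_vecs)),
            key=lambda d: cosine(q, doc_vecs[d]),
            reverse=True,
        )
        top = set(ranked[:k])
        rel = relevant[qi]
        if rel:
            scores.append(len(top & rel) / len(rel))
    return sum(scores) / len(scores) if scores else 0.0


# Toy stand-ins for embeddings produced by a candidate model (hypothetical data).
docs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[0.9, 0.1], [0.1, 0.9]]
relevant = [{0}, {2}]

print(recall_at_k(queries, docs, relevant, k=1))  # 0.5
print(recall_at_k(queries, docs, relevant, k=2))  # 1.0
```

Running the same function over embeddings from each candidate model, with the same queries and relevance labels, gives a like-for-like comparison that does not depend on public leaderboard rankings.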
Journey Context:
The field has moved beyond default sentence-transformers models. BGE-M3's multi-granularity retrieval (including ColBERT-style late interaction) helps with long documents and lexical matching. Gemini Embedding leads public multilingual MTEB retrieval, while Qwen3-Embedding offers a strong open alternative. The trap is assuming one embedding model fits all domains; code, legal, and medical retrieval often gain 10-30% recall from domain-specific fine-tuning. Dimensionality tradeoffs matter too: Matryoshka embeddings let you truncate vectors to fewer dimensions for cheaper storage and faster search, with only a small accuracy loss.
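The Matryoshka tradeoff can be sketched as keeping the leading dimensions of a vector and re-normalizing it. This is a minimal illustration under assumptions: `truncate_embedding` and `float32_bytes` are hypothetical helper names, the input vector is toy data, and truncation like this is only meaningful for models trained with Matryoshka representation learning (check the model card before relying on it).

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and L2-normalize the result.

    Assumes the model was trained so that leading dimensions carry the
    most information (Matryoshka-style); otherwise accuracy may drop sharply.
    """
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else list(head)


def float32_bytes(dims):
    """Raw storage per vector at 4 bytes per float32 component."""
    return dims * 4


full = [3.0, 4.0, 12.0]  # toy vector, not real model output
print(truncate_embedding(full, 2))  # [0.6, 0.8]
print(float32_bytes(1024), float32_bytes(256))  # 4096 1024
```

Re-normalizing after truncation keeps cosine and dot-product scores comparable across the shortened vectors. Whether the accuracy loss is acceptable at a given size is best checked on your own retrieval data.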
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T12:53:22.325456+00:00— report_created — created