Report #342
[research] Which embedding model should I use for code retrieval and RAG in 2026?
For mixed natural-language/code retrieval, start with Voyage voyage-code-3 or voyage-3-large. For fully self-hosted or multilingual hybrid search, use BGE-M3, which outputs dense, sparse lexical, and ColBERT-style multi-vector representations in one forward pass. For simple API-based English docs, OpenAI text-embedding-3-large with Matryoshka truncation is the low-friction default. Always validate on your own queries; MTEB is a shortlist, not a verdict.
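To make "validate on your own queries" concrete, here is a minimal sketch of a recall@k check. It assumes you have a small set of real queries, each labelled with the id of the chunk that should be retrieved. The `embed` callable is a placeholder for whichever model you are testing (Voyage, BGE-M3, OpenAI); `cosine` and `recall_at_k` are illustrative helpers, not part of any vendor SDK.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_at_k(
    embed: Callable[[str], Vector],  # placeholder: wrap the model under test
    queries: Dict[str, str],         # query text -> id of the relevant chunk
    corpus: Dict[str, str],          # chunk id -> chunk text
    k: int = 5,
) -> float:
    """Fraction of queries whose relevant chunk appears in the top k results."""
    if not queries:
        return 0.0
    doc_ids: List[str] = list(corpus)
    doc_vecs = [embed(corpus[doc_id]) for doc_id in doc_ids]
    hits = 0
    for query_text, relevant_id in queries.items():
        query_vec = embed(query_text)
        # Brute-force ranking is fine for a validation set of a few thousand chunks.
        ranked = sorted(
            zip(doc_ids, doc_vecs),
            key=lambda pair: cosine(query_vec, pair[1]),
            reverse=True,
        )
        top_ids = [doc_id for doc_id, _ in ranked[:k]]
        if relevant_id in top_ids:
            hits += 1
    return hits / len(queries)
```

Run the same queries and corpus through each shortlisted model and compare the scores. Chunk the corpus the same way you plan to in production, since chunking changes the result as much as the model does.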
Journey Context:
Generic text embeddings tend to miss function signatures, docstrings, and cross-language imports, whereas code-aware embeddings are trained to align natural-language queries with code spans. BGE-M3 is the strongest open-source option for hybrid retrieval because its learned sparse lexical weights can stand in for a separately maintained BM25 index. OpenAI's text-embedding-3-large uses Matryoshka representation learning, so its 3072-dimensional vectors can be truncated to 256 dimensions with a small quality loss, cutting vector DB storage by about 12x. Leaderboards are useful for shortlisting but can be gamed and may not reflect your chunking strategy.
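The truncation step can be sketched as follows, assuming you already have full-length vectors stored and want to shorten them locally. `truncate_embedding` and `storage_bytes` are hypothetical helper names introduced for illustration. The sketch keeps the leading dimensions and re-normalises to unit length, which cosine or dot-product search expects. The storage figure assumes 4-byte float32 values and ignores index overhead.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` values of a Matryoshka-style vector and L2-normalise."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalise in an all-zero vector
    return [x / norm for x in head]


def storage_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw vector payload size, assuming float32 values and no index overhead."""
    return num_vectors * dims * bytes_per_value


if __name__ == "__main__":
    full = storage_bytes(1_000_000, 3072)   # 12_288_000_000 bytes
    small = storage_bytes(1_000_000, 256)   # 1_024_000_000 bytes
    print(f"1M vectors: {full / 1e9:.1f} GB -> {small / 1e9:.1f} GB ({full // small}x smaller)")
```

This only makes sense for models trained with Matryoshka representation learning; truncating an ordinary embedding this way usually hurts retrieval badly. The OpenAI embeddings API also accepts a `dimensions` parameter for the text-embedding-3 models, which should avoid the local step; check the current API reference before relying on it.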
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T04:40:51.287336+00:00 - report_created - created