Report #802
[research] Which embedding model should I use for code search and technical RAG in 2026?
For code and technical docs, prefer a code-specific embedding model. Voyage AI voyage-code-3 is the strongest API option for code retrieval (it outperforms OpenAI text-embedding-3-large by ~13.8% on code retrieval suites and supports a 32K context plus Matryoshka dimensions). If you must self-host, use Alibaba Qwen3-Embedding-8B for the best overall code + multilingual retrieval, or BGE-M3 for a single-model hybrid dense/sparse setup under an MIT license. For CPU/edge local inference, use Nomic Embed v2. Avoid defaulting to OpenAI text-embedding-3-large for code: it is a safe generalist but not top-tier on technical retrieval.
Journey Context:
General-purpose embeddings understand natural language well but miss syntax, naming conventions, and code structure. Code-specific embeddings are trained to align text-to-code, code-to-code, and docstring-to-code pairs, which is what a coding agent actually needs. voyage-code-3 is the current retrieval specialist, with lower-dimensional and quantized variants to cut storage. Open-source options caught up in 2025–2026: Qwen3-Embedding-8B tops the MTEB average, and BGE-M3 packs dense, sparse, and multi-vector retrieval into one small model. Dimensionality matters: 768–1024 dimensions usually give the best precision/cost tradeoff, and Matryoshka-capable models (OpenAI, Cohere, Jina, Voyage) let you reduce dimensions with only a 2–3% loss in retrieval quality. If your data is multilingual, Qwen3-Embedding and Google Gemini Embedding lead on non-English European and Asian languages.
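To make the dimensionality tradeoff concrete, here is a minimal sketch of client-side Matryoshka truncation plus a raw storage estimate. It assumes the vectors come from a model trained with Matryoshka representation learning; truncating other embeddings this way will generally hurt quality much more. The function names `truncate_embedding` and `index_size_bytes` are hypothetical helpers, not part of any provider's SDK, and the storage figure ignores index overhead.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit length.

    Assumption: `vec` comes from a Matryoshka-trained model, so the leading
    components carry most of the retrieval signal.
    """
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]


def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage only (float32 = 4 bytes per value, int8 = 1)."""
    return num_vectors * dims * bytes_per_value


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0]              # toy 3-dimensional "embedding"
    short = truncate_embedding(full, 2)  # keep the first 2 dimensions
    print(short)

    # One million chunks at 3072 vs 1024 float32 dimensions.
    print(index_size_bytes(1_000_000, 3072))
    print(index_size_bytes(1_000_000, 1024))
```

Renormalizing after truncation keeps dot-product and cosine scores comparable across stored vectors. Whatever dimension is chosen, queries and documents need to be truncated to the same length, and it is worth measuring the quality loss on your own retrieval set.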
Lifecycle
2026-06-13T12:58:35.784690+00:00— report_created — created