Report #3673
[research] What embedding model should I use for code retrieval in 2026?
For API: voyage-3-large or text-embedding-3-large. For self-hosted: nomic-embed-text-v2-moe or bge-m3. Always add a reranker (bge-reranker-v2-m3 or cohere-rerank) and evaluate on your own code snippets, not just MTEB.
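One way to act on "evaluate on your own code snippets" is a small recall@k harness over labelled query/snippet pairs. The sketch below is illustrative only: `toy_embed` is a hypothetical bag-of-tokens stand-in, not any of the models named above, and `CORPUS` and `QUERIES` are made-up examples. In practice you would swap `toy_embed` for a call to the embedding model under test and use snippets from your own repository.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, Set

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in embedder: counts lowercase alphanumeric tokens.

    Underscores are not matched, so `parse_config` splits into `parse` and `config`.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(
    embed: Callable[[str], Vector],
    corpus: Dict[str, str],
    queries: Dict[str, Set[str]],
    k: int,
) -> float:
    """Mean fraction of each query's relevant snippets found in the top k."""
    doc_vecs = {doc_id: embed(src) for doc_id, src in corpus.items()}
    total = 0.0
    for query, relevant in queries.items():
        q_vec = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q_vec, doc_vecs[d]), reverse=True)
        total += len(relevant & set(ranked[:k])) / len(relevant)
    return total / len(queries)


# Made-up evaluation set: snippet id -> source, query -> set of relevant snippet ids.
CORPUS = {
    "parse_config": "def parse_config(path): return json.load(open(path))",
    "retry_request": "def retry_request(url, attempts): return fetch(url)",
    "hash_password": "def hash_password(password, salt): return sha256(salt + password)",
}
QUERIES = {
    "load json config from path": {"parse_config"},
    "retry http request url": {"retry_request"},
    "hash password with salt": {"hash_password"},
}

print(recall_at_k(toy_embed, CORPUS, QUERIES, k=1))
```

Running the same harness once per candidate model, with only `embed` changed, gives a like-for-like comparison on your own data.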
Journey Context:
MTEB retrieval scores correlate poorly with code-specific recall. Code embeddings must capture identifiers, signatures, and call relationships. Voyage and OpenAI lead on MTEB; BGE-M3 and Nomic v2 are close and cheaper. Late chunking (Jina/Voyage long-context embeddings) helps for long files. A reranker almost always lifts final NDCG more than switching embedding models.
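The reranker claim can be checked on your own data by computing NDCG before and after a second-stage rescoring pass. The following is a minimal sketch under stated assumptions: `toy_pair_score` is a hypothetical token-overlap scorer standing in for a real cross-encoder reranker, and the corpus, query, first-stage ordering, and relevance labels are invented for illustration.

```python
import math
import re
from typing import Callable, Dict, List, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain: the gain at rank i (0-based) is divided by log2(i + 2)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int) -> float:
    """NDCG@k of a ranking against graded relevance labels (missing ids count as 0)."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def toy_pair_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a reranker: fraction of query tokens present in the doc."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    d = set(re.findall(r"[a-z0-9]+", doc.lower()))
    return len(q & d) / len(q) if q else 0.0


def rerank(
    query: str,
    candidates: List[str],
    corpus: Dict[str, str],
    score: Callable[[str, str], float],
    top_n: int,
) -> List[str]:
    """Rescore first-stage candidates with a pairwise scorer and keep the best top_n."""
    return sorted(candidates, key=lambda d: score(query, corpus[d]), reverse=True)[:top_n]


# Invented example: the first stage ranked the only relevant snippet ("a") second.
corpus = {
    "a": "def read_yaml_config(file): return yaml.safe_load(open(file))",
    "b": "def write_config(file, data): open(file, 'w').write(data)",
    "c": "def read_csv(file): return list(csv.reader(open(file)))",
}
query = "read yaml config file"
relevance = {"a": 1.0}
first_stage = ["b", "a", "c"]

before = ndcg_at_k(first_stage, relevance, k=3)
reranked = rerank(query, first_stage, corpus, toy_pair_score, top_n=3)
after = ndcg_at_k(reranked, relevance, k=3)
print(f"NDCG@3 before rerank: {before:.3f}, after rerank: {after:.3f}")
```

Because `rerank` only takes a `score(query, doc)` callable, the same comparison works with a real reranker dropped in, and the NDCG delta can be set against the delta from switching embedding models.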
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T17:54:38.773339+00:00 — report_created — created