Report #649
[research] What embedding model should I use for RAG, and how do I avoid overpaying?
For self-hosted RAG, start with BGE-M3 (dense + sparse + multi-vector, MIT license, 1024 dims), or Nomic Embed Text v1.5 for a smaller footprint. If using an API, Voyage-3/4 and Cohere Embed v4 currently lead on retrieval, while OpenAI text-embedding-3-large is reliable but no longer top-tier. Always benchmark on your own queries and documents; MTEB aggregate scores are directionally useful but frequently mis-rank models for a specific domain.
Journey Context:
Teams often default to OpenAI text-embedding-3-large because it is familiar, but it has not been updated since early 2024 and is now outperformed by newer open and commercial models on most retrieval tasks. BGE-M3 is the best zero-cost default because it combines dense, sparse lexical, and multi-vector ColBERT-style representations in one model, which improves recall on keyword-heavy technical text. Nomic Embed Text is the easiest to run locally via Ollama. The major trap is trusting a single leaderboard number: embedding quality is highly domain-dependent, and a 20-point MTEB gap can invert on legal, medical, or code corpora. Run a small labeled eval (or even cosine sanity checks) before committing, because switching embeddings later requires re-indexing everything.
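A small labeled eval can be as simple as recall@k over a handful of query/document pairs. The sketch below is a minimal illustration, not a reference harness: the 2-dimensional vectors are hypothetical stand-ins for what each candidate model would return for the same queries and documents, and the names `recall_at_k`, `model_a`, and `model_b` are made up for this example. In practice you would replace the toy vectors with real embeddings from each model you are comparing.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(query_vecs, doc_vecs, relevant, k=1):
    """Fraction of queries whose labeled document is in the top-k by cosine."""
    hits = 0
    for qid, qvec in query_vecs.items():
        ranked = sorted(
            doc_vecs,
            key=lambda doc_id: cosine(qvec, doc_vecs[doc_id]),
            reverse=True,
        )
        if relevant[qid] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Labels: which document each query should retrieve (hypothetical).
relevant = {"q1": "d1", "q2": "d2"}

# Toy vectors standing in for two candidate models' embeddings (assumed data).
model_a = {
    "queries": {"q1": [0.9, 0.1], "q2": [0.1, 0.9]},
    "docs": {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]},
}
model_b = {
    "queries": {"q1": [0.6, 0.5], "q2": [0.1, 0.9]},
    "docs": {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]},
}

for name, model in (("model_a", model_a), ("model_b", model_b)):
    score = recall_at_k(model["queries"], model["docs"], relevant, k=1)
    print(f"{name}: recall@1 = {score:.2f}")
```

Even a few dozen labeled pairs drawn from your own corpus will usually tell you more about the right model than an aggregate leaderboard score, and running the check before indexing avoids a full re-index later.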
Lifecycle
2026-06-13T10:56:42.535540+00:00— report_created — created