Report #96401
[cost\_intel] Using text-embedding-3-large for both indexing massive corpora and high-volume retrieval
Index documents with text-embedding-3-small \($0.02 per 1M tokens vs. $0.13 per 1M for large, 6.5x cheaper\) and use two-stage retrieval: the small embedding retrieves the top-20 candidates, then a reranker \(Cohere Rerank or a cross-encoder\) re-sorts them. The indexing saving is $0.11 per 1M tokens, so $11 for a 100M-token corpus and $11,000 for a 100B-token corpus. Reported query quality \(MRR@10\) improves by 8% because the reranker captures query-specific relevance that bi-encoders miss.
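The two-stage flow described above can be sketched as follows. This is a minimal illustration, not a production implementation: every function name is hypothetical, and the two toy scorers only stand in for a real embedding model and a real reranker so that the example runs without network access. In practice the first would be replaced by calls to text-embedding-3-small and the second by Cohere Rerank or a cross-encoder.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List, Sequence


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a cheap bi-encoder: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def toy_rerank(query: str, doc: str) -> float:
    """Hypothetical stand-in for a cross-encoder: scores the pair jointly.

    Returns the fraction of query bigrams that appear, in order, in the doc.
    """
    q, d = query.lower().split(), doc.lower().split()
    q_bigrams = list(zip(q, q[1:]))
    d_bigrams = set(zip(d, d[1:]))
    if not q_bigrams:
        return 0.0
    return sum(bg in d_bigrams for bg in q_bigrams) / len(q_bigrams)


def two_stage_search(
    query: str,
    docs: Sequence[str],
    doc_vectors: Sequence[Counter],
    embed: Callable[[str], Counter],
    rerank: Callable[[str, str], float],
    k_candidates: int = 20,
    k_final: int = 5,
) -> List[str]:
    q_vec = embed(query)
    # Stage 1: cheap vector similarity over the whole index.
    by_similarity = sorted(
        range(len(docs)), key=lambda i: cosine(q_vec, doc_vectors[i]), reverse=True
    )
    candidates = by_similarity[:k_candidates]
    # Stage 2: the more expensive pairwise scorer sees only the candidates.
    by_rerank = sorted(candidates, key=lambda i: rerank(query, docs[i]), reverse=True)
    return [docs[i] for i in by_rerank[:k_final]]


docs = [
    "erosion bank river",
    "river bank erosion control methods",
    "stock market news",
]
doc_vectors = [toy_embed(d) for d in docs]  # computed once, at indexing time
results = two_stage_search(
    "river bank erosion", docs, doc_vectors, toy_embed, toy_rerank,
    k_candidates=2, k_final=2,
)
print(results)
```

The document vectors are computed once at indexing time, which is where the cheaper model pays off. The reranker runs per query, but only on the short candidate list.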
Journey Context:
Using the large model for both indexing and querying is the default choice, but in large-scale RAG the indexing cost dominates, and once paid it is sunk. A small embedding model plus a reranker is a well-established two-stage architecture \(bi-encoder retrieval in the style of 'Dense Passage Retrieval', followed by reranking\). The cost structure: indexing 10M documents \(avg. 2k tokens each\) = 20B tokens. Small: $400. Large: $2,600. The difference in query embedding cost is negligible, since each query is only a few tokens. The quality improvement comes from the reranker's cross-attention between query and document, which a bi-encoder cannot do because it encodes the two separately. This is a 'separation of concerns' pattern: cheap embeddings handle recall, the reranker handles precision.
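The indexing arithmetic above can be checked with a short calculation. This sketch assumes the per-token prices quoted in this report, which may not match current pricing. The function names are hypothetical, and reranker fees are not modeled.

```python
from decimal import Decimal

# USD per 1M tokens, as quoted in this report (assumption: verify before use).
PRICE_PER_MILLION = {
    "text-embedding-3-small": Decimal("0.02"),
    "text-embedding-3-large": Decimal("0.13"),
}


def embedding_cost(total_tokens: int, model: str) -> Decimal:
    """One-off cost in USD of embedding total_tokens with the given model."""
    return Decimal(total_tokens) / Decimal(1_000_000) * PRICE_PER_MILLION[model]


def indexing_savings(total_tokens: int) -> Decimal:
    """USD saved by indexing with the small model instead of the large one."""
    return embedding_cost(total_tokens, "text-embedding-3-large") - embedding_cost(
        total_tokens, "text-embedding-3-small"
    )


corpus_tokens = 10_000_000 * 2_000  # 10M documents at 2k tokens each = 20B tokens
print("small:", embedding_cost(corpus_tokens, "text-embedding-3-small"))
print("large:", embedding_cost(corpus_tokens, "text-embedding-3-large"))
print("saved:", indexing_savings(corpus_tokens))
```

Decimal is used instead of float so that the dollar amounts compare exactly.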
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:23:34.639242+00:00— report_created — created