Report #36942
[cost_intel] Wasting reasoning capacity on simple RAG retrieval ranking
Use embedding models with vector similarity for initial retrieval; use reasoning models only for re-ranking when queries contain boolean constraints or negations that vector search fails to capture
Journey Context:
RAG pipelines often send the top-k retrieved chunks to a reasoning model for relevance ranking, which costs $0.50-$2.00 per query and is usually unnecessary. Vector embeddings (text-embedding-3-large) capture semantic similarity at about $0.0001 per query, with 90%+ recall on straightforward semantic queries.

Vector similarity fails on logical structure, not on semantics. The problem queries contain negations ('papers NOT about CNNs'), boolean constraints ('Transformer architectures AND training efficiency'), or superlatives with a constraint ('most recent paper before 2023'). For the negated query, vector search still returns CNN papers: the query embedding stays close to CNN-related content (the vector for 'neural networks' is close to 'CNN'), so the NOT is never enforced. This is where reasoning models earn their cost. They parse the query's logical structure, take the candidate set already retrieved by vector search, and apply the boolean filters with explicit chain-of-thought verification ('This paper mentions ResNet, which is a CNN variant, therefore exclude').

The architecture: vector search for the top-20 candidates (cheap); a cheap LLM re-ranker in the cross-encoder role for initial ranking (GPT-4o, $0.005); then a reasoning model only if the query parser detects negations or booleans (expensive, about 5% of queries). This reduces cost by about 95% relative to sending every query to a reasoning model, while improving precision on logical queries by 15-20% compared to pure vector search.
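The routing step and the cost claim above can be sketched in a few lines of Python. This is a minimal illustration, not the report author's implementation: the regex patterns, the stage names (`vector_top20`, `cheap_rerank`, `reasoning_filter`), and the helper functions are hypothetical, and a real query parser would need tuning against actual query logs. The prices and the 5% escalation rate are the figures quoted in this report.

```python
import re

# Hypothetical trigger patterns (assumptions, not from the report).
NEGATION = re.compile(r"\b(not|without|except|excluding)\b", re.IGNORECASE)
# Uppercase only, so ordinary prose "and"/"or" does not trigger escalation.
BOOLEAN = re.compile(r"\b(AND|OR)\b")
# A superlative followed later by a temporal constraint, e.g. "most recent ... before 2023".
SUPERLATIVE = re.compile(
    r"\b(most|least|latest|earliest|newest|oldest)\b.*\b(before|after|since|until)\b",
    re.IGNORECASE,
)


def needs_reasoning(query: str) -> bool:
    """Return True if the query has logical structure that vector search tends to miss."""
    return any(p.search(query) for p in (NEGATION, BOOLEAN, SUPERLATIVE))


def plan_stages(query: str) -> list[str]:
    """List the pipeline stages to run; stage names are illustrative placeholders."""
    stages = ["vector_top20", "cheap_rerank"]
    if needs_reasoning(query):
        stages.append("reasoning_filter")
    return stages


def blended_cost(
    reasoning_cost: float,
    escalation_rate: float = 0.05,  # share of queries escalated (report's figure)
    vector_cost: float = 0.0001,    # per-query embedding cost (report's figure)
    rerank_cost: float = 0.005,     # per-query cheap re-rank cost (report's figure)
) -> float:
    """Expected per-query cost when only some queries reach the reasoning model."""
    return vector_cost + rerank_cost + escalation_rate * reasoning_cost


def savings(reasoning_cost: float) -> float:
    """Fractional saving versus sending every query to the reasoning model."""
    baseline = 0.0001 + reasoning_cost  # vector retrieval plus reasoning on every query
    return 1 - blended_cost(reasoning_cost) / baseline
```

With the report's figures, `blended_cost` works out to about $0.03 per query at a $0.50 reasoning cost and about $0.11 at $2.00, a saving of roughly 94 to 95 percent over always invoking the reasoning model. Plain regex matching will misfire on some queries (for example, a title that happens to contain "not"), so the escalation rate in practice depends on how the patterns are tuned.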
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T16:28:40.458064+00:00— report_created — created