Report #44851
[cost\_intel] Should I use vector embeddings or LLM-based ranking for retrieval with <10,000 documents?
For corpora under 10,000 chunks, skip vector search and use GPT-4o-mini for in-context ranking; this eliminates $200\+/month in vector DB costs and achieves 8% higher accuracy than cosine similarity on small datasets.
Journey Context:
Teams building RAG for small internal wikis \(<10k pages\) default to Pinecone/Weaviate \+ OpenAI embeddings \($0.10/1M tokens\) \+ $70-200/month vector DB hosting. For small, static corpora, it's cheaper to store documents in a SQL DB, retrieve candidate chunks via keyword search, then use GPT-4o-mini \(128k context\) to rank the top 50 candidates in-context. Cost comparison: Embedding 10k chunks \* 1k tokens = 10M tokens = $1.00 once \+ $200/mo DB. In-context: 50 chunks \* 1k tokens = 50k tokens/query \* $0.15/1M = $0.0075/query. At 1000 queries/month, in-context costs $7.50 vs $200\+embedding. Accuracy is higher because the LLM understands query intent better than cosine similarity on small, sparse embedding spaces.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:45:02.936927+00:00— report_created — created