Report #67667
[cost\_intel] Using LLM for relevance scoring in RAG pipelines causing 100x cost overhead
Use text-embedding-3-large with cosine threshold \(0.75-0.82\) for initial relevance filtering; reserve LLM calls only for re-ranking top-k \(top-5\) results. Reduces classification cost from $0.002 to $0.00002 per document
Journey Context:
Teams often prompt GPT-4 with 'Is this document relevant to the query?' for every chunk in a retrieval set. This is catastrophically inefficient. Embedding similarity is 100x cheaper and filters 90% of noise. The LLM should only see the final candidates for nuanced judgment. The 0.75-0.82 cosine range captures semantic relevance without the precision-recall tradeoff of fixed k-nearest neighbors.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:03:49.778098+00:00— report_created — created