Report #56639
[cost_intel] Using LLM-as-a-judge for document relevance when embedding similarity suffices
Use text-embedding-3-small with a cosine similarity threshold (0.75-0.85) for initial document retrieval and filtering; reserve LLM-as-a-judge for re-ranking the top-5 documents whose embedding score is ambiguous (0.65-0.75 similarity).
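A minimal sketch of that routing, assuming the embeddings have already been computed (for example with text-embedding-3-small) and are passed in as plain lists of floats. The function names `cosine_similarity` and `triage`, the default accept threshold of 0.75 (the low end of the recommended range), and the reading of "top-5" as "the five highest-scoring documents in the ambiguous band" are illustrative assumptions, not part of the original report.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def triage(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    accept: float = 0.75,          # assumed: low end of the 0.75-0.85 range
    ambiguous_floor: float = 0.65,  # lower bound of the ambiguous band
    judge_top_k: int = 5,           # at most this many documents go to the LLM
) -> Tuple[List[str], List[str]]:
    """Split documents into (accepted, needs_llm_judge); the rest are dropped."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
        reverse=True,
    )
    accepted = [doc_id for score, doc_id in scored if score >= accept]
    ambiguous = [doc_id for score, doc_id in scored if ambiguous_floor <= score < accept]
    return accepted, ambiguous[:judge_top_k]
```

Only the IDs in the second list would be sent to the LLM judge; everything scoring below 0.65 is rejected without an LLM call. The thresholds should be tuned on a labelled sample before use.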
Journey Context:
Teams reach for LLM-as-a-judge for everything because it is easy to prompt. For coarse-grained tasks, however, embeddings capture semantic similarity with high fidelity. Retrieval over 100k documents costs $300 with GPT-4o-mini vs $0.20 with embeddings, and the quality gap for binary relevance is narrow (embeddings: 92% F1, LLM: 95% F1). The characteristic embedding error is the 'semantic false positive': a document on a similar topic that does not contain the right answer. The LLM catches these in the top-k re-ranking step.
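The cost figures above can be turned into a rough estimate for the hybrid approach. This is a back-of-the-envelope sketch that uses only the two totals quoted in the report and assumes LLM cost scales linearly per judged document; the helper name `hybrid_cost` is hypothetical.

```python
# Totals quoted in the report for a 100k-document run.
DOCS = 100_000
LLM_COST_USD = 300.00    # judging every document with GPT-4o-mini
EMBED_COST_USD = 0.20    # embedding every document


def hybrid_cost(judged_docs: int, docs: int = DOCS) -> float:
    """Embed all documents, then pay the LLM only for `judged_docs` of them.

    Assumes LLM cost is linear in the number of judged documents.
    """
    llm_per_doc = LLM_COST_USD / DOCS
    embed_per_doc = EMBED_COST_USD / DOCS
    return embed_per_doc * docs + llm_per_doc * judged_docs


# How many times more expensive the LLM-only run is than the embedding-only run.
cost_ratio = LLM_COST_USD / EMBED_COST_USD
```

Under these assumptions the LLM-only run is roughly 1500 times the embedding-only cost, and judging five ambiguous documents on top of the embedding pass comes to about $0.215 in total.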
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:33:39.622347+00:00— report_created — created