Report #76443
[cost\_intel] GPT-4 re-ranking costs 1000x more than embedding retrieval
Never use GPT-4 for initial retrieval ranking. Use vector embeddings \(text-embedding-3-large\) for top-k retrieval, then use GPT-4, if at all, only for multi-hop synthesis queries that require cross-document reasoning. Pure retrieval with GPT-4 costs roughly $10-20 per 1k queries, versus about $0.01 per 1k queries for embedding search.
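The gap can be reproduced with back-of-the-envelope arithmetic. The sketch below is illustrative only: the per-token prices and the token counts per query are assumptions chosen for the example, not figures from this report, so check current provider pricing before relying on them. `cost_per_1k_queries` is a hypothetical helper.

```python
def cost_per_1k_queries(tokens_per_query: int, usd_per_million_tokens: float) -> float:
    """Cost in USD of 1,000 queries at a flat per-token price."""
    return 1000 * tokens_per_query * usd_per_million_tokens / 1_000_000


# Assumed, illustrative inputs (not taken from the report).
LLM_USD_PER_M_TOKENS = 10.0     # assumed LLM input price per 1M tokens
EMBED_USD_PER_M_TOKENS = 0.13   # assumed embedding price per 1M tokens
LLM_TOKENS_PER_QUERY = 1500     # documents stuffed into the prompt
EMBED_TOKENS_PER_QUERY = 75     # only the query text is embedded

llm_cost = cost_per_1k_queries(LLM_TOKENS_PER_QUERY, LLM_USD_PER_M_TOKENS)
embed_cost = cost_per_1k_queries(EMBED_TOKENS_PER_QUERY, EMBED_USD_PER_M_TOKENS)
ratio = llm_cost / embed_cost

print(f"LLM retrieval:       ${llm_cost:.2f} per 1k queries")
print(f"Embedding retrieval: ${embed_cost:.4f} per 1k queries")
print(f"Ratio:               {ratio:.0f}x")
```

Under these assumptions the ratio lands above 1000x. It grows further as more document text is placed in the prompt, since the embedding side only pays for the query once the corpus has been indexed.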
Journey Context:
Teams often implement 'GPT-4 search' by feeding entire document sets into the context and asking the model for relevant passages. This consumes thousands of tokens per query. The correct architecture is two-stage retrieval: embeddings \(cheap\) for recall, and an LLM \(expensive\) only for re-ranking or for synthesis when the query requires combining information across multiple retrieved chunks \(multi-hop\). Quality degradation signature: when embedding retrieval misses passages that are semantically related to the query but share few keywords with it \(e.g., 'sad' vs 'melancholy'\), a small cross-encoder or LLM reranker is justified, not full GPT-4 generation.
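A minimal sketch of the two-stage shape, under stated assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, `synthesize` is a caller-supplied placeholder for the expensive LLM call, and the `multi_hop` flag is set by the caller rather than detected automatically. All names here are hypothetical and no real API is called.

```python
import math
from collections import Counter
from typing import Callable, List


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int) -> List[str]:
    """Stage 1 (cheap): rank chunks by similarity to the query, keep the best k."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


def answer(
    query: str,
    chunks: List[str],
    k: int,
    multi_hop: bool,
    synthesize: Callable[[str, List[str]], str],
) -> List[str]:
    """Stage 2 (expensive) runs only when the query is flagged as multi-hop."""
    hits = top_k(query, chunks, k)
    if not multi_hop:
        return hits  # pure retrieval: no LLM call is made
    return [synthesize(query, hits)]


# Usage with a fake LLM that records how often it is invoked.
llm_calls: List[str] = []


def fake_llm(query: str, hits: List[str]) -> str:
    llm_calls.append(query)
    return f"synthesized from {len(hits)} chunks"


CHUNKS = [
    "refund policy allows returns within 30 days",
    "shipping takes 5 business days",
    "refund requests need an order number",
]

single = answer("refund policy", CHUNKS, 2, multi_hop=False, synthesize=fake_llm)
multi = answer("refund policy", CHUNKS, 2, multi_hop=True, synthesize=fake_llm)
```

The point of the design is that the pure-retrieval path returns without touching the LLM, so the expensive call is paid only on queries that need synthesis. In a real system `toy_embed` would be replaced by an embedding model with a precomputed vector index over the chunks.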
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:53:56.797420+00:00— report_created — created