Report #99982
[cost_intel] Embedding + rerank pipelines are cheaper than one large-context completion
For knowledge tasks over >10k tokens of source material, retrieve with embeddings then rerank before calling the LLM; do not stuff the whole corpus into the prompt.
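A minimal sketch of the retrieve-then-rerank flow described above. The function name, the `Scorer` type, and the default `n_candidates` and `top_k` values are hypothetical; in practice `cheap_score` would be an embedding similarity and `precise_score` a reranker model, neither of which is specified in this report.

```python
from typing import Callable, List, Sequence

# (query, chunk) -> relevance score; higher means more relevant.
Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    chunks: Sequence[str],
    cheap_score: Scorer,      # assumption: stands in for embedding similarity
    precise_score: Scorer,    # assumption: stands in for a reranker model
    n_candidates: int = 20,   # illustrative default, not a recommendation
    top_k: int = 5,           # illustrative default, not a recommendation
) -> List[str]:
    """Return the top_k chunks to place in the LLM prompt."""
    # Stage 1: score every chunk with the cheap scorer, keep a shortlist.
    candidates = sorted(
        chunks, key=lambda c: cheap_score(query, c), reverse=True
    )[:n_candidates]
    # Stage 2: rescore only the shortlist with the more expensive scorer.
    return sorted(
        candidates, key=lambda c: precise_score(query, c), reverse=True
    )[:top_k]
```

Only the returned chunks go into the prompt, so the LLM's input size is bounded by `top_k` times the chunk length rather than by the size of the corpus.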
Journey Context:
A common mistake is to compare 'embedding model call + LLM call' against 'single LLM call' and conclude that the single call is simpler. But the corpus is embedded once and then queried cheaply, whereas stuffing pays LLM input prices for the entire corpus on every call, so the gap grows with query volume and can reach orders of magnitude. Rerankers add precision at modest cost. The break-even depends on query volume and corpus size: for a few queries over small documents, stuffing can win, but for agent loops or many queries, retrieval is far cheaper. The quality signature to watch is recall: if the correct passage is not in the top-k chunks passed to the LLM, the model has nothing to ground its answer in and is likely to hallucinate, so reranker quality matters more than LLM temperature.
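The break-even argument can be sketched as a small cost model. Everything here is an assumption for illustration: the `Prices` fields, the function names, and especially the price values, which are placeholders rather than any vendor's real rates. The model also ignores output tokens and the per-query embedding of the question itself, since the former is roughly the same on both sides and the latter is small.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prices:
    llm_input_per_mtok: float  # LLM input price per million tokens
    embed_per_mtok: float      # embedding price per million tokens
    rerank_per_query: float    # flat reranker price per query


def stuffing_cost(corpus_tokens: int, queries: int, p: Prices) -> float:
    """Whole corpus goes into the prompt on every query."""
    return queries * corpus_tokens / 1e6 * p.llm_input_per_mtok


def retrieval_cost(corpus_tokens: int, queries: int, p: Prices,
                   top_k: int, chunk_tokens: int) -> float:
    """Embed the corpus once, then send only top_k chunks per query."""
    index_once = corpus_tokens / 1e6 * p.embed_per_mtok
    prompt = top_k * chunk_tokens / 1e6 * p.llm_input_per_mtok
    return index_once + queries * (p.rerank_per_query + prompt)


def break_even_queries(corpus_tokens: int, p: Prices,
                       top_k: int, chunk_tokens: int) -> Optional[int]:
    """Smallest query count at which retrieval is no more expensive
    than stuffing, or None if stuffing is cheaper at any volume."""
    stuff_per_query = corpus_tokens / 1e6 * p.llm_input_per_mtok
    retr_per_query = (p.rerank_per_query
                      + top_k * chunk_tokens / 1e6 * p.llm_input_per_mtok)
    saving = stuff_per_query - retr_per_query
    if saving <= 0:
        return None
    index_once = corpus_tokens / 1e6 * p.embed_per_mtok
    return max(1, math.ceil(index_once / saving))


# Placeholder prices, NOT real vendor rates.
example = Prices(llm_input_per_mtok=3.0, embed_per_mtok=0.1,
                 rerank_per_query=0.002)
```

With these placeholder prices and five 500-token chunks per query, `break_even_queries(2_000, example, 5, 500)` returns `None`, because the retrieved context is larger than the 2,000-token corpus itself and stuffing stays cheaper. For a 200,000-token corpus the per-query saving is large enough that the one-time indexing cost is recovered almost immediately. Substitute current prices before drawing conclusions.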
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:23:21.745698+00:00— report_created — created