Report #99083
[cost_intel] Full-context LLM is used for document Q&A where embeddings + reranker would be cheaper and more accurate
For question-answering over large corpora, embed chunks with a cheap model like text-embedding-3-small ($0.02/M tokens), retrieve the top-K candidates, rerank them, and send only the top chunks to the LLM. Use full-context models only when the task requires holistic synthesis across the entire document.
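The embed, retrieve, rerank, and send flow described above can be sketched as follows. This is a minimal illustration, not a production pipeline: `embed` is a bag-of-words stand-in for a real embedding model such as text-embedding-3-small, `rerank` is a term-overlap stand-in for a real cross-encoder reranker, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    # ASSUMPTION: stand-in for an embedding API call. A real pipeline would
    # return a dense vector from a model; here we use raw term counts.
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k):
    """First stage: cheap similarity search, keep the top-k chunks."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def rerank(query, candidates, n):
    # ASSUMPTION: stand-in for a cross-encoder reranker. It scores each
    # candidate by the fraction of query terms it contains.
    q_terms = set(tokenize(query))

    def score(chunk):
        if not q_terms:
            return 0.0
        return len(q_terms & set(tokenize(chunk))) / len(q_terms)

    return sorted(candidates, key=score, reverse=True)[:n]


def build_prompt(query, chunks, k=3, n=2):
    """Send only the reranked top-n chunks to the LLM, not the whole corpus."""
    top = rerank(query, retrieve(query, chunks, k), n)
    context = "\n---\n".join(top)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
```

In a real deployment the chunk embeddings would be computed once and stored in a vector index rather than recomputed per query as this sketch does.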
Journey Context:
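Embedding 1M tokens with text-embedding-3-small costs $0.02, and that is a one-time indexing cost. A single 128K-token GPT-4o request costs roughly $0.32 in input alone at a list price of $2.50/M input tokens, and that cost recurs on every query. RAG with reranking sends only a few thousand tokens per query, so it is usually an order of magnitude or more cheaper, and it avoids lost-in-the-middle degradation, where a model overlooks evidence buried in the middle of a long prompt. The failure signature of under-retrieval is wrong or partial answers to questions that require combining evidence from multiple distant chunks. Fix that with larger retrieval windows (a higher top-K or bigger chunks), hierarchical summaries, or hybrid search rather than defaulting to full-document stuffing. The trap is using a long-context model for every query simply because the document fits in the context window.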
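The cost comparison can be worked through with a small calculator. This is a rough sketch under stated assumptions: the embedding price is the $0.02/M figure from this report, the LLM input price of $2.50/M is an assumed list price that should be checked against current pricing, and the per-query figures (50 query tokens, 4,000 context tokens) are illustrative. Reranker and output-token costs are left out.

```python
EMBED_PRICE_PER_M = 0.02      # USD per 1M tokens, from this report
LLM_INPUT_PRICE_PER_M = 2.50  # USD per 1M input tokens, ASSUMED list price


def input_cost_usd(tokens, price_per_million_usd):
    """Cost of processing `tokens` at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million_usd


def full_context_cost(doc_tokens, queries):
    """Every query resends the whole document as LLM input."""
    return queries * input_cost_usd(doc_tokens, LLM_INPUT_PRICE_PER_M)


def rag_cost(corpus_tokens, queries, query_tokens=50, context_tokens=4_000):
    """Index the corpus once, then pay only for a small context per query."""
    index_once = input_cost_usd(corpus_tokens, EMBED_PRICE_PER_M)
    per_query = (
        input_cost_usd(query_tokens, EMBED_PRICE_PER_M)         # embed the question
        + input_cost_usd(context_tokens, LLM_INPUT_PRICE_PER_M)  # top chunks to the LLM
    )
    return index_once + queries * per_query


if __name__ == "__main__":
    queries = 1_000
    print(f"full context: ${full_context_cost(128_000, queries):.2f}")
    print(f"RAG:          ${rag_cost(1_000_000, queries):.2f}")
```

Under these assumptions, 1,000 queries cost about $320 with full-context stuffing of a 128K-token document and about $10 with RAG over a 1M-token corpus. The gap is driven almost entirely by the number of LLM input tokens per query, so it narrows as the retrieved context grows.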
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-28T05:16:37.401405+00:00— report_created — created