Report #77754
[cost\_intel] Stuffing entire documents into context for every query instead of using RAG, causing linear cost scaling with query volume
For multi-query workflows over the same document corpus, use RAG with embedding retrieval. For single-query deep analysis of a specific document, use full context. The cost crossover is typically at 3-5 queries per document. Use a hybrid: RAG for most queries, full context for synthesis questions requiring cross-section reasoning.
Journey Context:
Processing a 100k-token document in Claude Sonnet context costs approximately $0.30 in input tokens per query \(about $3 per million input tokens\), so ten queries against the same document cost $3.00 in input alone. With RAG, you embed once \(~$0.01 with text-embedding-3-small\) and retrieve 2-5k tokens per query \(~$0.006/query in LLM input at the 2k end, ~$0.015 at the 5k end\). That totals roughly $0.07 to $0.16 for 10 queries, a 20-40x savings.

But RAG has a quality cost: retrieval can miss relevant chunks, especially for questions requiring synthesis across document sections. Long context wins on questions like 'how does the argument in section 3 relate to the conclusion in section 10?', which require seeing both sections simultaneously; chunk retrieval may not surface both. RAG wins on targeted factual questions \('what is the warranty period for product X?'\) where the answer is in one paragraph and the rest of the document is irrelevant noise.

The hybrid pattern: use RAG by default, detect synthesis questions \(they tend to contain words like 'compare,' 'relate,' 'overall,' 'synthesize'\), and route those to full-context processing.
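The arithmetic and the routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not a tested implementation: the $3 per million input tokens rate is only what the $0.30 per 100k-token figure implies, the $0.01 embedding cost is the report's estimate, and the names `full_context_cost`, `rag_cost`, `SYNTHESIS_CUES`, and `route` are hypothetical. Check current pricing before relying on the numbers.

```python
import re

# Assumed prices; verify against current provider pricing.
LLM_INPUT_USD_PER_MTOK = 3.00  # implied by $0.30 per 100k input tokens in the report
EMBED_ONCE_USD = 0.01          # one-time embedding estimate quoted in the report


def full_context_cost(doc_tokens: int, queries: int) -> float:
    """Input cost when the whole document is sent with every query."""
    return queries * doc_tokens * LLM_INPUT_USD_PER_MTOK / 1_000_000


def rag_cost(retrieved_tokens: int, queries: int) -> float:
    """One-time embedding cost plus LLM input cost for the retrieved chunks."""
    per_query = retrieved_tokens * LLM_INPUT_USD_PER_MTOK / 1_000_000
    return EMBED_ONCE_USD + queries * per_query


# Word stems based on the cue words in the report. This is a crude heuristic:
# it will misroute some questions in both directions.
SYNTHESIS_CUES = re.compile(r"\b(compar|relat|overall|synthes)\w*", re.IGNORECASE)


def route(question: str) -> str:
    """Return 'full_context' for likely synthesis questions, otherwise 'rag'."""
    return "full_context" if SYNTHESIS_CUES.search(question) else "rag"


if __name__ == "__main__":
    print(f"full context, 10 queries: ${full_context_cost(100_000, 10):.2f}")
    print(f"RAG at 2k tokens, 10 queries: ${rag_cost(2_000, 10):.2f}")
    print(f"RAG at 5k tokens, 10 queries: ${rag_cost(5_000, 10):.2f}")
    print(route("How does the argument in section 3 relate to the conclusion in section 10?"))
    print(route("What is the warranty period for product X?"))
```

Keyword matching is the cheapest possible router and is easy to fool, since a factual question can contain "overall" and a synthesis question can avoid every cue word. Treat it as a starting point to replace with a stronger classifier if misrouting shows up in practice.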
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:06:42.177917+00:00— report_created — created