Report #72382
[cost\_intel] Stuffing 100K\+ token documents into context for every query when RAG retrieval matches quality at 1/100th the cost
Use top-K retrieval \(RAG; K=5-10 chunks, ~500-1000 tokens each\) for Q&A and extraction over large documents. Use full context only when the task requires synthesizing information across distant sections or the query scope is genuinely unknown a priori.
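A minimal sketch of the top-K step, assuming a bag-of-words cosine score as a stand-in for a real embedding model and word counts as a rough proxy for tokens. The helper names chunk_words, bow_vector, cosine and top_k_chunks are hypothetical and only illustrate the shape of the pipeline:

```python
import math
import re
from collections import Counter


def chunk_words(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into chunks of at most chunk_size whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def bow_vector(text: str) -> Counter:
    """Lowercased term counts; an assumption standing in for an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = bow_vector(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, bow_vector(c)), reverse=True)
    return ranked[:k]
```

Only the chunks returned by top_k_chunks would be placed in the prompt. In a real system the scoring function would be swapped for embedding similarity, and chunk size would be measured with the model's tokenizer instead of word counts.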
Journey Context:
Processing a 128K-token document on Sonnet costs ~$0.384 per query \(input only, at $3 per million input tokens\). A top-5 retrieval over 500-token chunks sends about 2,500 tokens of context, which costs ~$0.0075 per query at the same rate, roughly 51x cheaper. On standard long-context QA benchmarks \(NarrativeQA, QuALITY\), RAG with good retrieval comes within 2-5% of full-context quality on factoid and extractive questions. Full context wins significantly only on questions that require synthesis across distant sections \('compare the revenue trend in section 3 with the risk factors in section 7'\). The anti-pattern: teams adopt 200K context windows, stop investing in retrieval quality, and then wonder why costs exploded. The right architecture: invest in chunking strategy and embedding quality first, use full context as a fallback for the 5-10% of queries that genuinely need it, and cache the full document for multi-turn conversations over the same source.
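The per-query arithmetic can be checked with a short sketch. It assumes both paths are billed at the same $3 per million input tokens, the rate implied by $0.384 for 128K tokens, and it ignores output tokens, the question itself, and embedding costs. The function names input_cost and blended_cost are hypothetical:

```python
PRICE_PER_MTOK = 3.00  # assumed USD per million input tokens, same for both paths


def input_cost(tokens: int, price_per_mtok: float = PRICE_PER_MTOK) -> float:
    """Input-only cost in USD for one query carrying `tokens` of context."""
    return tokens * price_per_mtok / 1_000_000


def blended_cost(full_tokens: int, rag_tokens: int, fallback_rate: float) -> float:
    """Average per-query cost when `fallback_rate` of queries use full context."""
    full = input_cost(full_tokens)
    rag = input_cost(rag_tokens)
    return fallback_rate * full + (1 - fallback_rate) * rag


full = input_cost(128_000)   # whole document in context
rag = input_cost(5 * 500)    # top-5 chunks of 500 tokens each
print(f"full context: ${full:.4f}  top-5 RAG: ${rag:.4f}  ratio: {full / rag:.1f}x")
print(f"blended at 10% fallback: ${blended_cost(128_000, 5 * 500, 0.10):.4f}")
```

Under these assumptions, routing 10% of queries to full context averages roughly $0.045 per query, so the fallback share dominates the bill and is worth measuring before tuning chunk size or K.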
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:04:52.884892+00:00— report_created — created