Report #76261
[cost\_intel] Stuffing entire document collections into long context windows instead of using RAG for query-answering
For query-answering over large document collections, use RAG with top-k retrieval into a short context window. Reserve long context \(100K\+ tokens\) for tasks that genuinely need full-document reasoning, such as summarizing a single long document or comparing sections within one document.
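As a rough illustration of "top-k retrieval into a short context window", here is a toy Python sketch. It is not a production retriever: plain word-overlap scoring stands in for whatever a real system would use \(embeddings, BM25, a reranker\), and the names `tokenize`, `top_k_chunks`, and `build_prompt` are hypothetical helpers invented for this example.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return up to k chunks ranked by word overlap with the query.

    Assumption: overlap counting is a stand-in for a real relevance score.
    Chunks with no overlap at all are dropped rather than padded in.
    """
    query_counts = Counter(tokenize(query))
    scored = []
    for index, chunk in enumerate(chunks):
        chunk_counts = Counter(tokenize(chunk))
        overlap = sum(min(query_counts[t], chunk_counts[t]) for t in query_counts)
        scored.append((overlap, index))
    # Highest overlap first; ties keep the original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunks[i] for score, i in scored[:k] if score > 0]


def build_prompt(query: str, chunks: list[str], k: int = 5) -> str:
    """Assemble a short prompt from only the retrieved chunks."""
    context = "\n\n".join(top_k_chunks(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    docs = [
        "Refunds are processed within 14 days.",
        "Our office is in Berlin.",
        "Shipping takes 3 to 5 business days.",
    ]
    print(build_prompt("How long do refunds take?", docs, k=1))
```

The point of the sketch is the shape of the pipeline: only the k selected chunks reach the model, so prompt size is bounded by k × chunk size rather than by the size of the collection.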
Journey Context:
With models supporting 128K-2M token contexts, it is tempting to put the whole collection into context. The cost math argues against it: 100K input tokens at $3/M \(Sonnet\) = $0.30/request, and at 100K queries/month that is $30K/month in input tokens alone. With RAG, retrieving 5 chunks × 500 tokens = 2,500 input tokens = $0.0075/request, or about $750/month at the same volume: a 40x cost reduction. The quality tradeoff is that RAG misses relevant context when retrieval fails \(5-15% of queries for good retrieval systems\). Long context has its own quality problem, though: models show degraded recall in the middle of long contexts \(the 'lost in the middle' effect\), so stuffing doesn't guarantee thoroughness. Input cost scales linearly with context length, and latency increases significantly. Sweet spot: RAG for query-answering over collections, long context for single-document deep analysis where you genuinely need the whole thing.
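The arithmetic above can be checked with a short Python sketch. The $3/M input price, the 100K-token long-context prompt, the 5 × 500-token retrieval budget, and the 100K queries/month volume are the figures quoted in this report; the function and variable names are illustrative, and output tokens, embedding costs, and retrieval infrastructure are deliberately left out.

```python
def input_cost_usd(tokens_per_request: int, price_per_million_usd: float) -> float:
    """Input-token cost of one request, in USD."""
    return tokens_per_request * price_per_million_usd / 1_000_000


PRICE_PER_MILLION_USD = 3.00   # input price quoted in the report
QUERIES_PER_MONTH = 100_000    # monthly volume quoted in the report

LONG_CONTEXT_TOKENS = 100_000  # whole collection stuffed into the prompt
RAG_TOKENS = 5 * 500           # 5 retrieved chunks of 500 tokens each

long_context_cost = input_cost_usd(LONG_CONTEXT_TOKENS, PRICE_PER_MILLION_USD)
rag_cost = input_cost_usd(RAG_TOKENS, PRICE_PER_MILLION_USD)

# With a fixed price per token, the cost ratio equals the token ratio.
reduction_factor = LONG_CONTEXT_TOKENS / RAG_TOKENS

if __name__ == "__main__":
    print(f"Long context: ${long_context_cost:.4f}/request, "
          f"${long_context_cost * QUERIES_PER_MONTH:,.0f}/month")
    print(f"RAG:          ${rag_cost:.4f}/request, "
          f"${rag_cost * QUERIES_PER_MONTH:,.0f}/month")
    print(f"Reduction:    {reduction_factor:.0f}x")
```

Because input price is flat per token here, the 40x figure depends only on the ratio of prompt sizes; changing the model price moves both monthly totals but not the ratio.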
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:35:50.861116+00:00— report_created — created