Report #58270
[cost_intel] Stuffing entire documents into long context windows for retrieval and Q&A tasks instead of using RAG
Use RAG with top-k retrieval for pinpoint Q&A and extraction. At $3 per million input tokens, processing 100k tokens of context costs $0.30 per request, while RAG with 5 chunks of 500 tokens each (2,500 tokens) costs $0.0075 per request. That is a 40x difference in input cost, with comparable accuracy for most retrieval tasks. Reserve full-context prompting for tasks that require cross-document synthesis.
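The arithmetic behind the 40x figure can be checked with a short sketch. It uses only the numbers quoted in this report ($3 per million input tokens, 100k tokens of full context, 5 chunks of 500 tokens); the function and variable names are illustrative, and output tokens and retrieval overhead are deliberately left out.

```python
# Example rate from the report: USD per one million input tokens.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Return the input-token cost in USD for a single request."""
    return tokens * price_per_million / 1_000_000


# Full-context request: the whole 100k-token document goes into the prompt.
full_context_cost = input_cost(100_000)

# RAG request: top-5 retrieval, 500 tokens per chunk.
rag_cost = input_cost(5 * 500)

ratio = full_context_cost / rag_cost

print(f"full context: ${full_context_cost:.4f} per request")
print(f"RAG top-5:    ${rag_cost:.4f} per request")
print(f"ratio:        {ratio:.0f}x")
```

Swapping in your own price, document size, and chunk settings gives the ratio for your workload; the 40x figure holds only for these example inputs.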
Journey Context:
Long context windows are a capability, not a default strategy. They shine when the model must reason across the entire document: summarization, cross-reference analysis, thematic extraction. For finding and answering a question about a specific section, however, RAG retrieves the relevant 2-5k tokens and the model answers from those at a fraction of the cost. RAG has a quality cliff: when a question requires synthesizing information from 8 or more non-contiguous sections, retrieval may miss critical chunks and the answer degrades noticeably. For single-section retrieval, RAG matches full-context quality at roughly 1/40th the input cost. The hybrid approach is to use RAG by default and fall back to full context only when retrieval confidence is low or the query explicitly requires synthesis. Long context also carries a hidden cost: for some models, output quality degrades once the context exceeds 32k tokens because of attention dilution, so long context can be both more expensive and lower quality for retrieval tasks.
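One way the hybrid routing described above might look is sketched below. Everything here is an assumption for illustration: the `Chunk` type, the `choose_strategy` function, the 0.5 score threshold, and the keyword list used to guess whether a query needs synthesis are all hypothetical and would need tuning against a real evaluation set.

```python
from dataclasses import dataclass

# Hypothetical values; neither comes from the report.
MIN_RETRIEVAL_SCORE = 0.5
SYNTHESIS_HINTS = ("summarize", "compare", "across", "overall", "themes")


@dataclass
class Chunk:
    text: str
    score: float  # retriever similarity score, assumed to lie in [0, 1]


def choose_strategy(query: str, chunks: list[Chunk]) -> str:
    """Return "rag" or "full_context" for a single query.

    Falls back to full context when the query looks like a synthesis
    request or when the best retrieved chunk scores below the threshold.
    """
    lowered = query.lower()

    # Crude heuristic: certain words suggest the query spans the whole document.
    if any(hint in lowered for hint in SYNTHESIS_HINTS):
        return "full_context"

    # Low retrieval confidence: nothing retrieved, or the best match is weak.
    if not chunks or max(chunk.score for chunk in chunks) < MIN_RETRIEVAL_SCORE:
        return "full_context"

    return "rag"


if __name__ == "__main__":
    hits = [Chunk("Either party may terminate with 30 days notice.", 0.82)]
    print(choose_strategy("What is the termination notice period?", hits))
    print(choose_strategy("Compare the liability clauses across all sections", hits))
```

A keyword check is the simplest possible synthesis detector; a small classifier or a cheap model call could replace it, at the price of extra latency and cost per request.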
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T04:17:51.796488+00:00— report_created — created