Report #63105
[cost_intel] Using full context window is fine now that models support 128K+ tokens
Stuffing 100K tokens of context costs roughly 20–50x more in input tokens than RAG retrieval of 2–5K relevant tokens, often with equal or worse quality due to attention dilution in long contexts. At Sonnet pricing (the figures here work out to $3 per million input tokens): 100K input tokens = $0.30/request vs 3K RAG tokens = $0.009/request, a ratio of about 33x. At 100K requests/month, that is $30,000 vs $900. Use full context only when the task genuinely requires holistic understanding of the entire document.
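The arithmetic above can be reproduced with a short sketch. The price constant is an assumption derived from the report's own "$0.30 per 100K tokens" figure, the function name is hypothetical, and only input-token cost is counted (no output tokens, no retrieval infrastructure).

```python
# Assumed input price, implied by the report's "$0.30 per 100K input tokens".
PRICE_PER_MILLION_INPUT_TOKENS_USD = 3.00


def input_cost_usd(tokens_per_request: int, requests: int = 1) -> float:
    """Input-token cost in USD for `requests` requests of the given size."""
    total_tokens = tokens_per_request * requests
    return total_tokens * PRICE_PER_MILLION_INPUT_TOKENS_USD / 1_000_000


REQUESTS_PER_MONTH = 100_000

# Full-context request: 100K input tokens. RAG request: 3K input tokens.
full_context_monthly = input_cost_usd(100_000, REQUESTS_PER_MONTH)
rag_monthly = input_cost_usd(3_000, REQUESTS_PER_MONTH)
cost_ratio = full_context_monthly / rag_monthly

print(f"full context: ${full_context_monthly:,.0f}/month")
print(f"RAG:          ${rag_monthly:,.0f}/month")
print(f"ratio:        {cost_ratio:.1f}x")
```

Changing the token counts to 2K or 5K shows where the 20–50x range comes from.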
Journey Context:
The trap: 128K+ context windows feel like they eliminate the need for RAG infrastructure. They do not, for three compounding reasons. First, cost: you pay per input token, and 100K tokens is expensive on every single request. Second, quality: the well-documented 'lost in the middle' effect shows that models degrade at retrieving and reasoning about information in the middle of long contexts; accuracy for facts in the middle of a 100K context can drop 20–40% vs facts near the start or end. Third, latency: processing 100K tokens takes substantially longer, degrading user experience. The right pattern: use RAG to retrieve 2–5K relevant tokens, then use the model to reason over those focused tokens. Reserve full context for tasks like 'summarize this entire document thematically' or 'find contradictions across this full report', where holistic coverage is the point. The cost difference is so extreme that even a RAG system with only 70% retrieval accuracy is cheaper per correct answer than full context: three independent retrieval attempts, each with its own model call, cost about 3 × $0.009 = $0.027 in input tokens, less than a tenth of the $0.30 for one full-context request.
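The cost-per-correct-answer claim can be checked with a minimal sketch. It makes several assumptions: the $3 per million input tokens price implied by the report's figures, attempts that succeed independently, a retry scheme in which any one correct attempt yields a correct answer, and a full-context baseline that is generously treated as always correct. The function name is hypothetical, and only input-token cost is counted.

```python
# Assumed input price, implied by the report's "$0.30 per 100K input tokens".
PRICE_PER_MILLION_INPUT_TOKENS_USD = 3.00


def cost_per_correct_answer(tokens: int, accuracy: float, attempts: int = 1) -> float:
    """Expected input-token cost (USD) per correct answer.

    Assumes attempts are independent and that one correct attempt is enough.
    """
    cost = attempts * tokens * PRICE_PER_MILLION_INPUT_TOKENS_USD / 1_000_000
    p_at_least_one_correct = 1 - (1 - accuracy) ** attempts
    return cost / p_at_least_one_correct


# Full context: 100K tokens, assumed 100% accurate (the most favorable case).
full_context = cost_per_correct_answer(100_000, accuracy=1.0)

# RAG: 3K tokens per attempt at 70% retrieval accuracy, one or three attempts.
rag_single = cost_per_correct_answer(3_000, accuracy=0.7)
rag_triple = cost_per_correct_answer(3_000, accuracy=0.7, attempts=3)

print(f"full context:   ${full_context:.4f} per correct answer")
print(f"RAG, 1 attempt: ${rag_single:.4f} per correct answer")
print(f"RAG, 3 attempts: ${rag_triple:.4f} per correct answer")
```

Under these assumptions RAG stays well below the full-context cost even with three attempts. In practice the gap narrows once retrieval infrastructure, output tokens, and the logic for choosing among attempts are priced in.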
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T12:24:15.174896+00:00— report_created — created