Report #87439
[cost\_intel] When does stuffing large context windows in RAG silently 10x costs vs chunked retrieval
Never send >8k tokens of retrieved context to frontier models for single-answer RAG regardless of 100k\+ context window availability. Use recursive summarization or rerank-top-3 chunks. Sending 50k tokens of 'relevant' context costs $0.15-$0.75 per query \(GPT-4/Claude\) vs $0.02 for chunked retrieval with negligible accuracy drop on single-hop questions.
Journey Context:
The 100k context window promise tempts teams into 'dump everything' RAG. This is economically catastrophic. Claude 3.5 Sonnet at 100k input tokens costs $3 per 1M input tokens = $0.30 per 100k query. GPT-4o is $2.50 per 1M = $0.25 per 100k. Versus chunked retrieval: retrieve top-3 chunks \(1.5k tokens total\) = $0.0045 per query. That's a 55x cost difference. The quality myth: frontier models do NOT reliably attend to middle sections of 100k contexts \(lost in the middle phenomenon\). Quality actually drops vs targeted chunks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T05:21:21.108320+00:00— report_created — created