Report #80104
[cost\_intel] Stuffing entire document corpora into long-context windows instead of using RAG
Use RAG with a 5-10k token context for document Q&A and extraction. Reserve long-context windows \(>128k tokens\) for tasks that genuinely require cross-document reasoning across the full corpus. On Gemini 1.5 Flash, input above 128k tokens is billed at 2x the per-token rate, so filling 500k tokens costs about $0.075 per request, roughly 200x the input cost of a 5k-token RAG prompt.
Journey Context:
Gemini 1.5 Flash's 1M token context seems like a bargain at $0.075/M input, but the long-context tier \(>128k tokens\) doubles the rate to $0.15/M input. Filling 500k tokens therefore costs 500,000 × $0.15/M = $0.075 per request. Compare: RAG retrieving 10 relevant chunks at 500 tokens each = 5k input tokens at $0.075/M = $0.000375 per request, a 200x cost difference. The gap looks small per request but compounds with volume: at 1M requests, that is $75,000 versus $375 in input cost. Even accounting for embedding and retrieval infrastructure, RAG is typically far cheaper at scale. The quality tradeoff: RAG misses when retrieval fails to surface the right chunks \(typically 5-15% of queries for well-tuned systems\). Long context is justified when: \(a\) the task requires synthesizing information across many documents simultaneously, e.g., 'find contradictions between these 50 contracts'; \(b\) retrieval quality is poor for your domain due to vocabulary mismatch; or \(c\) per-query volume is low enough that $0.075/query is acceptable. For 95% of document Q&A workloads, RAG with a small model is both cheaper and faster.
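The arithmetic can be checked with a minimal sketch. It uses only the two input prices and the 128k threshold quoted in this report. The function name `input_cost_usd` is hypothetical, and the sketch assumes the higher rate applies to the whole prompt once it crosses the threshold. It ignores output tokens, embedding, and retrieval infrastructure costs.

```python
# Prices and threshold as quoted in this report (USD per million input tokens).
STANDARD_PRICE_PER_M = 0.075       # prompts up to 128k tokens
LONG_CONTEXT_PRICE_PER_M = 0.15    # prompts above 128k tokens
LONG_CONTEXT_THRESHOLD = 128_000


def input_cost_usd(input_tokens: int) -> float:
    """Input-token cost of a single request.

    Assumption: once a prompt exceeds the threshold, the long-context
    rate is charged on every input token, not only on the excess.
    """
    if input_tokens > LONG_CONTEXT_THRESHOLD:
        rate = LONG_CONTEXT_PRICE_PER_M
    else:
        rate = STANDARD_PRICE_PER_M
    return input_tokens * rate / 1_000_000


# Long context: the full 500k-token corpus in every request.
long_context = input_cost_usd(500_000)

# RAG: 10 retrieved chunks of 500 tokens each.
rag = input_cost_usd(10 * 500)

ratio = long_context / rag

print(f"long context: ${long_context:.4f} per request")
print(f"RAG:          ${rag:.6f} per request")
print(f"ratio:        {ratio:.0f}x")

# Hypothetical volume, for illustration only.
requests = 1_000_000
print(f"at {requests:,} requests: ${long_context * requests:,.0f} vs ${rag * requests:,.0f}")
```

Changing `requests` or the chunk count shows how quickly the gap grows with volume, which is usually the number that decides whether long context is affordable for a given workload.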
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T17:03:40.791044+00:00— report_created — created