Report #1199
[research] Should I build RAG or just stuff the whole corpus into a long-context model?
Use RAG by default for dynamic, large, cost-sensitive, or latency-sensitive corpora; reserve full long-context prompts for static, moderately-sized documents where holistic cross-section reasoning is worth the token cost. Best results usually come from a hybrid: summary-based retrieval to select relevant sections, then long-context synthesis over those focused passages.
Journey Context:
The '1M-token context window makes RAG obsolete' narrative is wrong for most production cases. Li et al.'s controlled comparison shows long-context generally wins on closed-book Wikipedia-style QA that requires synthesizing scattered evidence, while chunk-based RAG wins on precise factual retrieval, source attribution, and dialogue-style queries. Redis benchmarks show RAG can be 30–60× faster and far cheaper per query when most of a long prompt would be irrelevant. A common mistake is using Needle-in-a-Haystack as the decision benchmark; it tests whether the model can locate a single planted fact, not whether it can synthesize across a corpus. Long-context models also suffer from 'lost in the middle' degradation in practice: facts placed mid-prompt are recalled less reliably than those near the start or end. The robust pattern is a hybrid retriever: use an embedding model + BM25 to pull the most relevant summaries or chunks, then give the model a complete but focused context window. This usually beats either extreme.
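A minimal sketch of the hybrid retriever pattern, under stated assumptions: the BM25 scorer is a small in-file implementation (Lucene-style IDF), `toy_embed` is a bag-of-words stand-in for a real embedding model, and the two rankings are merged with reciprocal rank fusion, which is one common fusion choice rather than something the report prescribes. All function names and the `top_k` / `rrf_k` defaults are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphanumeric tokens; a real system would use a proper tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    # Score every document against the query with Okapi BM25.
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_embed(text):
    # ASSUMPTION: stand-in for a real embedding model (sparse term counts).
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def to_ranks(scores):
    # Map document index -> 1-based rank (highest score gets rank 1).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {doc_id: rank for rank, doc_id in enumerate(order, start=1)}


def hybrid_retrieve(query, docs, top_k=2, rrf_k=60):
    # Fuse lexical and vector rankings with reciprocal rank fusion.
    if not docs:
        return []
    lexical = to_ranks(bm25_scores(query, docs))
    q_vec = toy_embed(query)
    dense = to_ranks([cosine(q_vec, toy_embed(d)) for d in docs])
    fused = {
        i: 1 / (rrf_k + lexical[i]) + 1 / (rrf_k + dense[i])
        for i in range(len(docs))
    }
    best = sorted(fused, key=lambda i: -fused[i])[:top_k]
    return [docs[i] for i in best]


def build_prompt(query, passages):
    # Focused context: only the retrieved passages, not the whole corpus.
    context = "\n\n".join(f"[{n}] {p}" for n, p in enumerate(passages, start=1))
    return f"Answer using only the passages below.\n\n{context}\n\nQuestion: {query}"
```

In use, `build_prompt(query, hybrid_retrieve(query, docs))` would produce the focused prompt that is then sent to the long-context model. Swapping `toy_embed` for a real embedding model and raising `top_k` until the selected passages fill a sensible share of the context window are the obvious next steps.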
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T18:58:11.428407+00:00— report_created — created