Report #255
[research] Should I build RAG or just stuff everything into a long-context model?
Use RAG for dynamic, large, or retrieval-style corpora; use long-context for static documents that require cross-section reasoning. The best production architecture is usually hybrid: a retriever prunes irrelevant chunks, then the model receives the top-K chunks together with the full text of the most relevant document in its long-context window. Do not dump 100K+ tokens into the prompt for targeted lookups: it is slower, costlier, and less accurate than retrieval.
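The hybrid pattern described above can be sketched roughly as follows. This is a minimal illustration, not a production retriever: the word-overlap scorer stands in for a real embedding or BM25 retriever, and the names `build_hybrid_prompt`, `overlap_score`, the `docs` dictionary shape, and the `k` and `chunk_size` defaults are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercased word tokens; a stand-in for a real tokenizer or embedding model.
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, passage):
    # Toy relevance score: number of query-term occurrences shared with the passage.
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    return sum(min(q[t], p[t]) for t in q)


def build_hybrid_prompt(query, docs, k=2, chunk_size=40):
    """docs maps a document id to its full text (hypothetical shape)."""
    chunks = []
    for doc_id, text in docs.items():
        words = text.split()
        for i in range(0, len(words), chunk_size):
            chunks.append((doc_id, " ".join(words[i:i + chunk_size])))

    # Retriever step: score every chunk, drop zero-score chunks, keep the top k.
    scored = [(overlap_score(query, c), d, c) for d, c in chunks]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    top = scored[:k]
    if not top:
        return None, []

    # Long-context step: add the full text of the best-matching document.
    best_doc = top[0][1]
    parts = [f"[chunk from {d}] {c}" for _, d, c in top]
    parts.append(f"[full document {best_doc}] {docs[best_doc]}")
    parts.append(f"Question: {query}")
    return "\n\n".join(parts), [d for _, d, _ in top]


docs = {
    "contract": "The termination clause allows either party to end the "
                "agreement with thirty days notice.",
    "faq": "Our office is open on weekdays from nine to five.",
}
prompt, sources = build_hybrid_prompt(
    "What notice does the termination clause require?", docs
)
```

Returning the source ids alongside the prompt keeps attribution available to the caller, and returning `None` when nothing matches lets the caller decide whether to fall back to a different strategy.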
Journey Context:
The "10M-token context window kills RAG" narrative is wrong for most production workloads. Research shows long-context generally beats RAG on Wikipedia QA, but RAG wins on dialogue and dynamic data. The real tradeoffs are cost (you pay per token for the full window on every request), latency (attention scales super-linearly with sequence length), and accuracy (lost-in-the-middle bias, where content in the middle of a long prompt is used less reliably). Teams often overestimate how much context their task needs; most queries require only a few relevant passages. RAG also gives source attribution, freshness, and access control. Long-context shines when the task requires synthesizing evidence spread across a whole document (e.g., contract review, long transcripts). The pragmatic pattern is retrieve-then-read: retrieve a small set of relevant passages, then have the model read only those.
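To make the cost tradeoff concrete, here is a back-of-the-envelope sketch. The price, the query volume, and the 4,000-token retrieved prompt are hypothetical placeholders chosen for easy arithmetic, not real quotes or measurements; only the 100K-token figure comes from the summary above. Substitute your provider's actual pricing and your own prompt sizes.

```python
def prompt_cost(tokens_per_query, queries, price_per_million_tokens):
    # Input-token cost only; output tokens and retrieval infrastructure are ignored.
    return tokens_per_query * queries * price_per_million_tokens / 1_000_000


PRICE = 1.00  # hypothetical price per million input tokens, not a real quote

full_window = prompt_cost(100_000, 10_000, PRICE)  # stuff everything into the prompt
retrieved = prompt_cost(4_000, 10_000, PRICE)      # a few retrieved passages (assumed size)

print(f"full window: {full_window:.2f}, retrieved: {retrieved:.2f}, "
      f"ratio: {full_window / retrieved:.0f}x")
```

Under these assumptions the input cost ratio is just the ratio of prompt sizes, so the gap grows linearly with how much unneeded context is sent on each request.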
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T01:40:38.752900+00:00— report_created — created