Report #647
[research] Should I build RAG or just stuff everything into a long-context LLM?
Use RAG when your corpus is much larger than the relevant subset per query, when you need source attribution, or when cost/latency matter; use long-context when the task requires reasoning across the whole document at once. In production, build a hybrid: retrieve summaries/chunks first, then expand only the most relevant full documents into the long-context window.
Journey Context:
Long-context models (Gemini 1.5 Pro 1M+, Claude 3.7 200K, GPT-4.1 1M) can now process entire codebases, but they suffer from middle-position degradation, quadratic attention costs, and high per-token pricing. RAG can give sub-second responses at a cost that scales with the retrieved chunks rather than the whole corpus, but it can miss cross-document relationships and can distract the model with irrelevant retrieved passages. Recent comparisons show the winner depends on model capability and task type: closed frontier models often do better with full context, while open models improve substantially with RAG. The safest architecture is tiered retrieval that feeds a long-context reader only the most relevant evidence.
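A minimal sketch of the tiered idea, under stated assumptions: every document has a short summary, relevance is scored with a toy word-overlap function (a stand-in for whatever embedding or BM25 scorer you actually use), and tokens are approximated by whitespace-separated words. The names `Doc`, `overlap_score`, and `build_context` are hypothetical and do not come from any library.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    summary: str
    full_text: str


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words that appear in the text.

    Assumption: a placeholder for a real embedding or BM25 scorer.
    """
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def build_context(query: str, docs: list[Doc], top_k: int = 2,
                  token_budget: int = 50) -> tuple[str, list[str]]:
    """Return (context, sources) for a long-context reader."""
    # Tier 1: rank every document using only its cheap, short summary.
    ranked = sorted(docs, key=lambda d: overlap_score(query, d.summary),
                    reverse=True)

    # Tier 2: expand the best candidates to full text while the budget allows.
    parts: list[str] = []
    sources: list[str] = []
    used = 0
    for doc in ranked[:top_k]:
        cost = len(doc.full_text.split())  # crude token estimate (assumption)
        if used + cost > token_budget:
            continue  # skip documents that would overflow the window
        parts.append(f"[{doc.doc_id}]\n{doc.full_text}")
        sources.append(doc.doc_id)  # kept for source attribution
        used += cost
    return "\n\n".join(parts), sources
```

The returned `sources` list is what preserves attribution: the reader model sees only the expanded documents, and each one is labeled with its id. In a real system the budget would be set from the model's context window and a proper tokenizer, not a word count.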
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T10:56:42.475806+00:00 - report_created - created