Report #866
[research] Should I use RAG or just stuff everything into a long-context LLM?
Use a hybrid: retrieve summaries or chunks first, then load full documents into the long context only when the task requires cross-document reasoning. For large, dynamic, or fact-specific corpora, RAG is cheaper, faster, and often more accurate; for static, reasoning-heavy documents that fit in the window, long-context is simpler.
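The rule of thumb above can be written down as a small routing function. This is only a sketch of the heuristic as stated in this report: the function name, parameter names, and the three return labels are hypothetical, and the inputs (token counts, whether the corpus changes, whether the task needs cross-document reasoning) are assumed to be known by the caller.

```python
def choose_strategy(corpus_tokens, window_tokens, is_dynamic, needs_cross_doc_reasoning):
    """Pick a strategy label from the report's rule of thumb (hypothetical helper)."""
    fits_in_window = corpus_tokens <= window_tokens

    # Static, reasoning-heavy material that fits in the window: just load it all.
    if fits_in_window and not is_dynamic and needs_cross_doc_reasoning:
        return "long_context"

    # Cross-document reasoning over a corpus that is too large or changes often:
    # retrieve first, then load only the selected full documents.
    if needs_cross_doc_reasoning:
        return "hybrid"

    # Fact-specific lookups: plain retrieval.
    return "rag"


if __name__ == "__main__":
    print(choose_strategy(50_000, 200_000, is_dynamic=False, needs_cross_doc_reasoning=True))
    print(choose_strategy(5_000_000, 200_000, is_dynamic=True, needs_cross_doc_reasoning=True))
    print(choose_strategy(5_000_000, 200_000, is_dynamic=True, needs_cross_doc_reasoning=False))
```

Real deployments would likely weigh latency and cost targets as well; the thresholds here are deliberately binary to keep the decision visible.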
Journey Context:
The common mistake is assuming million-token context windows make retrieval obsolete. In practice, transformer attention is O(n^2) in sequence length, so latency and cost rise sharply, and models still suffer from lost-in-the-middle effects. Li et al.'s evaluation (arXiv:2501.01880) found that long context generally beats RAG on Wikipedia-style QA, but summarization-based retrieval performs comparably, while chunk-based retrieval lags; RAG remains stronger for dialogue and general queries. The emerging production pattern is smart layering: embeddings for selection, long context for synthesis.
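A minimal sketch of the layering pattern (select with embeddings, then synthesize over full documents) might look like the following. Everything here is illustrative: the bag-of-words `embed` function is a toy stand-in for a real embedding model, the whitespace word count is a crude proxy for tokens, and the sample data, function names, and budget are hypothetical. The returned context string is what would be passed to a long-context model.

```python
import math
import re
from collections import Counter

# Hypothetical sample corpus: short summaries for selection, full text for synthesis.
SUMMARIES = {
    "billing": "invoice payment refund policy",
    "auth": "login password reset tokens",
    "deploy": "kubernetes rollout rollback steps",
}
FULL_DOCS = {
    "billing": "Refunds are issued to the original payment method within five days.",
    "auth": "To reset a password, open Settings and request a reset link.",
    "deploy": "Roll out with a canary first, then promote or roll back.",
}


def embed(text):
    """Toy stand-in for an embedding model: lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_documents(query, summaries, top_k=2):
    """Stage 1: rank documents by similarity between the query and each summary."""
    q = embed(query)
    ranked = sorted(summaries, key=lambda doc_id: cosine(q, embed(summaries[doc_id])), reverse=True)
    return ranked[:top_k]


def build_context(query, summaries, full_docs, token_budget=50, top_k=2):
    """Stage 2: load full text of the selected documents while it fits the budget."""
    parts, used = [], 0
    for doc_id in select_documents(query, summaries, top_k):
        cost = len(full_docs[doc_id].split())  # crude whitespace "token" count
        if used + cost > token_budget:
            continue  # skip documents that would overflow the budget
        parts.append(f"[{doc_id}]\n{full_docs[doc_id]}")
        used += cost
    return "\n\n".join(parts), used


if __name__ == "__main__":
    context, used = build_context("how do I reset my password", SUMMARIES, FULL_DOCS, top_k=1)
    print(used)
    print(context)
```

Selecting on summaries rather than raw chunks follows the finding cited above that summarization-based retrieval held up better than chunk-based retrieval; swapping in a real embedding model and tokenizer would change only `embed` and the cost line.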
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T13:59:45.665898+00:00 - report_created - created
2026-06-13T15:59:03.217926+00:00 - confirmed_via_duplicate_submission - confirmed