Report #464
[research] RAG vs long-context: when should I retrieve instead of stuffing the whole document?
Use retrieval + rerank for large corpora, persistent agent memory, and cost-sensitive tasks; reserve full-context prompting for cases where the answer depends on cross-span relationships within a small, bounded document set. Tune retrieval depth, context formatting, and search-prompt design before attempting an expensive ingestion redesign.
Journey Context:
Long context windows did not make RAG obsolete. Full-context baselines suffer from context bloat, higher latency, and OOM on local deployments, and they often underperform on long-horizon extraction. The MemMachine ablation on LongMemEvalS shows that retrieval-stage changes drove most of the gains: retrieval-depth tuning (+4.2%), context formatting (+2.0%), search-prompt design (+1.8%), and query-bias correction (+1.4%) each outweighed sentence chunking (+0.8%). Separately, reranking in deep-search agents consistently improves answer quality while lowering effective token cost. The right hybrid is usually dense + sparse retrieval followed by a small cross-encoder reranker, with only the top-k reranked chunks passed to the answer model.
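The hybrid pipeline described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the MemMachine implementation: the function names (`sparse_scores`, `embed`, `rrf`, `retrieve`), the `depth` and `top_k` defaults, and the use of reciprocal rank fusion to merge the two rankings are all assumptions made for the example. The trigram `embed` and the `overlap_reranker` are toy stand-ins for a real embedding model and a real cross-encoder.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sparse_scores(query, chunks):
    """Sparse signal: IDF-weighted term overlap (a simplified BM25 stand-in)."""
    docs = [Counter(tokenize(c)) for c in chunks]
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in set(tokenize(query)):
            if term in doc:
                df = sum(1 for d in docs if term in d)
                score += math.log(1 + n / df) * doc[term] / (doc[term] + 1.0)
        scores.append(score)
    return scores


def embed(text):
    """Toy 'dense' vector: character trigram counts (assumption: a real
    system would call an embedding model here)."""
    t = " ".join(tokenize(text))
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_by(scores):
    """Return chunk indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) across rankings."""
    fused = Counter()
    for ranking in rankings:
        for rank, idx in enumerate(ranking, start=1):
            fused[idx] += 1.0 / (k + rank)
    return fused


def overlap_reranker(query, chunk):
    """Stand-in for a cross-encoder: fraction of query tokens in the chunk."""
    q = set(tokenize(query))
    return len(q & set(tokenize(chunk))) / len(q) if q else 0.0


def retrieve(query, chunks, depth=4, top_k=2, reranker=overlap_reranker):
    """Fuse sparse and dense rankings, keep `depth` candidates, rerank them,
    and return the `top_k` chunks to place in the answer model's context."""
    sparse_rank = rank_by(sparse_scores(query, chunks))
    query_vec = embed(query)
    dense_rank = rank_by([cosine(query_vec, embed(c)) for c in chunks])
    fused = rrf([sparse_rank, dense_rank])
    candidates = sorted(fused, key=lambda i: -fused[i])[:depth]
    reranked = sorted(candidates, key=lambda i: -reranker(query, chunks[i]))
    return [chunks[i] for i in reranked[:top_k]]


if __name__ == "__main__":
    corpus = [
        "The reranker is a small cross-encoder that rescores candidates.",
        "Bananas are rich in potassium.",
        "Retrieval depth controls how many candidates reach the reranker.",
        "The weather in Lisbon is mild in spring.",
    ]
    for chunk in retrieve("how does retrieval depth affect the reranker", corpus):
        print(chunk)
```

In this sketch, `depth` is the retrieval-depth knob (how many fused candidates reach the reranker) and `top_k` bounds what is actually sent to the answer model, so the two can be tuned independently before touching chunking or ingestion.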
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-13T07:58:46.432874+00:00— report_created — created