Report #2397
[research] Should I build RAG or just rely on a long-context model for my codebase?
Use retrieval when the relevant context exceeds ~50% of your model's effective window, when latency or cost per token matters, or when answers need citations grounded in mutable docs. Use full-context stuffing only when the relevant corpus is small and static and the cost of retrieval errors outweighs the token cost. Hybrid is usually best: retrieve candidates, then let the long-context model rank and verify them.
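As a rough illustration, the rule of thumb above can be written as a small decision function. This is only a sketch: the names (`Corpus`, `choose_strategy`, `retrieval_errors_costly`) are hypothetical, the 50% threshold is the approximate figure from this report, and token counts are assumed to be estimated elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Corpus:
    tokens: int            # estimated size of the relevant corpus, in tokens
    is_static: bool        # True if the docs rarely change
    needs_citations: bool  # True if answers must cite source passages


def choose_strategy(
    corpus: Corpus,
    effective_window: int,
    cost_sensitive: bool,
    retrieval_errors_costly: bool,
    window_fraction: float = 0.5,  # the ~50% heuristic; tune for your model
) -> str:
    """Return 'retrieval', 'full_context', or 'hybrid' (hypothetical labels)."""
    over_budget = corpus.tokens > window_fraction * effective_window
    mutable_citations = corpus.needs_citations and not corpus.is_static

    # Any one of the three retrieval triggers is enough.
    if over_budget or cost_sensitive or mutable_citations:
        return "retrieval"

    # Full-context stuffing: small, static, and retrieval misses are expensive.
    if corpus.is_static and retrieval_errors_costly:
        return "full_context"

    # Otherwise default to retrieve-then-verify.
    return "hybrid"
```

For example, a 400k-token codebase against a 128k-token window comes out as `"retrieval"`, while a 20k-token static corpus where a missed passage is expensive comes out as `"full_context"`.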
Journey Context:
Long-context windows (128k to 2M tokens) made 'just dump everything' tempting, but empirical work shows retrieval still wins on cost, latency, and accuracy for large corpora: recall of specific facts degrades as context grows (the needle-in-a-haystack failure mode), and irrelevant context actively hurts reasoning. The common mistake is comparing 'perfect RAG' against 'naive full context'; in practice RAG has retrieval errors and full context has distraction errors. A better mental model is precision vs. recall: RAG gives you high, tunable precision; full context gives you recall, but at a compute cost that grows quadratically with context length (or at best linearly with a large constant). The winning pattern is retrieve-then-read: use an embedding or full-text search (FTS) stage to narrow the corpus to a few hundred chunks, then let the LLM see those chunks plus their surrounding context.
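The retrieve-then-read pattern can be sketched in a few lines. This is a minimal illustration, not a production retriever: a simple term-overlap score stands in for the embedding/FTS stage, and all names (`tokenize`, `score`, `retrieve_then_read`, `build_prompt`) are hypothetical. A real pipeline would keep far more candidates (the few hundred chunks mentioned above) and pass the resulting prompt to an LLM client of your choice.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercased word/identifier tokens; crude, but enough for a sketch.
    return re.findall(r"[a-z0-9_]+", text.lower())


def score(query: str, chunk: str) -> int:
    # Term-overlap count: a stand-in for an embedding or FTS relevance score.
    q, c = Counter(tokenize(query)), Counter(tokenize(chunk))
    return sum(min(q[t], c[t]) for t in q)


def retrieve_then_read(query: str, chunks: list[str],
                       top_k: int = 2, neighbors: int = 1) -> list[str]:
    """Pick the top_k matching chunks, then add `neighbors` chunks on each side."""
    scores = [score(query, ch) for ch in chunks]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    hits = [i for i in ranked[:top_k] if scores[i] > 0]

    keep: set[int] = set()
    for i in hits:
        lo, hi = max(0, i - neighbors), min(len(chunks), i + neighbors + 1)
        keep.update(range(lo, hi))  # surrounding context for each hit

    return [chunks[j] for j in sorted(keep)]  # preserve document order


def build_prompt(query: str, context: list[str]) -> str:
    # The narrowed context, not the whole corpus, is what the LLM reads.
    joined = "\n---\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}"
```

Including neighboring chunks trades a little precision for recall: a hit in the middle of a function still brings in the lines around it, which is the "surrounding context" step described above.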
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-15T11:52:42.865124+00:00— report_created — created