Report #93933
[counterintuitive] Why can't I just put my entire codebase or document collection into the context window now that models support 128k+ tokens?
Use RAG to select what goes into the prompt, and use the long context window for the working context around the retrieved results. Stuffing everything into context degrades answer quality, increases cost and latency, and makes the model more likely to miss information in the middle of the context.
Journey Context:
With models advertising 128k, 200k, or even 1M+ token context windows, a widespread belief emerged that RAG is obsolete: just put everything in context. This fails for four reasons. First, the lost-in-the-middle effect means information placed in the middle of a long context is significantly less likely to be used than information near the beginning or end. Second, attention computation scales quadratically with sequence length (or near-quadratically even with optimizations), making long contexts expensive and slow. Third, more context means more distraction: the model must pick out the relevant information from a larger pool, which increases hallucination risk. Fourth, cost scales linearly with input tokens, so sending the whole corpus on every request can cost orders of magnitude more than sending a few retrieved chunks. RAG addresses all four: it retrieves only the relevant chunks, which can be placed near the start or end of the context where they are most likely to be used, keeps the context short and focused, reduces cost, and provides source attribution. Long context is valuable for the task context (the instructions, retrieved results, and conversation), not as a replacement for targeted retrieval.
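Two of the points above, linear input cost and the lost-in-the-middle effect, can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the token counts, and the price per million tokens are all hypothetical, and real pricing and tokenization depend on the provider. The ordering helper assumes the chunks arrive already sorted from most to least relevant and places the strongest ones at the edges of the prompt, leaving the weakest in the middle.

```python
from typing import List


def estimate_input_cost(num_tokens: int, price_per_million_tokens: float) -> float:
    """Estimate input cost, assuming price is linear in input tokens.

    The price is a placeholder supplied by the caller, not a real rate.
    """
    return num_tokens / 1_000_000 * price_per_million_tokens


def order_for_edges(chunks_by_relevance: List[str]) -> List[str]:
    """Reorder chunks so the most relevant sit at the start and end.

    Input must be sorted most-relevant first. Even-ranked chunks fill the
    front in order; odd-ranked chunks fill the back in reverse, so the
    least relevant chunks end up in the middle of the prompt.
    """
    front: List[str] = []
    back: List[str] = []
    for rank, chunk in enumerate(chunks_by_relevance):
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


if __name__ == "__main__":
    # Hypothetical numbers: a 200k-token corpus versus 4k tokens of
    # retrieved chunks, at an assumed 3.00 per million input tokens.
    full_context_tokens = 200_000
    rag_tokens = 4_000
    price = 3.00

    print("full context cost:", estimate_input_cost(full_context_tokens, price))
    print("RAG cost:", estimate_input_cost(rag_tokens, price))
    print("token ratio:", full_context_tokens / rag_tokens)

    # Chunks ranked by a retriever, most relevant first (labels are made up).
    ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
    print(order_for_edges(ranked))
```

With five ranked chunks, the helper returns the order rank1, rank3, rank5, rank4, rank2, so the two best chunks sit at the two ends. Whether edge placement helps for a given model is worth measuring rather than assuming.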
Lifecycle
2026-06-22T16:15:12.632572+00:00— report_created — created