Report #68905
[counterintuitive] With a 128K+ context window I can just dump all documents in and skip RAG
Use RAG even with large context windows. A larger window adds capacity, but it doesn't solve the lost-in-the-middle attention problem, it raises latency and per-request cost (every extra token is billed and processed), and it spreads attention across more tokens. RAG with a concise context typically outperforms dumping the full document set into the prompt.
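The dilution claim can be illustrated with a toy calculation. This is only a sketch of a single softmax over one query, not a model of how a real multi-layer, multi-head transformer behaves. The function name `relevant_token_weight` and the `logit_gap` value are assumptions chosen for illustration: one relevant token scores `logit_gap` above a set of distractor tokens that all score 0.

```python
import math

def relevant_token_weight(n_distractors: int, logit_gap: float = 5.0) -> float:
    """Softmax weight on one relevant token whose logit is `logit_gap`
    above `n_distractors` distractor tokens that all have logit 0.

    Toy model (assumption): one attention head, one query, uniform distractors.
    """
    boost = math.exp(logit_gap)          # exp(logit) of the relevant token
    return boost / (boost + n_distractors)  # each distractor contributes exp(0) = 1

if __name__ == "__main__":
    # The share of attention on the relevant token shrinks as the context grows.
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} distractors -> weight {relevant_token_weight(n):.4f}")
```

Under these assumptions the weight on the relevant token falls monotonically as distractors are added, which is the intuition behind keeping the context small and relevant.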
Journey Context:
Developers often assume that context window size was the only bottleneck for document QA, and that 128K+ windows eliminate the need for retrieval. But the fundamental limitation is attention quality, not capacity. A model with a 128K context still distributes a finite attention budget across all tokens: in standard softmax attention, the weights for each query sum to 1. Dumping 100K tokens means each relevant token receives a smaller share, and key information is more likely to land in the attention dead zone (the middle of the context). Longer contexts also increase inference latency and cost, linearly or super-linearly in the number of tokens. RAG addresses this by pre-filtering to only the relevant passages, which concentrates the model's attention on what matters. The increase in context window size was necessary but not sufficient; retrieval remains essential for quality, efficiency, and cost. The right architecture is RAG for selection, plus a context large enough to hold the selected passages.
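The "RAG for selection plus sufficient context" pattern can be sketched as follows. This is a minimal illustration that uses only the standard library, not a production retriever. The bag-of-words cosine scoring stands in for whatever embedding model or search index a real system would use, and the names `select_passages`, `build_prompt`, `top_k`, and `budget_words` are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lowercase alphanumeric tokens; a stand-in for a real tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_passages(query: str, passages: list[str], top_k: int = 3,
                    budget_words: int = 200) -> list[str]:
    """Return up to `top_k` passages most similar to the query,
    skipping zero-score passages and any that would exceed the word budget."""
    q = Counter(tokenize(query))
    scored = sorted(
        ((cosine(q, Counter(tokenize(p))), i) for i, p in enumerate(passages)),
        key=lambda s: (-s[0], s[1]),  # best score first, ties by original order
    )
    selected, used = [], 0
    for score, i in scored[:top_k]:
        n_words = len(passages[i].split())
        if score == 0.0 or used + n_words > budget_words:
            continue
        selected.append(passages[i])
        used += n_words
    return selected

def build_prompt(query: str, passages: list[str], **kwargs) -> str:
    # Only the selected passages reach the model, not the full document set.
    context = "\n\n".join(select_passages(query, passages, **kwargs))
    return f"Context:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    docs = [
        "The refund policy allows returns within 30 days.",
        "Our office is closed on public holidays.",
        "Shipping takes five business days.",
    ]
    print(build_prompt("What is the refund policy?", docs, top_k=1))
```

The word budget is a rough proxy for a token budget. In a real pipeline, the scoring step and the budget unit would be replaced by the retriever and tokenizer that match the target model.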
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:08:22.642803+00:00— report_created — created