Report #87857
[counterintuitive] A model with a 128k-token context window can't reliably use all 128k tokens
Design for an effective, reliable context of roughly 30-60% of the stated maximum; use RAG with targeted retrieval instead of stuffing entire documents into the prompt; test retrieval accuracy at the context lengths you will actually use before deploying
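The last workaround can be sketched as a needle-in-a-haystack sweep. The idea is to place a known fact at several depths in prompts of several lengths and check whether the model returns it. This is a minimal sketch under stated assumptions. `ask` is a hypothetical callable that wraps your own model client. Words stand in for tokens, so real lengths should come from your model's tokenizer. The filler, needle, and length fractions are placeholders. `make_edge_only_model` is a stub that only "sees" the first and last `window` words, included so the harness runs without a model.

```python
from typing import Callable, Dict, List, Tuple

# Placeholder needle and filler (assumption): a real test should use text that
# resembles your production documents, since uniform filler is unrealistically easy.
FILLER_WORD = "lorem"
NEEDLE = "The vault access code is 4817."
QUESTION = "What is the vault access code?"
ANSWER = "4817"

Ask = Callable[[List[str], str], str]  # hypothetical: (prompt words, question) -> reply text


def build_prompt(context_words: int, depth: float) -> List[str]:
    """Filler of `context_words` words with the needle inserted at `depth`
    (0.0 = start, 1.0 = end). Words approximate tokens here."""
    needle = NEEDLE.split()
    n_filler = max(context_words - len(needle), 0)
    pos = round(depth * n_filler)
    filler = [FILLER_WORD] * n_filler
    return filler[:pos] + needle + filler[pos:]


def sweep(ask: Ask, max_context: int,
          fractions=(0.25, 0.5, 0.75, 1.0),
          depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> Dict[Tuple[int, float], bool]:
    """Map (context length, needle depth) -> whether the answer appeared in the reply."""
    results = {}
    for frac in fractions:
        length = int(max_context * frac)
        for depth in depths:
            reply = ask(build_prompt(length, depth), QUESTION)
            results[(length, depth)] = ANSWER in reply
    return results


def accuracy_by_length(results: Dict[Tuple[int, float], bool]) -> Dict[int, float]:
    """Fraction of depths at which the needle was found, per context length."""
    by_len: Dict[int, List[bool]] = {}
    for (length, _depth), ok in results.items():
        by_len.setdefault(length, []).append(ok)
    return {length: sum(oks) / len(oks) for length, oks in by_len.items()}


def make_edge_only_model(window: int) -> Ask:
    """Stub standing in for a real model: it can only 'see' the first and last
    `window` words, a crude caricature of lost-in-the-middle behavior."""
    def ask(prompt: List[str], question: str) -> str:
        return " ".join(prompt[:window] + prompt[-window:])
    return ask


if __name__ == "__main__":
    results = sweep(make_edge_only_model(window=200), max_context=1000)
    for length, acc in sorted(accuracy_by_length(results).items()):
        print(f"{length:>5} words: {acc:.0%} of depths retrieved")
```

With the stub, accuracy drops at the longer lengths because needles in the middle go unseen. With a real model, a reasonable policy is to pick the largest length whose per-depth accuracy meets your bar, treating the 30-60% figure as a starting hypothesis rather than a rule.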
Journey Context:
The common belief is that a model's stated context window equals its effective working memory. In practice, performance degrades well before the context limit is reached. Combined with the lost-in-the-middle problem, this means the reliably usable context is only a fraction of the maximum: a 128k-context model might reliably retrieve information from the first and last ~30k tokens but miss it in the middle 60k+. This gap between stated and effective context is not a bug that will simply be patched; it reflects a fundamental attention-dilution effect as sequence length grows. Because softmax attention weights over all positions sum to 1, each token's attention is spread across more tokens as the context grows, reducing the signal-to-noise ratio for any specific fact. RAG with small, well-chosen chunks consistently outperforms stuffing the full context window.
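The attention-dilution argument can be made concrete with a toy calculation. Assume, as a deliberate simplification, a single softmax attention head in which one relevant token's logit exceeds each of the n - 1 distractor logits by a fixed gap. The weight on the relevant token is then e^gap / (e^gap + n - 1), which shrinks roughly as 1/n once n is much larger than e^gap. The gap of 8.0 and the token counts below are arbitrary illustrative values. Real models have many heads and layers, positional effects, and non-uniform distractors, so this shows the direction of the effect, not its size. Comparing a few thousand retrieved tokens with a full 120k-token prompt also shows why small retrieved chunks help: the same gap competes against far fewer distractors.

```python
import math


def relevant_weight(n_tokens: int, logit_gap: float) -> float:
    """Softmax weight on one relevant token whose attention logit exceeds each of
    the other n_tokens - 1 (equal) distractor logits by `logit_gap`."""
    boost = math.exp(logit_gap)
    return boost / (boost + (n_tokens - 1))


# Illustrative sizes (assumptions): a few retrieved chunks vs. a stuffed 120k prompt.
for n in (4_000, 30_000, 120_000):
    w = relevant_weight(n, logit_gap=8.0)
    print(f"{n:>7} tokens in context -> weight on the relevant token: {w:.3f}")
```

Under these assumptions the weight on the relevant token falls from about 0.43 at 4k tokens to about 0.02 at 120k, even though the relevant token is scored exactly as strongly in both cases.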
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:03:05.029175+00:00 - report_created - created