Report #68057
[counterintuitive] Why does the model miss information placed in the middle of a long context, even with a 128k+ context window?
Place critical information at the beginning or end of the context. For retrieval-heavy tasks, use RAG to keep contexts short rather than dumping everything into the context window. Never assume uniform attention across the full context length.
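A minimal sketch of the placement advice, assuming retrieved chunks are already ranked most-relevant-first. The names `edge_order` and `build_prompt` and the `max_chunks` parameter are hypothetical, introduced only for illustration. The sketch caps the number of chunks, puts the strongest chunks at the two ends of the context, and states the question both before and after the chunks.

```python
from typing import List


def edge_order(ranked: List[str]) -> List[str]:
    """Reorder chunks (most relevant first) so the best ones sit at the edges.

    Even-ranked items fill the front in order; odd-ranked items fill the
    back in reverse, so the least relevant chunks end up in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


def build_prompt(question: str, ranked_chunks: List[str], max_chunks: int = 5) -> str:
    """Assemble a short prompt with the question at both ends.

    max_chunks is an assumed budget; tune it for the model and task.
    """
    ordered = edge_order(ranked_chunks[:max_chunks])
    parts = [f"Question: {question}", ""]
    parts += [f"[{i + 1}] {chunk}" for i, chunk in enumerate(ordered)]
    parts += ["", f"Question (repeated): {question}"]
    return "\n".join(parts)


if __name__ == "__main__":
    chunks = ["most relevant", "second", "third", "fourth", "least relevant"]
    print(build_prompt("What is the refund window?", chunks))
```

Whether this ordering helps depends on the model; treat it as a starting point to measure, not a guaranteed fix.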
Journey Context:
The assumption is that a model with a 128k context window can use all 128k tokens equally well, that is, that context window size equals usable context. Research on long-context retrieval (the "lost in the middle" effect) shows a U-shaped performance curve: models use information at the beginning and end of the context most reliably, and accuracy drops noticeably for information in the middle. A larger context window does not by itself solve this, and filling it can make things worse by pushing more information into the low-attention middle region. Rewording the prompt generally does not fix the problem, because it appears to stem from how attention is distributed in practice over long sequences rather than from the wording. The practical implication is counterintuitive: a 10k context with well-placed information can outperform a 100k context that buries the same information in the middle.
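One way to check whether a given model shows this position sensitivity is a small position sweep: insert a known fact at several depths in filler text and see where retrieval fails. The sketch below is illustrative only. `ask_model` is a hypothetical callable (prompt string in, answer string out) that you would wire to your own model client, and `insert_needle` and `position_sweep` are names made up for this example.

```python
from typing import Callable, Dict, List, Sequence


def insert_needle(filler: List[str], needle: str, depth: float) -> str:
    """Place `needle` among the filler lines at a relative depth in [0.0, 1.0].

    depth=0.0 puts it first, depth=1.0 puts it last.
    """
    idx = round(depth * len(filler))
    return "\n".join(filler[:idx] + [needle] + filler[idx:])


def position_sweep(
    ask_model: Callable[[str], str],  # hypothetical: prompt -> answer text
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    depths: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> Dict[float, bool]:
    """Return, per depth, whether the expected string appears in the answer."""
    results = {}
    for depth in depths:
        context = insert_needle(filler, needle, depth)
        answer = ask_model(f"{context}\n\n{question}")
        results[depth] = expected.lower() in answer.lower()
    return results
```

A single pass per depth is noisy; repeating the sweep with different needles and filler lengths gives a more trustworthy picture before changing how prompts are assembled.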
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:42:58.603250+00:00— report_created — created