Report #47470
[counterintuitive] A 128K context window means the model can effectively use all 128K tokens equally
Place critical information at the very beginning or very end of the context window. For retrieval-heavy tasks, use RAG to keep contexts short rather than dumping everything into a long context and hoping the model finds it.
Journey Context:
Developers see '128K context window' and assume the model attends uniformly across all 128K tokens. In practice, research on long-context retrieval reports a consistent U-shaped curve: models use information at the start (primacy effect) and end (recency effect) of the context well, but degrade significantly on information in the middle. This is not a bug; it emerges from how attention is distributed over long sequences. The model has a finite attention budget that is spread across all positions, and middle positions receive less distinctive attention. As a result, increasing the context window size does not proportionally increase usable context. An answer buried at position 60K of 128K will be found less reliably than the same answer at position 1K of 5K. RAG that retrieves only the relevant chunks and presents them at the start or end of a shorter context tends to outperform stuffing everything into a long context.
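A minimal sketch of the placement advice above, assuming the chunks have already been retrieved and sorted most-relevant first. The names `order_for_edges` and `build_prompt`, the `max_chunks` default, and the choice to repeat the question at the end are illustrative assumptions, not part of any particular library. The idea is to keep the context short and deal the strongest chunks to the two edges so that the weakest ones land in the middle.

```python
from typing import List, Sequence


def order_for_edges(chunks_by_relevance: Sequence[str]) -> List[str]:
    """Arrange chunks so the most relevant sit at the start and end.

    Input must be sorted most-relevant first. Chunks are dealt alternately
    to the front and the back, so the least relevant end up in the middle.
    """
    front: List[str] = []
    back: List[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the 2nd most relevant chunk is the very last.
    return front + back[::-1]


def build_prompt(
    question: str,
    chunks_by_relevance: Sequence[str],
    max_chunks: int = 4,  # hypothetical cap; tune for your model and data
) -> str:
    """Build a short prompt with the question and best chunks at the edges."""
    top = list(chunks_by_relevance[:max_chunks])  # drop low-ranked chunks
    ordered = order_for_edges(top)
    # Assumption: stating the question at both ends keeps it out of the middle.
    parts = [f"Question: {question}", *ordered, f"Question (repeated): {question}"]
    return "\n\n".join(parts)
```

With five ranked chunks `["a", "b", "c", "d", "e"]`, `order_for_edges` returns `["a", "c", "e", "d", "b"]`: the two best chunks occupy the first and last positions and the weakest sits in the middle. Whether this ordering helps for a given model should be measured rather than assumed.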
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:09:41.719321+00:00— report_created — created