Report #30242
[cost_intel] Why does model quality drop sharply on long inputs even within the context window limit?
Models show a quality cliff as the context window approaches saturation, and attention to the middle portion of a long context degrades significantly (the "lost in the middle" phenomenon). Keep effective context usage below 70-80% of the maximum for reliable quality, and place critical information in the first or last 10% of the context. For a 200K-token model, that means targeting under 150K tokens of actual content (75% of the window).
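The arithmetic behind these targets can be sketched in a few lines. This is a minimal illustration, not a measured threshold: the function names `content_budget` and `edge_zones` are hypothetical, and the 75% and 10% defaults are taken from the guidance above.

```python
def content_budget(window_tokens: int, utilization_percent: int = 75) -> int:
    """Return the content-token target for a given context window.

    utilization_percent is an assumption taken from the 70-80% guidance;
    tune it for your own model and workload.
    """
    if not 0 < utilization_percent <= 100:
        raise ValueError("utilization_percent must be in 1..100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return window_tokens * utilization_percent // 100


def edge_zones(total_tokens: int, edge_percent: int = 10) -> tuple[int, int]:
    """Return (end_of_head_zone, start_of_tail_zone) as token offsets.

    Critical information should sit before the first offset or after the second.
    """
    edge = total_tokens * edge_percent // 100
    return edge, total_tokens - edge


budget = content_budget(200_000)   # 150000 tokens of actual content
head_end, tail_start = edge_zones(budget)
print(budget, head_end, tail_start)  # 150000 15000 135000
```

With a 150K-token payload, the "first or last 10%" rule translates to roughly the first 15K and the last 15K tokens.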
Journey Context:
The advertised context window (200K for Claude, 128K for GPT-4 Turbo) is a hard technical limit, not a quality guarantee. Research on the "lost in the middle" effect shows that model attention degrades on content in the middle of long contexts: information placed at roughly position 50K-150K in a 200K window is significantly less likely to be correctly retrieved and used. This has direct cost implications. Paying for 180K input tokens while getting quality equivalent to 60K tokens of effectively attended context is a poor return on spend. The cost-quality curve is not linear; it has a cliff. The fix has two parts: (1) keep total context under 70-80% of the window limit, and (2) structure the context so that the most important information sits at the beginning (system prompt, key instructions) and at the end (the actual query, recent context). The middle is where reference material goes: things the model might need but does not have to recall perfectly. This is especially critical for coding agents that stuff entire repositories into context. The file the agent needs to edit should be at the end, not buried in the middle of a 150K-token dump.
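The two-part fix can be sketched as a small context assembler. Everything here is illustrative: `Block`, `estimate_tokens`, and `assemble_context` are hypothetical names, and the 4-characters-per-token estimate is a crude assumption that should be replaced with the tokenizer for your model.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    text: str
    priority: int = 0  # higher = more important; only used to trim references


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def assemble_context(instructions, references, target, query, budget_tokens):
    """Order blocks head -> middle -> tail and drop low-priority references to fit."""
    used = sum(estimate_tokens(b.text) for b in (instructions, target, query))
    if used > budget_tokens:
        raise ValueError("critical blocks alone exceed the budget")

    kept = set()
    # Admit references from most to least important while the budget allows.
    for ref in sorted(references, key=lambda b: b.priority, reverse=True):
        cost = estimate_tokens(ref.text)
        if used + cost <= budget_tokens:
            kept.add(id(ref))
            used += cost

    # Surviving references keep their original order and go in the middle.
    middle = [b for b in references if id(b) in kept]
    # Instructions at the start; the file to edit and the query at the end.
    return [instructions, *middle, target, query]


layout = assemble_context(
    instructions=Block("system", "x" * 40),
    references=[
        Block("a", "a" * 400, priority=1),
        Block("b", "b" * 400, priority=5),
        Block("c", "c" * 40, priority=3),
    ],
    target=Block("target_file", "t" * 80),
    query=Block("query", "q" * 20),
    budget_tokens=150,
)
print([b.name for b in layout])
```

In this toy run the lowest-priority reference (`a`) is dropped to stay within budget, while the instructions stay at the head and the target file and query stay at the tail.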
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T05:08:57.127812+00:00— report_created — created