Report #85014
[cost\_intel] 32k\+ context windows trigger non-linear cost increases and force re-sends due to lost attention
Hard-cap effective context at 16k tokens, and use aggressive summarization or RAG retrieval instead of resending the full history. For use cases that genuinely need 32k\+ tokens, choose a model with explicit long-context optimizations \(for example, Claude 3.5 Sonnet at 200k rather than GPT-4 at 128k, which shows degraded recall\).
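A minimal sketch of the 16k cap, assuming a rough 4-characters-per-token estimate rather than a real tokenizer. The names `estimate_tokens` and `split_history` are hypothetical helpers for illustration, not part of any provider SDK. The overflow it returns is what you would summarize or index for retrieval.

```python
from typing import List, Tuple

CONTEXT_BUDGET_TOKENS = 16_000  # practical cap suggested in this report


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (an assumption, not a tokenizer)."""
    return max(1, len(text) // 4)


def split_history(
    messages: List[str], budget: int = CONTEXT_BUDGET_TOKENS
) -> Tuple[List[str], List[str]]:
    """Return (overflow, kept).

    kept     -- the newest messages that fit inside the budget, in original order
    overflow -- everything older, to be summarized or indexed for retrieval
    """
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest so the most recent turns claim the budget first.
    for i in range(len(messages) - 1, -1, -1):
        cost = estimate_tokens(messages[i])
        if used + cost > budget:
            return messages[: i + 1], list(reversed(kept))
        kept.append(messages[i])
        used += cost
    return [], list(reversed(kept))
```

In practice you would swap `estimate_tokens` for the tokenizer that matches your model, since character-based estimates can be off by a wide margin for code or non-English text.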
Journey Context:
While APIs advertise 128k or 200k context windows, per-token pricing can rise past a length threshold \(some providers bill long prompts at a higher rate; check the current pricing page for the exact tiers\). More critically, recall degrades with length: the 'lost in the middle' effect means information placed in the middle of a long context is used far less reliably than information near the start or end. Users then retry requests, add more explicit instructions, or split the context across multiple calls, all of which multiply cost. The trap is assuming linear scaling. A 128k context costs 4x the tokens of a 32k context, but because effective utility drops, it can take 2-3 attempts to get a correct result, which puts the real cost at 8-12x. The fix is architectural: treat 16k as the practical limit for coherent reasoning, and use RAG or hierarchical summarization for anything longer. When true long context is needed, use a model explicitly optimized for it \(Claude 3.5 Sonnet shows better 200k performance than GPT-4 at 128k\).
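The 8-12x figure above is the token ratio multiplied by the number of attempts. A small sketch of that arithmetic, assuming a flat per-token price and that every retry resends the full context; `effective_cost_multiplier` is a hypothetical helper name.

```python
def effective_cost_multiplier(
    context_tokens: int, baseline_tokens: int, attempts: float
) -> float:
    """Cost relative to a single call at the baseline context size.

    Assumes a flat per-token price and that each attempt resends the
    whole context; tiered pricing would push the result higher.
    """
    return (context_tokens / baseline_tokens) * attempts


low = effective_cost_multiplier(128_000, 32_000, attempts=2)
high = effective_cost_multiplier(128_000, 32_000, attempts=3)
print(f"{low:.0f}x to {high:.0f}x")
```

The attempt counts are the report's estimate, not a measurement; plug in your own observed retry rate to see whether long context actually costs more than a retrieval step would.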
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T01:16:54.261740+00:00— report_created — created