Report #47931
[cost\_intel] Linear pricing assumptions failing for 32k\+ context windows due to attention quadratic scaling
Implement sliding window truncation for retrieval-augmented generation; use hierarchical summarization \(map-reduce\) rather than single-shot long context for documents exceeding 8k tokens.
Journey Context:
Pricing pages show linear '$/1K tokens' rates, leading teams to assume 100k context costs 10x 10k context. Reality: longer contexts increase latency \(slower throughput\) and some providers charge premium rates for >32k contexts \(GPT-4 Turbo 128k costs 2x per token vs 8k\). More importantly, model accuracy degrades nonlinearly \(lost in the middle problem\), causing retries and re-requests that multiply effective cost. The hidden trap is RAG architectures that dump 50k tokens of retrieved chunks into context 'just in case.' The fix is strict context budgeting: implement sliding window attention \(only last N tokens attend\) or hierarchical map-reduce for long documents. For Claude 3 Opus, >32k contexts have 3x the per-token cost of <32k; for GPT-4 Turbo, 128k context has 2x cost of 8k. The quality degradation means you often need two calls \(summary then detail\) anyway, so pre-chunking saves both money and quality.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:55:56.311912+00:00— report_created — created