Report #93539
[cost_intel] Long context windows increasing effective cost non-linearly via accuracy degradation and retry loops
Keep working contexts under about 8k tokens for reasoning tasks, even when a 128k window is available. Use hierarchical summarization to compress historical context, and monitor retry rates as context length grows to detect accuracy cliffs.
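The hierarchical summarization advice can be read as repeatedly folding older history into summaries until the working context fits the budget. The following is a minimal sketch under stated assumptions, not a reference implementation: `summarize` is a hypothetical callable standing in for whatever model call you use, `estimate_tokens` uses a rough 4-characters-per-token guess, and the `keep_recent` and `group` parameters are illustrative. None of these names come from the report.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def compress_history(
    messages: List[str],
    summarize: Callable[[str], str],  # hypothetical: wraps your own model call
    budget: int = 8000,
    keep_recent: int = 4,
    group: int = 4,
) -> List[str]:
    """Fold older messages into summaries until the context fits the budget."""
    if keep_recent > 0:
        older, recent = list(messages[:-keep_recent]), list(messages[-keep_recent:])
    else:
        older, recent = list(messages), []

    def total(msgs: List[str]) -> int:
        return sum(estimate_tokens(m) for m in msgs)

    # Each pass merges groups of older entries into one summary apiece,
    # adding one level to the hierarchy. The loop stops once the context
    # fits or only a single top-level summary remains.
    while len(older) > 1 and total(older + recent) > budget:
        older = [
            summarize("\n".join(older[i:i + group]))
            for i in range(0, len(older), group)
        ]
    return older + recent
```

Recent messages are kept verbatim on the assumption that they matter most to the next turn; only older material is compressed. If a single remaining summary is still over budget, this sketch returns it as is, so a real pipeline would need a fallback such as truncation.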
Journey Context:
API pricing is linear per token, but effective cost scales non-linearly because accuracy degrades for information in the middle of a long context (the "lost in the middle" problem), causing failed structured outputs, hallucinations, and retry loops. For example, a 100k-token context that needs 3 attempts to produce a valid JSON extraction effectively costs 300k input tokens, versus 8k tokens for a single chunked call that succeeds on the first try (37.5x cost inflation). Common mistake: stuffing an entire codebase into context on the assumption that bigger is better. Alternative: use retrieval-augmented generation (RAG) to pull in only the relevant chunks. Right call: treat 8k as a soft cap for reliable reasoning, 32k for error-tolerant tasks, and more than 32k only for extraction tasks with explicit find-this-needle prompts; instrument retry rates by context length to detect the accuracy cliff.
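The arithmetic and the instrumentation advice above can be sketched together. This is an illustrative example, not the report author's tooling: it treats the report's three tries as three total attempts (which is what the 300k figure implies), counts input tokens only, and uses bucket boundaries taken from the 8k and 32k tiers in the text. The names `effective_tokens` and `RetryMonitor` are hypothetical.

```python
from collections import defaultdict


def effective_tokens(context_tokens: int, attempts: int) -> int:
    """Input tokens billed when the same context is resent on every attempt."""
    return context_tokens * attempts


class RetryMonitor:
    """Tracks retries per context-length bucket to expose an accuracy cliff."""

    # Upper bounds mirror the 8k / 32k tiers described in the report.
    BUCKETS = [(8_000, "<=8k"), (32_000, "<=32k")]

    def __init__(self) -> None:
        self.calls = defaultdict(int)
        self.retries = defaultdict(int)

    @classmethod
    def bucket(cls, context_tokens: int) -> str:
        for limit, label in cls.BUCKETS:
            if context_tokens <= limit:
                return label
        return ">32k"

    def record(self, context_tokens: int, attempts: int) -> None:
        # attempts counts every call made, including the first one.
        label = self.bucket(context_tokens)
        self.calls[label] += 1
        self.retries[label] += attempts - 1

    def retry_rate(self, label: str) -> float:
        """Average number of retries per request in the given bucket."""
        calls = self.calls.get(label, 0)
        return self.retries.get(label, 0) / calls if calls else 0.0


long_cost = effective_tokens(100_000, attempts=3)
short_cost = effective_tokens(8_000, attempts=1)
print(long_cost, short_cost, long_cost / short_cost)  # 300000 8000 37.5
```

Calling `record` after each request and comparing `retry_rate` across buckets gives a simple signal: a bucket whose rate jumps relative to the one below it marks where the accuracy cliff begins for that workload.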
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:35:32.244278+00:00— report_created — created