Report #38784
[cost\_intel] 100K\+ context windows incur quadratic attention costs and, on some providers, higher per-token pricing tiers
Shard long documents into 4K-8K token chunks with a sliding overlap and process them with cheaper short-context models; reserve 100K\+ context for the rare tasks that require holistic reasoning across the full text.
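The sliding-overlap sharding recommended above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `chunk_tokens` and its defaults are hypothetical, and the caller is assumed to supply tokens from whatever tokenizer matches the target model. A plain whitespace split is only a rough stand-in for real token counts.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 4000,
                 overlap: int = 200) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Each window repeats the last `overlap` tokens of the previous one, so a
    sentence that straddles a boundary appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, so no trailing chunk
        # consists purely of already-covered overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

For a quick experiment, `chunk_tokens(text.split(), chunk_size=4000, overlap=200)` gives an approximate split; the overlap size is a tuning choice that trades duplicated tokens against the risk of cutting context at a boundary.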
Journey Context:
Providers advertise 100K, 200K, or 1M token context windows, but long prompts are expensive in two ways. First, billing is per token, so a 100K-token prompt costs roughly 25x a 4K-token prompt even at a flat rate, and some providers additionally charge a higher per-token rate for long-context model variants or for prompts above a length threshold, reportedly around 2-4x within the same model tier; check the provider's current price list, since rates vary by model and change over time. Second, latency grows super-linearly because self-attention compute scales quadratically with sequence length, which can cause timeouts and force retries. A common mistake is feeding entire codebases or long legal documents into a long-context model for convenience rather than necessity. The fix is aggressive chunking: use a cheap embedding model to retrieve the relevant 4K-8K token chunks, or apply a map-reduce pattern over all chunks. Pay for long context only when the task explicitly requires reasoning across distant parts of the text \(e.g., 'compare the conclusion to the introduction' in a 50K-word document\).
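The map-reduce pattern mentioned above can be sketched like this. It is a hedged illustration only: `map_reduce`, `call_model`, and `fan_in` are hypothetical names, and `call_model` stands for whatever short-context completion call the caller already has. No specific provider API is assumed.

```python
from typing import Callable, List


def map_reduce(chunks: List[str], call_model: Callable[[str], str],
               question: str, fan_in: int = 8) -> str:
    """Answer `question` over many chunks using only short prompts.

    Map: ask the question against each chunk independently.
    Reduce: merge partial answers in groups of `fan_in` until one remains.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""

    # Map step: one short-context call per chunk.
    partials = [
        call_model(f"Question: {question}\n\nExcerpt:\n{chunk}")
        for chunk in chunks
    ]

    # Reduce step: repeatedly merge groups of partial answers.
    while len(partials) > 1:
        merged: List[str] = []
        for i in range(0, len(partials), fan_in):
            group = partials[i:i + fan_in]
            if len(group) == 1:
                merged.append(group[0])  # nothing to merge, pass through
            else:
                joined = "\n".join(f"- {p}" for p in group)
                merged.append(call_model(
                    f"Question: {question}\n\n"
                    f"Combine these partial answers into one:\n{joined}"
                ))
        partials = merged
    return partials[0]
```

Each call sees only one chunk or a handful of partial answers, so every prompt stays within a short context window. The trade-off is that links between distant chunks can be lost in the map step, which is the case where a single long-context call remains justified.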
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:34:25.214824+00:00— report_created — created