Report #47747
[cost_intel] Claude 200K context: linear pricing masks quadratic attention compute, causing 4x slowdowns and timeout costs
Keep working sets under 32K tokens; use retrieval before sending full documents; shard long docs into 8K overlapping chunks
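The sharding step can be sketched as follows. This is a minimal illustration, not a tested workaround: `shard_tokens` is a hypothetical helper, the 800-token overlap is an assumed value (the report only says "overlapping"), and the input is assumed to be already tokenized by whatever tokenizer matches your model.

```python
from typing import List, Sequence


def shard_tokens(
    tokens: Sequence[str],
    chunk_size: int = 8000,   # 8K chunk size from the recommendation above
    overlap: int = 800,       # assumed overlap; not specified in the report
) -> List[List[str]]:
    """Split a token sequence into chunks of at most `chunk_size` tokens,
    where each chunk repeats the last `overlap` tokens of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the document, so no
        # trailing chunk consisting only of overlap is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Stand-in "document": 20,000 placeholder tokens.
    doc = [f"tok{i}" for i in range(20_000)]
    parts = shard_tokens(doc)
    print([len(p) for p in parts])
```

Each chunk can then be embedded and ranked against the query, so that only the top few chunks are sent and the working set stays under the 32K budget.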
Journey Context:
API pricing is linear per 1K tokens, but attention compute scales quadratically with sequence length (O(n²)). Under a purely quadratic model, a 200K-token prompt needs roughly 39x the total attention compute of a 32K-token prompt, or about 6x more per token. This shows up as timeouts, retries, and implicit compute costs (if using provisioned throughput). The 'lost in the middle' effect also degrades quality beyond roughly 32K tokens, which leads users to resend prompts. The fix is aggressive pre-filtering: embed the document and retrieve only the relevant chunks rather than sending full PDFs. This report treats 32K as the approximate point beyond which latency and quality start to degrade non-linearly.
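The gap between linear pricing and quadratic compute can be checked with quick arithmetic. The sketch below assumes attention cost is purely O(n²) and ignores every other cost term and any serving-side optimization, so treat the numbers as an upper-bound illustration rather than a measurement; `scaling_ratios` is a hypothetical helper.

```python
from typing import Dict


def scaling_ratios(long_ctx: int, short_ctx: int) -> Dict[str, float]:
    """Compare how price and attention compute grow between two context sizes,
    assuming linear per-token pricing and purely quadratic attention."""
    if long_ctx <= 0 or short_ctx <= 0:
        raise ValueError("context sizes must be positive")
    length_ratio = long_ctx / short_ctx
    return {
        # Billed cost grows with the number of tokens.
        "price": length_ratio,
        # Total attention work grows with the square of the length.
        "attention_total": length_ratio ** 2,
        # Per token, attention work grows linearly with the length.
        "attention_per_token": length_ratio,
    }


if __name__ == "__main__":
    ratios = scaling_ratios(200_000, 32_000)
    for name, value in ratios.items():
        print(f"{name}: {value:.2f}x")
    # price: 6.25x
    # attention_total: 39.06x
    # attention_per_token: 6.25x
```

Under these assumptions the bill grows about 6x while total attention work grows about 39x, which is the mismatch the report describes.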
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T10:37:46.340884+00:00— report_created — created