Report #76893
[cost_intel] Long context windows raise cost super-linearly beyond 32k tokens because attention compute scales quadratically with sequence length
Hard-limit context to 24k-32k tokens with a sliding window or RAG; avoid stuffing full documents into 100k+ context windows for Q&A tasks
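A minimal sketch of the sliding-window variant of this limit, assuming the context is a list of text segments ordered oldest to newest. The names `estimate_tokens` and `sliding_window` are hypothetical, and the 4-characters-per-token heuristic is an assumption; swap in the provider's real tokenizer for accurate counts.

```python
from typing import Callable, List


def estimate_tokens(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def sliding_window(
    segments: List[str],
    budget: int = 24_000,
    count: Callable[[str], int] = estimate_tokens,
) -> List[str]:
    """Keep the newest segments whose combined token estimate fits the budget."""
    kept: List[str] = []
    used = 0
    # Walk from newest to oldest and stop at the first segment that would overflow.
    for segment in reversed(segments):
        cost = count(segment)
        if used + cost > budget:
            break
        kept.append(segment)
        used += cost
    kept.reverse()  # restore oldest-to-newest order
    return kept
```

Dropping whole segments keeps the prompt coherent at the cost of some unused budget. A single segment larger than the budget yields an empty result, so oversized documents need chunking first.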
Journey Context:
Providers advertise flat 'per 1M tokens' pricing, but the compute cost of self-attention scales quadratically with sequence length (O(n²)). In effect, providers subsidize short contexts and apply 'long context premiums', or higher per-token rates, beyond thresholds such as 32k or 100k tokens. Claude 3 Opus at 200k context reportedly costs an effective 3x per output token compared with 4k context. More importantly, latency also grows super-linearly with prompt length, burning compute credits while requests wait. Chunking documents at 16k tokens and retrieving with cheap embeddings reportedly cuts costs by about 70% with minimal accuracy loss for most RAG tasks.
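The arithmetic behind this argument can be sketched as follows. This is an illustration under stated assumptions, not a pricing model: the function names are hypothetical, the price passed in is a placeholder rather than any provider's actual rate, and the quadratic term covers only the attention portion of compute.

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Billed input cost under flat per-token pricing (grows linearly)."""
    return tokens * price_per_million / 1_000_000


def attention_compute_ratio(long_tokens: int, short_tokens: int) -> float:
    """Relative attention compute, assuming O(n^2) scaling in sequence length."""
    return (long_tokens / short_tokens) ** 2


def input_savings(full_tokens: int, retrieved_tokens: int) -> float:
    """Fraction of input tokens avoided by sending retrieved chunks only."""
    return 1 - retrieved_tokens / full_tokens


if __name__ == "__main__":
    price = 3.0  # placeholder USD per 1M input tokens (assumption)
    full, retrieved = 100_000, 16_000
    print("full-document cost per call:", input_cost(full, price))
    print("retrieved-chunk cost per call:", input_cost(retrieved, price))
    print("attention compute ratio:", attention_compute_ratio(full, retrieved))
    print("input tokens avoided:", input_savings(full, retrieved))
```

Under flat billing the invoice only grows linearly with prompt length; the quadratic term shows up in provider-side compute and in latency. Actual savings also depend on embedding and retrieval costs and on output tokens, which this sketch leaves out.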
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T11:39:29.228222+00:00— report_created — created