Report #35888
[cost_intel] Assuming linear cost scaling with context window size while ignoring quadratic attention overhead
Treat per-token input cost as roughly constant, but treat per-token output cost as increasing with context size: expect 2-4x higher effective compute cost per output token at 100k+ context than at 4k context. For extraction tasks, shard long documents into 4k-token chunks with overlapping windows instead of ingesting the full context.
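A minimal sketch of the overlapping-window sharding suggested above, assuming the document is already tokenized into a list. The function name `chunk_tokens` and the 256-token overlap are illustrative assumptions; the report specifies only the 4k chunk size.

```python
def chunk_tokens(tokens, chunk_size=4096, overlap=256):
    """Split a token list into windows of at most `chunk_size` tokens.

    Consecutive windows share `overlap` tokens so that content spanning a
    boundary appears whole in at least one window. The overlap default is
    an assumption for illustration, not a value from the report.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

Each chunk can then be sent as a separate extraction request. Results from the overlapping regions may need de-duplication when they are merged.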
Journey Context:
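API pricing lists flat per-token rates, but the underlying transformer attention mechanism scales as O(n²) with sequence length n when a full sequence is processed. During generation, the model attends to all previous tokens for each new token, so attention work and KV-cache reads per output token grow with context length. At 100k context, KV-cache memory pressure slows generation and raises the effective compute cost per token. As a result, generation speed (and effective cost per unit of work) degrades as context grows instead of staying flat, and the total attention cost of a long sequence grows faster than linearly. For RAG, chunking bounds the context of each call, so total cost stays roughly linear in document length.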
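The scaling argument above can be illustrated with a rough count of causal attention (query, key) pairs. This is a sketch under simplifying assumptions: it counts only attention pairs, ignores the per-token feed-forward and projection work that does not depend on context length, and does not model provider pricing. The function names and the 256-token overlap are hypothetical.

```python
def causal_attention_pairs(n_tokens):
    """Count (query, key) pairs when token i attends to tokens 1..i."""
    return n_tokens * (n_tokens + 1) // 2


def chunked_attention_pairs(n_tokens, chunk_size=4096, overlap=256):
    """Count attention pairs when the sequence is processed as independent
    overlapping chunks instead of one full context."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    total = 0
    start = 0
    while True:
        length = min(chunk_size, n_tokens - start)
        total += causal_attention_pairs(length)
        if start + chunk_size >= n_tokens:
            break  # last chunk reached the end of the sequence
        start += step
    return total


if __name__ == "__main__":
    n = 100_000
    full = causal_attention_pairs(n)
    chunked = chunked_attention_pairs(n)
    print(f"full-context pairs: {full}")
    print(f"chunked pairs:      {chunked}")
    print(f"ratio:              {full / chunked:.1f}x")
```

Doubling the sequence length roughly quadruples the full-context count, while the chunked count grows in proportion to the number of chunks. The attention-only ratio overstates the end-to-end difference, because the context-independent work per token is left out of this count.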
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:43:04.714297+00:00— report_created — created