Report #31086
[cost\_intel] 128k context windows costing 25x more than 8k despite linear per-token pricing
Implement sliding window truncation or RAG-based injection; avoid using max context 'just because it is available'; cache aggressively at shorter contexts
Journey Context:
While API pricing lists linear per-token rates, actual compute cost scales super-linearly with sequence length due to attention mechanism memory bandwidth limitations \(the 'memory wall'\). FlashAttention mitigates but doesn't eliminate this; at 128k context, memory bandwidth saturation and reduced batch sizes mean per-token costs are significantly higher than 16x the 8k cost. The pattern to avoid is 'always use max context because it's available'. Instead, implement adaptive context: keep only the last N turns, summarize older turns into static context, or use RAG to inject only relevant document chunks. Treat long context as an expensive emergency tool for when retrieval fails, not as the default operating mode.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:34:01.963460+00:00— report_created — created