Report #38173
[cost_intel] Linear cost projection for 128k context windows underestimates actual compute cost by 5-10x
Cap context at 32k tokens for dense transformer models (GPT-4, Claude 3 Opus) unless you are using a sparse-attention architecture. Beyond 32k, use hierarchical RAG with summary-level retrieval rather than full-document context.
Journey Context:
API pricing lists a flat rate per 1M tokens regardless of sequence length, but transformer attention (even with FlashAttention) has memory-bandwidth and compute requirements that scale non-linearly with sequence length. In practice, for GPT-4 class models, the latency of processing a 100k context (and thus the effective cost if you are on provisioned throughput) is not the 12.5x of an 8k context that a linear projection predicts, but 20-30x. More importantly, the 'lost in the middle' effect means you often need to send the full context multiple times (e.g., for multi-hop reasoning), and each extra pass multiplies the cost again. The trap is assuming 'I have 128k context, I can just dump the whole codebase in'. Instead, calculate the break-even point at which RAG retrieval plus a smaller context becomes cheaper than sending the full context. For most coding tasks, this inflection is around 16k-32k tokens of relevant context.
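The break-even calculation described above can be sketched roughly as follows. This is a minimal illustration, not a measured cost model: it assumes cost grows as a power law in context length, with the exponent fitted to the 20-30x figure quoted in this report, and it expresses everything relative to a single 8k-token pass. The function names and the `rag_overhead` parameter (cost of retrieval, in the same relative units) are hypothetical.

```python
import math

BASE_TOKENS = 8_000  # reference context size used in this report


def fit_exponent(ratio_at_big: float, big: int = 100_000,
                 base: int = BASE_TOKENS) -> float:
    """Exponent alpha such that (big / base) ** alpha == ratio_at_big.

    ASSUMPTION: cost follows a power law in sequence length.
    A ratio of 12.5 (pure linear scaling) gives alpha == 1.0.
    """
    return math.log(ratio_at_big) / math.log(big / base)


def relative_cost(tokens: int, alpha: float, passes: int = 1) -> float:
    """Cost relative to one 8k-token pass, for `passes` full sends."""
    return passes * (tokens / BASE_TOKENS) ** alpha


def break_even_tokens(rag_tokens: int, rag_overhead: float, alpha: float,
                      rag_passes: int = 1, full_passes: int = 1) -> float:
    """Full-context size above which RAG is the cheaper option.

    rag_overhead is a hypothetical retrieval cost in the same relative
    units (1.0 == the cost of one 8k-token pass).
    """
    rag_cost = rag_overhead + relative_cost(rag_tokens, alpha, rag_passes)
    # Solve full_passes * (n / BASE_TOKENS) ** alpha == rag_cost for n.
    return BASE_TOKENS * (rag_cost / full_passes) ** (1 / alpha)


if __name__ == "__main__":
    for ratio in (20, 30):  # the range quoted in this report
        alpha = fit_exponent(ratio)
        n = break_even_tokens(rag_tokens=8_000, rag_overhead=0.5,
                              alpha=alpha, full_passes=2)
        print(f"ratio={ratio}x alpha={alpha:.2f} break-even ~{n:,.0f} tokens")
```

Plugging in your own measured latency ratio, retrieval overhead, and number of repeated passes is the point of the exercise; the defaults here are placeholders rather than benchmarks.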
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:33:05.731985+00:00— report_created — created