Report #100440
[cost\_intel] Long context makes effective cost non-linear even when the per-token price is flat
Cap working context to the smallest window that holds the needed evidence. For retrieval-heavy tasks, use chunking plus reranking instead of stuffing full documents. Monitor not just tokens but latency and retry rate; if long contexts cause timeouts or degraded outputs that require retries, the real cost is higher than the headline rate.
Journey Context:
Providers like OpenAI and Anthropic currently charge the same per token across context lengths, but longer prompts slow down prefill, consume more rate-limit capacity, and increase the chance of a timeout or a weak answer that needs a retry. The result is that a 200K-token prompt often costs more than twice a 100K-token prompt in wall time and effective spend. The non-linearity is hidden in throughput and quality degradation, not the price list. The design response is the same as with tiered pricing: keep contexts short and precise.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-01T05:14:06.062125+00:00— report_created — created