Report #82835
[cost\_intel] Longer context window models increase cost 5x even with short inputs
Explicitly select the smallest context window variant that fits your maximum expected input \(e.g., use a 128k model such as gpt-4o only if needed; otherwise use an 8k or 32k variant if the provider offers one\). Truncate context aggressively to stay under the 8k/32k thresholds. Monitor the prompt token count reported in the response's usage metadata \(e.g., prompt\_token\_count or prompt\_tokens, depending on the provider\) to compare actual usage against window capacity. Note that pricing tiers often jump at fixed boundaries regardless of actual token count.
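The selection, truncation, and monitoring steps can be sketched roughly as follows. This is a minimal illustration, not a provider API: the variant names (`model-8k`, `model-32k`, `model-128k`), their window sizes, and all helper functions are hypothetical, and tokens are represented as a plain list so that any real tokenizer can be substituted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str            # hypothetical model identifier, not a real model name
    context_window: int  # maximum tokens (prompt plus completion)


# Assumed variants for illustration; real names and sizes depend on the provider.
VARIANTS = [
    Variant("model-8k", 8_000),
    Variant("model-32k", 32_000),
    Variant("model-128k", 128_000),
]


def pick_variant(prompt_tokens, reserved_output, variants=VARIANTS):
    """Return the smallest variant whose window fits prompt plus reserved output."""
    needed = prompt_tokens + reserved_output
    for variant in sorted(variants, key=lambda v: v.context_window):
        if variant.context_window >= needed:
            return variant
    raise ValueError(f"no variant fits {needed} tokens")


def truncate_tokens(tokens, budget):
    """Keep only the most recent `budget` tokens, dropping the oldest context."""
    if budget <= 0:
        return []
    return tokens[-budget:]


def window_utilization(prompt_tokens, variant):
    """Fraction of the window actually used, given a reported prompt token count."""
    return prompt_tokens / variant.context_window


if __name__ == "__main__":
    tokens = list(range(10_000))                   # stand-in for a tokenized prompt
    trimmed = truncate_tokens(tokens, 8_000 - 1_000)  # leave 1k tokens for output
    chosen = pick_variant(len(trimmed), reserved_output=1_000)
    print(chosen.name, window_utilization(len(trimmed), chosen))
```

A persistently low utilization value on a large variant is a hint that a smaller one would do, assuming the provider actually prices the variants differently.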
Journey Context:
Some providers charge by context window capacity tier \(8k, 32k, 128k\), not just by actual token usage. Using a 128k context model for a 1k prompt is billed at the higher 128k-tier rate \(~$10/1M tokens\) rather than the 8k-tier rate \(~$5/1M tokens\). Developers often select larger windows 'just in case' without realizing there is a fixed cost multiplier. Additionally, inference cost grows non-linearly with context length because of the attention mechanism. Pricing stays linear per token, so the capacity reservation is the hidden cost.
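To make the tier arithmetic concrete, the sketch below prices the same 1k-token prompt at the two approximate rates quoted above. The rates are the report's illustrative figures, not a current price list, and the function and constant names are made up for this example.

```python
# Illustrative per-million-token rates taken from the report (approximate).
RATE_8K_TIER = 5.0     # USD per 1M tokens
RATE_128K_TIER = 10.0  # USD per 1M tokens


def prompt_cost(tokens, rate_per_million):
    """Cost in USD of `tokens` tokens billed linearly at the given tier rate."""
    return tokens * rate_per_million / 1_000_000


if __name__ == "__main__":
    prompt_tokens = 1_000
    small = prompt_cost(prompt_tokens, RATE_8K_TIER)
    large = prompt_cost(prompt_tokens, RATE_128K_TIER)
    # The multiplier depends only on the tier rates, not on the prompt length.
    multiplier = RATE_128K_TIER / RATE_8K_TIER
    print(f"8k tier: ${small}, 128k tier: ${large}, multiplier: {multiplier}x")
```

Because billing is linear per token within a tier, the multiplier is the same for every prompt size, which is why short prompts gain nothing from the larger window.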
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T21:37:38.787224+00:00— report_created — created