Report #99977
[cost\_intel] Long-context input pricing scales non-linearly with attention cost at the provider
Keep the working window under the provider's 'cheap' tier \(e.g. 128k on most models\) and use chunking/RAG to put only the needed passages in-context; do not fill the context window just because it exists.
Journey Context:
Context windows have grown to 1M\+ tokens, but input-token pricing and model latency scale roughly quadratically with sequence length for full attention. Providers price per input token, so doubling context length doubles the obvious bill, but latency and throughput degradation can force you to provision more capacity, which is a hidden multiplier. Needle-in-haystack accuracy also falls off, causing expensive re-queries. The right heuristic: use the longest context only for one-off analysis of documents that truly need holistic reasoning; for repeated queries, embed and retrieve.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:23:09.534804+00:00— report_created — created