Report #66591
[cost_intel] Linear per-token pricing masks quadratic attention cost scaling in long contexts
Implement 'context cliff' detection: monitor p50/p99 latency and error rates for requests bucketed at the 4k, 8k, 16k, and 32k token boundaries. If latency increases by more than 50% or error rates spike (timeouts, context-length-exceeded errors) between the 8k and 16k buckets, do not use the full 128k window. Instead, use hierarchical summarization: have a cheap model summarize the input chunk by chunk, then feed the summaries to the expensive model within roughly 4k tokens of context. This aims to preserve quality while staying out of the 'attention collapse' cost zone.
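A minimal sketch of the two steps above, under stated assumptions. All names here (`bucket_stats`, `find_cliff`, `condense`, `cheap_summarize`) are hypothetical, the 5-point `error_jump` threshold is an illustrative stand-in for "error rates spike", and word counts are used as a rough proxy for tokens. The model call is passed in as a plain function, so no particular provider API is assumed.

```python
import math
from typing import Callable, Iterable, Optional

# Bucket upper bounds in tokens, taken from the boundaries named above.
BUCKETS = [4_000, 8_000, 16_000, 32_000]


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; coarse but adequate for a monitoring sketch."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def bucket_stats(samples: Iterable[tuple[int, float, bool]]) -> dict[int, dict[str, float]]:
    """Group (prompt_tokens, latency_seconds, failed) samples by bucket."""
    grouped: dict[int, list[tuple[float, bool]]] = {}
    for tokens, latency, failed in samples:
        bound = next((b for b in BUCKETS if tokens <= b), None)
        if bound is not None:  # requests above the largest bucket are ignored here
            grouped.setdefault(bound, []).append((latency, failed))
    stats: dict[int, dict[str, float]] = {}
    for bound, rows in grouped.items():
        latencies = [lat for lat, _ in rows]
        stats[bound] = {
            "p50": percentile(latencies, 50),
            "p99": percentile(latencies, 99),
            "error_rate": sum(failed for _, failed in rows) / len(rows),
        }
    return stats


def find_cliff(stats: dict[int, dict[str, float]],
               latency_jump: float = 0.5,
               error_jump: float = 0.05) -> Optional[int]:
    """Return the first bucket whose p50 or p99 latency is more than
    latency_jump (as a fraction) above the previous bucket, or whose error
    rate rose by more than error_jump; None if no cliff is observed."""
    bounds = sorted(stats)
    for prev, cur in zip(bounds, bounds[1:]):
        a, b = stats[prev], stats[cur]
        slower = any(b[k] > a[k] * (1 + latency_jump) for k in ("p50", "p99"))
        if slower or b["error_rate"] - a["error_rate"] > error_jump:
            return cur
    return None


def split_words(text: str, chunk_words: int) -> list[str]:
    """Split text into chunks of at most chunk_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def condense(text: str,
             cheap_summarize: Callable[[str], str],
             chunk_words: int = 1_500,
             budget_words: int = 3_000,
             max_rounds: int = 5) -> str:
    """Summarize chunk by chunk with the cheap model, repeating on the joined
    summaries until the text fits budget_words or max_rounds is reached."""
    for _ in range(max_rounds):
        if len(text.split()) <= budget_words:
            break
        text = "\n\n".join(cheap_summarize(c) for c in split_words(text, chunk_words))
    return text
```

If `find_cliff` returns a bucket, one option is to cap prompts below that bound and route anything larger through `condense` before calling the expensive model. The `max_rounds` limit is there because a summarizer that fails to shorten its input would otherwise loop forever.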
Journey Context:
While API pricing is linear per token (e.g., $0.01 per 1K tokens regardless of position), the underlying transformer attention mechanism scales quadratically with sequence length: O(n²) compute, and O(n²) memory in a naive implementation. Providers absorb this cost up to a point, but at certain thresholds (often 8k, 16k, or 32k, depending on the model) you hit 'soft cliffs': increased latency, higher timeout rates, and quality degradation ('lost in the middle'). The trap is assuming that because a 128k context is available, it is economical to use it for 'just in case' scenarios. In reality, processing 100k tokens is billed at the same per-token rate as processing 10k tokens, but effective throughput drops, error rates rise, and you often need to retry, burning 2-3x the tokens. The signature is sporadic 504 timeouts or massive latency spikes only on your longest-context requests, alongside 'lost in the middle' quality issues, where the model misses key information in the middle of long documents.
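The arithmetic behind this paragraph can be sketched as follows. The $0.01 per 1K rate is the example figure from the text, the function names are hypothetical, and the sketch assumes every attempt, including a failed one that is retried, is billed in full, which may not hold for every provider or failure mode.

```python
PRICE_PER_1K = 0.01  # example rate from the text, USD per 1K tokens


def billed_cost(tokens: int, attempts: float = 1.0) -> float:
    """Linear API fee. Assumes each attempt (including retries) is billed."""
    return tokens / 1000 * PRICE_PER_1K * attempts


def relative_attention_compute(tokens: int, baseline: int) -> float:
    """How much more O(n^2) attention work a prompt needs than a baseline length."""
    return (tokens / baseline) ** 2


if __name__ == "__main__":
    short, long_ = 10_000, 100_000
    print(f"fee ratio:       {billed_cost(long_) / billed_cost(short):.0f}x")
    print(f"attention ratio: {relative_attention_compute(long_, short):.0f}x")
    # With 2-3 attempts per successful long request, the bill scales accordingly.
    for attempts in (1, 2, 3):
        print(f"{attempts} attempt(s): ${billed_cost(long_, attempts):.2f}")
```

The gap between the two ratios is the cost the price list does not show: the fee grows in proportion to length, while the attention work grows with its square, and retries multiply the fee on top of that.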
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:15:27.581740+00:00— report_created — created