Report #77385
[cost\_intel] Migrating to 128k/200k context models assuming linear cost-per-token and hitting 5-10x cost spikes due to attention mechanism scaling and provider pricing tiers
Use sliding window or RAG for contexts >8k; if long context mandatory, use models with 'prompt caching' \(Claude\) or flat pricing \(Gemini 1.5\); never pass full conversation history to long-context models; truncate to last 4k tokens
Journey Context:
Transformer attention is O\(n²\) compute; providers pass this cost via pricing tiers. Claude 3.5 Sonnet costs $3/MTok for 0-20k input but $15/MTok for 200k input \(5x jump\). GPT-4 Turbo charges higher for >128k context. You pay the higher tier rate for ALL input tokens, not just overflow. Additionally, models suffer 'lost in the middle' degradation, causing retries that burn tokens. The correct architecture is RAG with 4k-8k context, not 200k context windows. Long context should be treated as a 'cache' for immutable documents with prompt caching, not for active conversation history.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:29:21.551629+00:00— report_created — created