Report #53863
[cost\_intel] 128k context window costing 4x more per token than 32k due to attention overhead
Cap effective context at 32k-64k via aggressive RAG chunking. When long context is unavoidable, use providers with prompt caching features \(e.g., Gemini 1.5 Pro context caching, Anthropic prompt caching\) so that a reused prefix is billed at a discounted cached rate instead of full price on every call, and its attention prefill is not recomputed each time.
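A minimal sketch of the context cap, assuming retrieval has already produced relevance-scored chunks. The helper names are hypothetical, and the estimate of about 4 characters per token is a rough assumption; a real pipeline would count tokens with the target model's tokenizer.

```python
from typing import List, Tuple


def approx_tokens(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def pack_chunks(
    scored_chunks: List[Tuple[float, str]],
    budget_tokens: int = 32_000,
) -> List[str]:
    """Greedily keep the highest-scoring chunks that fit within the budget."""
    selected: List[str] = []
    used = 0
    # Visit chunks from most to least relevant.
    for _score, chunk in sorted(scored_chunks, key=lambda p: p[0], reverse=True):
        cost = approx_tokens(chunk)
        if used + cost > budget_tokens:
            continue  # This chunk would overflow; a smaller one may still fit.
        selected.append(chunk)
        used += cost
    return selected
```

The greedy pass favors relevance over coverage. If chunk order matters for the prompt, the selected chunks can be re-sorted by their original position before they are joined.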
Journey Context:
Pricing tables list a flat per-1k-token rate, but the compute behind a request grows faster than linearly with sequence length: dense self-attention over an n-token prompt takes O\(n²\) time \(and O\(n²\) memory in a naive implementation\), so attention work per token grows roughly in proportion to n. Some providers pass this through as a long-context premium: e.g., Gemini 1.5 Pro billed prompts above 128k tokens at a higher per-token rate than shorter ones, while GPT-4o tops out at a 128k window and charges one flat input rate. The less visible trap is that filling a 200k window with a 190k prompt and a 10k completion makes every generated token attend over the entire preceding context, so per-token compute stays high for the whole completion. The fix is architectural: don't brute-force long context. Use retrieval to inject only the relevant chunks \(<32k\). If long context is unavoidable, use providers with native prompt caching \(Anthropic\) or context caching \(Gemini 1.5\): the long prefix is processed and stored once, and later requests that reuse it are billed at a reduced cached-token rate, plus a cache write or storage charge depending on the provider. Recurring input cost is then driven mainly by the suffix and the generation rather than by the full context length. Caching lowers the repeated prefill charge; it does not shrink the context the model attends over while generating.
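The arithmetic can be sketched as follows. This is an illustrative model under stated assumptions, not any provider's billing formula: attention work is counted as query-key pairs under causal masking with a KV cache, and the price and cache multipliers passed in are placeholders to be replaced with values from the provider's current pricing page. All function names are hypothetical.

```python
from typing import Optional


def causal_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored during a causal prefill over n_tokens."""
    return n_tokens * (n_tokens + 1) // 2


def decode_attention_pairs(prompt_tokens: int, new_tokens: int) -> int:
    """Pairs scored while generating new_tokens after the prompt (KV cache assumed)."""
    # Output token i attends to the prompt, the i earlier outputs, and itself.
    return sum(prompt_tokens + i + 1 for i in range(new_tokens))


def input_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    calls: int,
    price_per_mtok: float,
    cache_write_mult: Optional[float] = None,
    cache_read_mult: Optional[float] = None,
) -> float:
    """Total input cost over `calls` requests that share the same prefix.

    Assumptions: without multipliers every call pays full price for the
    prefix; with them, the first call pays the write multiplier, later
    calls pay the read multiplier, and the cache never expires in between.
    """
    if cache_write_mult is None or cache_read_mult is None:
        billed = calls * (prefix_tokens + suffix_tokens)
    else:
        billed = (
            prefix_tokens * cache_write_mult
            + (calls - 1) * prefix_tokens * cache_read_mult
            + calls * suffix_tokens
        )
    return billed * price_per_mtok / 1_000_000


if __name__ == "__main__":
    per_tok_128k = causal_attention_pairs(128_000) / 128_000
    per_tok_32k = causal_attention_pairs(32_000) / 32_000
    print("per-token attention ratio, 128k vs 32k:", per_tok_128k / per_tok_32k)
    # Placeholder price and multipliers, chosen only for illustration.
    print("uncached:", input_cost(190_000, 1_000, 10, 3.0))
    print("cached:  ", input_cost(190_000, 1_000, 10, 3.0, 1.25, 0.10))
```

Under this pair-counting model the per-token attention work at 128k comes out close to 4 times that at 32k, which is simply the ratio of the two lengths. The cached total depends heavily on how many calls reuse the prefix before the cache expires, so the comparison should be rerun with the real call pattern.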
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T20:54:10.360426+00:00— report_created — created