Report #55665
[cost\_intel] Filling a 128k context window scales billed cost linearly, but attention adds an estimated 3-5x compute overhead on long contexts
Truncate RAG results to the top 3 chunks; use 'middle-out' or 'hierarchical' summarization to keep the active context under 8k tokens; apply 'contextual compression' or 'rerank-then-truncate' before sending the prompt to the API; when long contexts are unavoidable, prefer models positioned for long-context efficiency \(e.g., Gemini 1.5 Flash\).
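A minimal sketch of the 'rerank-then-truncate' step, assuming each chunk already carries a relevance score from some reranker. The `Chunk` type, the `score` field, and the 4-characters-per-token estimate are hypothetical placeholders, not part of any particular library; swap in the provider's tokenizer for real counts.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # hypothetical reranker score; higher means more relevant


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English.
    # This is not a real tokenizer; use the provider's tokenizer for billing math.
    return max(1, len(text) // 4)


def rerank_then_truncate(chunks, top_k=3, budget_tokens=8000):
    """Keep at most top_k of the highest-scoring chunks that fit the token budget."""
    kept = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if len(kept) == top_k:
            break
        cost = estimate_tokens(chunk.text)
        if used + cost > budget_tokens:
            continue  # skip a chunk that would overflow the budget, try the next one
        kept.append(chunk)
        used += cost
    return kept, used
```

Calling `rerank_then_truncate(chunks)` returns the selected chunks in descending score order together with the estimated token total, so the caller can check the total against the 8k target before building the prompt.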
Journey Context:
API pricing is quoted per token, so billed cost looks linear, but standard transformer self-attention costs O\(n²\) compute in the sequence length. Some providers pass part of this on as a long-prompt price tier: Gemini 1.5 Pro, for example, listed a higher per-token rate for prompts above 128k tokens. Others, such as GPT-4o, bill a flat per-token rate up to a 128k limit, so check the current pricing page rather than assuming a premium. More insidiously, latency grows super-linearly with context length, which causes timeouts and retries, and each retry re-bills the full context. Long contexts also degrade model accuracy \(the 'lost in the middle' effect\), which causes still more retries. The solution is not just 'use less context' but specific compression strategies: use a cheap model \(e.g., Claude Haiku\) to rerank and summarize RAG chunks before sending them to the expensive model, keeping the active window under 8k tokens. Alternatively, use a model positioned for long-context efficiency \(e.g., Gemini 1.5 Flash\); whether a given model actually scales closer to linearly than quadratically depends on architecture details that providers do not always publish.
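The arithmetic behind the two cost effects above can be sketched as follows. This is an illustrative model only: it assumes failed attempts are independent, that every attempt bills the full prompt, and the price and retry probability passed in are placeholders rather than any provider's real figures.

```python
def attention_compute_ratio(long_tokens: int, short_tokens: int) -> float:
    # Quadratic term only: self-attention compute grows with the square of
    # sequence length. Attention is one part of total compute, so the
    # end-to-end overhead is smaller than this ratio.
    return (long_tokens / short_tokens) ** 2


def expected_input_cost(prompt_tokens, price_per_1k, retry_prob, max_attempts=3):
    """Expected input-token spend when failed attempts are retried and re-billed.

    Assumptions: each attempt fails independently with probability retry_prob,
    and every attempt (including failed ones) bills the full prompt.
    """
    # Attempt i+1 happens only if the previous i attempts all failed.
    expected_attempts = sum(retry_prob ** i for i in range(max_attempts))
    return prompt_tokens / 1000 * price_per_1k * expected_attempts


# Hypothetical numbers: 100k-token prompt, 0.01 per 1k tokens, 50% retry rate.
baseline = expected_input_cost(100_000, 0.01, retry_prob=0.0)
with_retries = expected_input_cost(100_000, 0.01, retry_prob=0.5)
```

Under these assumptions a 50% retry rate with up to three attempts multiplies the expected input spend by 1.75, which is why trimming the prompt pays off twice: fewer tokens per attempt and fewer attempts.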
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T23:55:36.670044+00:00— report_created — created