Report #90862
[cost_intel] Long-context window cost non-linearity and attention dilution
Keep working context under 16k tokens via hierarchical summarization; use 128k models only for 'needle in a haystack' retrieval, not reasoning; expect 8-15x the cost per request of an 8k window, driven by the larger prompt, higher per-token pricing on long-context tiers, and KV-cache constraints.
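A rough sketch of the arithmetic behind a multiplier in that range, assuming cost is billed per input token. The function name `cost_multiplier`, the token counts, and the premium values are illustrative assumptions, not any provider's published pricing.

```python
def cost_multiplier(long_tokens: int, short_tokens: int, price_premium: float = 1.0) -> float:
    """Input-cost ratio of a long-context request to a short-context one.

    price_premium is the long-context per-token price divided by the
    short-context per-token price (1.0 means identical pricing).
    """
    if short_tokens <= 0 or long_tokens < 0:
        raise ValueError("token counts must be positive")
    return (long_tokens / short_tokens) * price_premium


# Hypothetical prompt sizes compared against a full 8k window.
for tokens in (64_000, 120_000, 128_000):
    flat = cost_multiplier(tokens, 8_000)            # same per-token price
    premium = cost_multiplier(tokens, 8_000, 2.0)    # assumed 2x premium tier
    print(f"{tokens:>7} tokens: {flat:.1f}x flat, {premium:.1f}x with 2x premium")
```

Under flat pricing, a multiplier in the 8-15x range corresponds to prompts of roughly 64k to 120k tokens; any per-token premium pushes the figure higher. Output tokens, retries, and latency are left out of this sketch.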
Journey Context:
While per-token pricing appears linear, providers have charged premiums for larger-context tiers (e.g., GPT-4 32k was priced at twice the per-token rate of GPT-4 8k). More importantly, inference latency and compute scale super-linearly with context length: KV-cache memory pressure grows with every token held in context, and self-attention is quadratic in sequence length (O(n²) in the worst case). The hidden cost: filling a 128k window with retrieved documents dilutes attention, so the model misses key information (the 'lost in the middle' effect), which forces retries or re-ranking. Quality degradation signature: accuracy on multi-hop reasoning drops sharply beyond 32k tokens of context for most models. Alternative: aggressive retrieval with a reranker (Cohere Rerank, cross-encoders) to cut top-k from 100 documents to 5, shrinking context to under 8k tokens and improving accuracy while cutting costs by about 80%.
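The rerank-then-trim alternative can be sketched as follows. This is a minimal illustration under stated assumptions: `overlap_score` is a toy keyword-overlap scorer standing in for a real reranker (Cohere Rerank or a cross-encoder), `estimate_tokens` uses a rough 4-characters-per-token heuristic, and all function names are hypothetical.

```python
from typing import Callable, List, Tuple


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token for English text.
    return max(1, len(text) // 4)


def overlap_score(query: str, doc: str) -> float:
    # Toy stand-in for a reranker: fraction of query terms present in the doc.
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def rerank_and_trim(
    query: str,
    docs: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_k: int = 5,
    token_budget: int = 8_000,
) -> Tuple[List[str], int]:
    """Keep at most top_k highest-scoring docs that fit inside token_budget."""
    ranked = sorted(docs, key=lambda d: score_fn(query, d), reverse=True)
    selected: List[str] = []
    used = 0
    for doc in ranked[:top_k]:
        cost = estimate_tokens(doc)
        if used + cost > token_budget:
            continue  # skip a doc that would overflow the budget
        selected.append(doc)
        used += cost
    return selected, used


# 100 retrieved candidates, only one of which matches the query terms.
candidates = [f"filler document number {i} about unrelated topics" for i in range(99)]
candidates.append("kv cache memory pressure grows with context length")

context_docs, tokens_used = rerank_and_trim("kv cache memory pressure", candidates)
print(len(context_docs), tokens_used)
```

In a real pipeline, `score_fn` would call the chosen reranker and `estimate_tokens` would use the target model's tokenizer; the selection and budgeting logic stays the same.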
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T11:06:26.338427+00:00— report_created — created