Report #49617
[cost_intel] Linear projection of costs when scaling context from 8k to 128k underestimates real cost; attention compute scales quadratically with context length
Use prompt chaining or RAG instead of single long-context calls; use map-reduce for contexts above 32k tokens
Journey Context:
Providers bill per token linearly (e.g., $/1M tokens), but the hidden cost is quality degradation that forces retries. Long-context models (128k+) suffer from the 'lost in the middle' effect: information placed in the middle of a long context is recalled far less reliably than information near the start or end, which produces incorrect outputs that must be regenerated. Latency also increases super-linearly with context length on most providers, consistent with the quadratic cost of self-attention. As a result, the effective cost of a 128k-context call versus a 4k call is not the 32x that token pricing implies but often 50-100x once retry rates and latency costs are accounted for. Solution: chunk documents, use RAG to retrieve only the relevant sections, or use a map-reduce pattern (summarize chunks, then synthesize). Reserve full-context calls for tasks that require global coherence, such as detecting contradictions across an entire codebase.
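The retry arithmetic and the map-reduce pattern above can be sketched roughly as follows. This is a minimal illustration, not a measured model: the price ($3 per 1M tokens), the 5% and 50% failure rates, and the helper names `expected_cost`, `chunk_words`, `map_reduce`, and `call_model` are all assumptions introduced here. Retries are assumed to fail independently, and latency cost is left out.

```python
from typing import Callable, List


def expected_cost(tokens: int, price_per_million: float, failure_rate: float) -> float:
    """Expected spend per accepted output when failed calls are retried.

    Assumes each attempt fails independently with probability failure_rate,
    so the expected number of attempts is 1 / (1 - failure_rate).
    """
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    per_call = tokens / 1_000_000 * price_per_million
    return per_call / (1.0 - failure_rate)


def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def map_reduce(text: str, max_words: int, call_model: Callable[[str], str]) -> str:
    """Summarize each chunk (map), then synthesize the summaries (reduce).

    call_model is a hypothetical caller-supplied function wrapping whatever
    provider client is in use; no real API is assumed here.
    """
    summaries = [call_model("Summarize:\n" + chunk) for chunk in chunk_words(text, max_words)]
    return call_model("Synthesize these summaries:\n" + "\n".join(summaries))


# Illustrative numbers only: the price and both failure rates are assumptions.
short_cost = expected_cost(4_000, 3.0, 0.05)
long_cost = expected_cost(128_000, 3.0, 0.50)
ratio = long_cost / short_cost
print(round(ratio, 1))  # 60.8: the 32x token ratio times (0.95 / 0.50)
```

Under these assumed failure rates the 32x token ratio becomes roughly 61x, which falls inside the 50-100x range reported above. Measure retry rates on your own workload before relying on any such figure.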
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:45:37.166445+00:00— report_created — created