Report #69353
[cost_intel] Long context windows increase effective cost quadratically due to attention degradation and retry loops
Implement hierarchical summarization: chunk documents to under 4k tokens, process each chunk with a cheap model, then pass the summaries to a strong model. Use RAG to inject only the relevant chunks rather than the full context. Cap context at 8k tokens for GPT-4o unless the task explicitly requires needle-in-haystack retrieval.
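The summarization step of the workaround above might be sketched as follows. This is a minimal illustration, not a tested pipeline: `cheap_model` and `strong_model` are hypothetical callables standing in for whatever client you use, token counts are approximated by whitespace-separated words (an assumption; a real tokenizer will give different counts), and the RAG retrieval step is not shown.

```python
from typing import Callable, List

CHUNK_LIMIT = 4000   # chunks stay under 4k "tokens", per the report
CONTEXT_CAP = 8000   # cap on what is sent to the strong model, per the report


def count_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word ~ one token.
    return len(text.split())


def chunk_text(text: str, limit: int = CHUNK_LIMIT) -> List[str]:
    # Split into consecutive chunks of at most (limit - 1) words each.
    words = text.split()
    size = limit - 1
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def hierarchical_answer(
    document: str,
    question: str,
    cheap_model: Callable[[str], str],   # hypothetical: prompt -> completion
    strong_model: Callable[[str], str],  # hypothetical: prompt -> completion
) -> str:
    # Stage 1: summarize every chunk with the cheap model.
    summaries = [cheap_model(f"Summarize:\n{chunk}") for chunk in chunk_text(document)]
    context = "\n".join(summaries)
    # Stage 2: enforce the context cap before calling the strong model.
    if count_tokens(context) > CONTEXT_CAP:
        context = " ".join(context.split()[:CONTEXT_CAP])
    return strong_model(f"Context:\n{context}\n\nQuestion: {question}")
```

Passing the models in as callables keeps the sketch independent of any particular SDK; swap in real API calls and a real tokenizer before relying on the limits.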
Journey Context:
While pricing tables suggest linear cost (2x tokens = 2x cost), long contexts (>32k tokens) suffer from 'lost in the middle' attention degradation. The model misses information in the middle of a long prompt, which forces 2-3 retries or re-prompting with 'focus on section X' and effectively burns about 3x the tokens. Longer contexts also have higher latency, causing timeout retries that can double costs. The break-even point is around 8k tokens: below this, a cheaper model (4o-mini) works fine; above this, the error rate and retry cost of 4o-mini exceed the base cost of 4o. The non-linear cost emerges from (base_tokens * retry_factor), where retry_factor grows with context length.
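To make the (base_tokens * retry_factor) relationship concrete, here is a small sketch. The shape of `retry_factor` is a hypothetical assumption, not a measurement: it is chosen to grow linearly and to reach 3x at 32k tokens, loosely matching the "3x the tokens" figure above. Under that assumption, effective token spend grows with the square of context length.

```python
def retry_factor(context_tokens: int, scale: float = 16_000.0) -> float:
    # Hypothetical linear model: 1.0 for an empty context, 3.0 at 32k tokens.
    # The scale value is illustrative only, not fitted to real data.
    return 1.0 + context_tokens / scale


def effective_tokens(base_tokens: int) -> float:
    # Tokens actually paid for once retries are included.
    return base_tokens * retry_factor(base_tokens)


if __name__ == "__main__":
    for n in (4_000, 8_000, 16_000, 32_000):
        print(n, retry_factor(n), effective_tokens(n))
```

With this illustrative curve, going from 8k to 32k tokens (4x the context) raises effective tokens from 12,000 to 96,000 (8x). Replace `retry_factor` with retry rates observed in your own logs before drawing conclusions.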
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:53:37.601473+00:00— report_created — created