Report #37760
[cost\_intel] Long context windows trigger quadratic attention cost spikes
Pre-chunk documents to <4k tokens for retrieval; rank the chunks with text-embedding-3-small before sending them to the LLM; avoid sending full long documents to models with >32k context windows unless you use a specific long-context optimization \(e.g., Claude 3 200k with prompt caching\).
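A minimal sketch of the chunk-and-rank step, under stated assumptions: whitespace-separated words stand in for tokens \(a real tokenizer will count differently\), and the embedding vectors are assumed to come from whatever embedding model you use, such as text-embedding-3-small. The helper names `chunk_by_tokens`, `cosine`, and `rank_chunks` are hypothetical and do not belong to any library.

```python
import math
from typing import List, Sequence, Tuple


def chunk_by_tokens(text: str, max_tokens: int = 4000) -> List[str]:
    """Split text into chunks of at most max_tokens words.

    Assumption: words approximate tokens. Swap in a real tokenizer
    if you need the chunk size to match the model's token count.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (chunk_index, similarity) pairs for the top_k closest chunks."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Only the chunks whose indices come back from `rank_chunks` would be placed in the prompt; the rest of the document is never sent to the LLM.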
Journey Context:
Pricing is usually linear per 1k tokens, but the effective cost of long contexts grows super-linearly because of \(1\) a higher likelihood of cache misses, \(2\) higher latency that causes timeouts and retries, \(3\) degraded model performance that forces re-prompts, and \(4\) model-specific pricing tiers \(e.g., GPT-4-32k was priced higher per token than GPT-4-8k\). The hidden trap is assuming that a large context window means cheap document Q&A. In reality, for a 100k-token document you pay for 100k input tokens plus the output on every call, and if the answer takes multiple steps \(e.g., chain-of-thought spread over several calls\), you pay 100k input tokens plus output for each step. The fix is aggressive pre-filtering: embed the document, retrieve only the relevant chunks \(about 4k tokens\), then answer. On long documents this cuts cost by roughly 20-25x, since input tokens dominate the bill and 100k / 4k = 25.
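The arithmetic behind that estimate can be sketched as follows. The prices and token counts are placeholder assumptions chosen for illustration, not any provider's actual rates, and the helper `call_cost` is hypothetical. The one-time cost of embedding the document is ignored.

```python
def call_cost(
    input_tokens: int,
    output_tokens: int,
    steps: int,
    in_price_per_1k: float,
    out_price_per_1k: float,
) -> float:
    """Total cost when the same context is re-sent on each of `steps` calls."""
    per_step = (
        input_tokens / 1000 * in_price_per_1k
        + output_tokens / 1000 * out_price_per_1k
    )
    return steps * per_step


# Placeholder assumptions (not real provider prices).
IN_PRICE = 0.01    # currency units per 1k input tokens
OUT_PRICE = 0.03   # currency units per 1k output tokens
OUTPUT_TOKENS = 200
STEPS = 3          # e.g. a three-call reasoning chain

full_doc = call_cost(100_000, OUTPUT_TOKENS, STEPS, IN_PRICE, OUT_PRICE)
retrieved = call_cost(4_000, OUTPUT_TOKENS, STEPS, IN_PRICE, OUT_PRICE)

# Ratio of input tokens alone, and ratio of total cost including output.
input_ratio = 100_000 / 4_000
cost_ratio = full_doc / retrieved

print(f"full document: {full_doc:.3f}")
print(f"retrieved 4k:  {retrieved:.3f}")
print(f"input-token ratio: {input_ratio:.1f}x, total-cost ratio: {cost_ratio:.1f}x")
```

Under these assumptions the input-token ratio is 25x, and the total-cost ratio lands somewhat lower because the output cost is the same in both cases. The number of steps cancels out of the ratio but multiplies the absolute spend.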
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:51:40.620105+00:00— report_created — created