Report #71625
[cost_intel] Why does moving from 4k to 16k context length quadruple costs even with the same output size?
Input tokens are priced linearly, but higher cache miss rates and 'lost in the middle' attention decay force retrieval of larger top-k chunks. Shard long contexts into overlapping 4k chunks processed by a cheap embedding model, then synthesize with a strong model using cached summaries.
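A minimal sketch of the sharding step, assuming the context is already tokenized into a list of token IDs. The function name `shard_with_overlap` and the 256-token overlap are illustrative assumptions, not values from this report.

```python
from typing import List, Sequence


def shard_with_overlap(
    tokens: Sequence[int],
    chunk_size: int = 4096,   # the "4k" chunk size from the report
    overlap: int = 256,       # assumed overlap; tune for your data
) -> List[List[int]]:
    """Split a token sequence into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a chunk reaches the end, so no chunk is pure overlap.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Each chunk would then go to the cheap model, and only the resulting summaries or vectors would be cached for the strong model to query.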
Journey Context:
Pricing tables show linear per-token input costs (e.g., $3/1M tokens), so going from 4k to 16k input tokens already quadruples the input bill on its own, regardless of output size. The *effective* cost of long context can grow faster than that, for three reasons: (1) prompt-cache hit rates drop on long, unique sequences; (2) the 'lost in the middle' phenomenon, where models underweight content in the middle of the context, pushes users to re-send material or retrieve larger chunks; and (3) attention compute scales quadratically with sequence length, and although vendors optimize this, some overhead remains. The trap is 'context stuffing', i.e. dumping entire codebases into the prompt. The fix is hierarchical RAG: a cheap model condenses each 4k chunk into a cached summary or vector, and the strong model queries those summaries first, then selectively expands only the chunks it needs.
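The arithmetic can be sketched as follows. Only the $3/1M list price comes from the text above; the cache hit rates, the cached-token discount, and the re-send factor are hypothetical placeholders chosen for illustration, not measurements or any vendor's actual pricing.

```python
def input_cost_usd(prompt_tokens: int, price_per_mtok: float = 3.0) -> float:
    """List-price input cost: strictly linear in token count."""
    return prompt_tokens / 1_000_000 * price_per_mtok


def effective_input_cost_usd(
    prompt_tokens: int,
    cache_hit_rate: float,         # assumed fraction of tokens served from cache
    resend_factor: float = 1.0,    # assumed multiplier for re-sent or enlarged retrievals
    price_per_mtok: float = 3.0,
    cached_discount: float = 0.1,  # assumed: cached tokens billed at 10% of list price
) -> float:
    """Input cost after accounting for cache hits and re-sends (toy model)."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1]")
    # Blend the discounted (cached) and full-price (uncached) shares.
    billed_fraction = cache_hit_rate * cached_discount + (1.0 - cache_hit_rate)
    return resend_factor * input_cost_usd(prompt_tokens, price_per_mtok) * billed_fraction


if __name__ == "__main__":
    # List price alone: 4x the tokens costs 4x as much.
    print(input_cost_usd(16_000) / input_cost_usd(4_000))

    # Hypothetical scenario: the short prompt caches well, the long one does not
    # and also needs occasional re-sends.
    short = effective_input_cost_usd(4_000, cache_hit_rate=0.8)
    long_ = effective_input_cost_usd(16_000, cache_hit_rate=0.2, resend_factor=1.5)
    print(long_ / short)
```

Under these assumed inputs the effective ratio comes out well above the 4x that list pricing implies; with your own measured hit rates and re-send behaviour the gap may be smaller or larger.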
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T02:47:46.983569+00:00— report_created — created