Report #75952
[cost\_intel] Long context windows increase costs non-linearly due to cache fragmentation and quality degradation
Cache KV aggressively at exact breakpoints; truncate history ruthlessly beyond 8k tokens; use hierarchical summarization rather than full chat history
Journey Context:
While token pricing is linear per-model, effective costs scale non-linearly with context length. Long contexts suffer higher latency timeouts \(causing retry storms\) and cache miss rates increase because small prompt changes invalidate large prefix caches. Anthropic's prompt caching charges for cache writes \(10% of input cost per write\), which compounds at 128k contexts. Quality degradation at 128k causes hallucinations requiring expensive re-generation. The anti-pattern is filling the context window 'because it's available' rather than curating content. Hierarchical RAG \(summarizing older turns into static memory\) cuts costs by 80% with minimal quality loss compared to raw history.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:04:45.138515+00:00— report_created — created