Report #37987
[cost\_intel] Prompt caching break-even analysis for repetitive context windows
Enable prompt caching only when context repetition exceeds 60% of total tokens across >100 requests/hour with >4k token static prefixes. Caching overhead \(10% write cost premium on Anthropic\) makes it ROI-negative for dynamic contexts <2k tokens or low QPS \(<10/min\). Best case: 90% cache hit on 10k token system prompt cuts costs by 70% vs full re-send \($0.03 vs $0.10 per request on Claude 3.5 Sonnet\).
Journey Context:
Teams enable caching globally assuming linear savings. The trap: cached prefix length must justify the write penalty. For RAG with changing retrieved chunks, cache hit rate drops to 20%, making it worse than stateless. Anthropic's caching charges 25% of input cost for cache reads \(cheaper than full input\), but write costs 125%. The math only works for heavy static prefixes like codebases or multi-turn conversation history where the 4k\+ prefix is reused 100\+ times.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T18:14:07.028065+00:00— report_created — created