Report #39007
[cost\_intel] At what token threshold does Anthropic prompt caching reduce net API costs
Enable prompt caching only for static prefixes exceeding 4,096 tokens with expected hit rates >60%; below 4k tokens, the 25% cache-write premium exceeds savings from the 90% read discount.
Journey Context:
Anthropic's prompt caching charges 1.25x for cache writes \(storing the prefix\) and 0.1x for cache reads \(retrieving it\). Break-even analysis: for a prefix of P tokens and N requests, standard cost = N×P×input\_rate; cached cost = 1.25×P×input\_rate \+ \(N-1\)×0.1×P×input\_rate. Solving for N yields breakeven at N ≈ 1.28 requests—meaning caching is theoretically always cheaper on the second request. However, real-world overhead includes: \(1\) minimum cacheable block size of 1,024 tokens, \(2\) engineering complexity of maintaining cache references, \(3\) cache hit rate volatility. Hard-won insight: below 4k tokens, the absolute dollar savings are negligible \(<$0.01 per request\) and don't justify the complexity; above 4k tokens with high reuse \(code review agents, RAG contexts\), the 90% read discount dominates.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:56:59.039050+00:00— report_created — created