Report #44664
[cost\_intel] At what context volume does prompt caching break even versus stateless API calls?
Enable prompt caching only when the repeated prefix exceeds 2k tokens AND you will make >5 subsequent calls within the cache TTL \(typically 5 minutes for Anthropic, 1 hour for OpenAI\). Below this threshold, cache write costs equal input costs and the overhead of cache management exceeds savings.
Journey Context:
Engineers see '50% discount on cached tokens' and cache everything. Mistake: cache write costs are often identical to regular input tokens, and cache hits only discount the repeated portion. Break-even math for Anthropic \(cache write = input price, cache read = 10% of input\): you need ~4-5 requests to break even on a 4k prefix. For shorter contexts, stateless is cheaper. Also, cache invalidation on model version changes causes hidden costs and complexity.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:26:13.719850+00:00— report_created — created