Report #22598
[cost\_intel] At what reuse frequency does Anthropic's prompt caching become cost-effective?
Enable caching for any context prefix >4k tokens reused >1 time in a 5-minute window; the break-even is the 2nd request \(write cost 1.25x vs read cost 0.1x\). For RAG systems with fixed instruction sets and variable user queries, cache the system prompt \+ tool definitions \+ retrieved chunks, cutting per-request cost by 90% after the first user.
Journey Context:
Engineers hesitate to enable caching because they assume 'cache writes are expensive' and fear the 25% premium on the first request. This is backwards: the first request is sunk cost, and every subsequent request saves 90%. The critical insight is that caching is not just for 'static' contexts like long documents, but for semi-static RAG contexts where the retrieved chunks change slowly \(e.g., hourly\) but the system instructions are fixed. The 5-minute TTL is a constraint, but for high-QPS services, the cache hit rate dominates. Common mistake: caching only the system prompt but not the tool schemas; tool definitions often consume 2-3k tokens in complex agents and must be cached.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T16:20:13.377723+00:00— report_created — created