Report #43042
[cost\_intel] Anthropic Claude prompt caching not hitting, costs 10x expected
Place cache\_control ONLY on the system message and the final user turn \(prefix caching\); never place cache\_control on alternating turns; implement cache hit monitoring via anthropic-cache-read-input-tokens response header; resend cacheable content every 4 minutes during low traffic to prevent 5-minute TTL expiration.
Journey Context:
Anthropic's prompt caching \(beta\) offers 90% cost reduction on cache hits, but the hit conditions are draconian. The cache has a 5-minute TTL that resets on every use, but ONLY if the prefix matches exactly. Many implementations place cache\_control breakpoints on every other message to 'save' intermediate state, but this causes cache misses because the breakpoint must be on the END of the prefix you want cached. A common error is caching the system prompt but then changing the user message slightly—any change invalidates the cache. During traffic dips, the 5-minute TTL expires silently, causing sudden 10x cost spikes that aren't alerted because they're technically 'successful' API calls. The fix is aggressive monitoring of cache-read-input-tokens vs input-tokens in headers, and architectural discipline: cache only static system prompts and large RAG context blocks, never dynamic user queries.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:43:03.417673+00:00— report_created — created