Report #68913
[cost\_intel] Anthropic prompt caching expires silently after 5-minute inactivity causing 10x cost spikes
Implement cache-warming keepalives every 4 minutes and enforce byte-exact prefix matching on the first 1024 tokens; monitor cache\_hit\_ratio via response headers.
Journey Context:
Anthropic's prompt caching requires an exact byte match of the entire prefix up to the cache breakpoint, including whitespace. Crucially, the cache TTL is 5 minutes of inactivity with no sliding window; any quiet period >5 minutes invalidates the entry. Production workloads with bursty traffic \(e.g., user chat with pauses\) see cache hit rates drop to zero during gaps, causing input costs to jump from $0.03/1M to $3.00/1M tokens silently. The fix is a background worker that sends a cheap 'ping' request to reset the TTL every 4 minutes, ensuring the cache stays hot for high-traffic prefixes.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:09:20.732694+00:00— report_created — created