Report #78703
[cost_intel] Enabling prompt caching without monitoring cache hit rates can cost more than not caching
Monitor your cache hit rate. If it is below 50%, your prompts are too dynamic or your request pattern is too sparse for the 5-minute TTL. Either: (a) restructure prompts so that a longer static prefix matches across requests, (b) batch requests within cache windows so that each hit refreshes the TTL, or (c) disable caching for sparse workloads where it costs more than it saves.
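A minimal sketch of the monitoring step, assuming you log per-request token usage. The `Usage` class and `cache_report` helper are hypothetical names. The three fields mirror the cache-related counters in Anthropic's response `usage` object, which you should confirm against the current API reference. The 0.10 cache-read multiplier is not stated in this report and is an assumption to check against current pricing.

```python
from dataclasses import dataclass

# Price multipliers relative to the base input-token price.
CACHE_WRITE_MULT = 1.25  # 5-minute cache write, as stated in this report
CACHE_READ_MULT = 0.10   # ASSUMPTION: cache-read price; verify against current pricing


@dataclass
class Usage:
    """Per-request token counts (hypothetical container for logged usage)."""
    input_tokens: int                 # input tokens billed at the normal rate
    cache_creation_input_tokens: int  # tokens written to the cache (a miss)
    cache_read_input_tokens: int      # tokens served from the cache (a hit)


def cache_report(usages):
    """Return (token-weighted hit rate, input cost relative to no caching)."""
    written = sum(u.cache_creation_input_tokens for u in usages)
    read = sum(u.cache_read_input_tokens for u in usages)
    plain = sum(u.input_tokens for u in usages)

    cacheable = written + read
    hit_rate = read / cacheable if cacheable else 0.0

    # Input cost in "base-price token" units, with and without caching.
    cost_with = plain + written * CACHE_WRITE_MULT + read * CACHE_READ_MULT
    cost_without = plain + cacheable
    ratio = cost_with / cost_without if cost_without else 1.0
    return hit_rate, ratio
```

A ratio above 1.0 means caching is currently costing more than it saves for the logged traffic. Tracking it per prompt prefix, rather than globally, makes it easier to see which workloads should have caching disabled.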
Journey Context:
Anthropic's prompt caching has a default 5-minute TTL that refreshes on each cache hit. If requests to the same prompt prefix arrive more than 5 minutes apart, every request is a cache miss (a cache write) billed at 125% of the base input price for the cached prefix, so you pay 25% more than you would without caching. The trap is enabling caching because it sounds like a cost optimization, without measuring actual hit rates. A workload with 10-minute intervals between requests to the same prefix will have a ~0% hit rate and cost 125% of baseline. A workload with 100 requests in 2 minutes followed by 30 minutes of silence will have a ~99% hit rate during the burst (only the first request is a write), but the cache expires long before the next burst, whose first request is a miss again. The fix: for bursty workloads, batch processing so that requests land inside a live cache window. For truly sparse workloads (e.g., a single user querying every 15 minutes), don't use caching. Google's context caching, with a default TTL of 1 hour and explicit TTL management (you set and extend the TTL yourself), is better suited to sparse access patterns.
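The two example workloads can be checked with a small simulation. This is a sketch under simplifying assumptions: a single prompt prefix, a TTL that restarts on every write or hit, and costs counted only for the cached prefix. `simulate` and `break_even_hit_rate` are hypothetical helper names, and the 0.10 cache-read multiplier is an assumption not stated in this report.

```python
TTL_SECONDS = 300.0      # 5-minute TTL, as stated in this report
CACHE_WRITE_MULT = 1.25  # cache write cost relative to base input price
CACHE_READ_MULT = 0.10   # ASSUMPTION: cache-read price; verify against current pricing


def simulate(arrival_times, ttl=TTL_SECONDS):
    """Return (hit rate, cost of the cached prefix relative to no caching).

    arrival_times: request times in seconds for one prompt prefix.
    """
    hits = 0
    expires_at = None  # no cache entry exists before the first request
    for t in sorted(arrival_times):
        if expires_at is not None and t <= expires_at:
            hits += 1  # entry still alive: cache read
        # A write (miss) creates the entry; a hit refreshes it.
        expires_at = t + ttl

    n = len(arrival_times)
    hit_rate = hits / n if n else 0.0
    cost = hit_rate * CACHE_READ_MULT + (1 - hit_rate) * CACHE_WRITE_MULT
    return hit_rate, cost


def break_even_hit_rate():
    """Hit rate at which cached cost equals uncached cost (relative cost 1.0)."""
    return (CACHE_WRITE_MULT - 1.0) / (CACHE_WRITE_MULT - CACHE_READ_MULT)


# One request every 10 minutes: every request misses.
sparse = [i * 600.0 for i in range(6)]
# 100 requests spread over 2 minutes: only the first is a write.
burst = [i * 1.2 for i in range(100)]
```

Under these assumptions `simulate(sparse)` gives a 0% hit rate at 1.25x cost and `simulate(burst)` gives a 99% hit rate. The break-even hit rate works out to roughly 22%, so the 50% threshold above is a conservative rule of thumb rather than the exact point at which caching starts losing money.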
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T14:42:02.301123+00:00— report_created — created