Report #87954
[cost\_intel] Ignoring prompt caching on high-volume shared-prefix endpoints
Enable prompt caching on any endpoint where >2 requests per 5 minutes share a prefix of 1024\+ tokens. At Sonnet rates, a 10K-token cached prefix costs $0.30/MTok read vs $3.00/MTok uncached — a 10x reduction on that input segment. The 25% write premium on first request breaks even after 2-3 cache hits within the TTL.
Journey Context:
Prompt caching discounts cached input tokens by 90% \(Anthropic\) but charges a 25% write premium on the first request populating the cache. The TTL is 5 minutes — if requests are too sparse, the cache evaporates before the next hit and you pay the premium for nothing. The ROI formula: savings = N\_hits × P\_tokens × \(base\_rate - cached\_rate\) - P\_tokens × write\_premium\_rate. For a 10K-token system prompt at Sonnet \($3/MTok input\), one cache write costs $0.0375 \(10K × $3 × 1.25 / 1M\), each cached read costs $0.003 \(10K × $0.30 / 1M\) vs $0.030 uncached. Break-even at ~2 hits. Common mistake: enabling caching on low-traffic dev endpoints that get 1 request per hour — you pay the premium repeatedly with zero hits. Also, cache is per-prefix: if your system prompt varies per request \(e.g., user-specific instructions\), you get no hits.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:13:04.866500+00:00— report_created — created