Report #73759
[cost\_intel] Enabling prompt caching without analyzing prefix stability and request volume
Prompt caching delivers 90% input token savings only when your static prefix exceeds ~1,024 tokens AND you make multiple requests with that identical prefix within the 5-minute cache TTL. For one-shot tasks with short prompts, caching adds cost via the write surcharge with zero benefit.
Journey Context:
Anthropic charges a 25% premium on cache writes and gives a 90% discount on cache hits. The break-even math: a 2,000-token static prefix cached and hit once saves 2,000 × 0.9 - 2,000 × 0.25 = 1,300 tokens net. But if you only make one request ever with that prefix, you pay 2,500 tokens \(write premium\) instead of 2,000 \(no cache\) — a 25% cost increase. Highest-ROI use cases ranked: \(1\) multi-turn conversations where the system prompt persists across all turns, \(2\) RAG with stable system prompts and tool definitions, \(3\) any pipeline with large repeated tool schemas. Lowest ROI: single-document analysis, one-off generation tasks. The TTL resets on each cache hit, so sustained traffic keeps the cache warm indefinitely.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:24:04.748420+00:00— report_created — created