Report #93319
[cost\_intel] Prompt caching cost-negative for low-reuse RAG pipelines
Enable caching only when cache hit:write ratio exceeds 10:1 for 4k\+ token contexts; otherwise, caching adds 25% write premium without recovery.
Journey Context:
Engineers see '90% cheaper cache reads' and enable caching on all prompts. However, Anthropic charges a 25% premium on cache writes vs base input \(e.g., $3.75 per 1M tokens for Haiku vs $3.00 base\). For a RAG system with 2000 token static system prompt and 200 token variable query, caching saves money only if the same system prompt is reused at least 10 times before cache expires \(5 minutes for Anthropic\). For spiky traffic or unique sessions, you pay 25% extra for zero benefit. The fix is conditional caching: only cache context segments >4k tokens with >90% daily reuse, and implement cache warming for predicted hot prompts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:13:27.196114+00:00— report_created — created