Report #73920
[cost\_intel] Prompt caching costs increase 20% on low repeated query volumes
Enable Anthropic/DeepSeek prompt caching only when context >4k tokens and repeated query rate >60%; otherwise, disable to avoid 25% cache-write premium that amortizes only after >4 repetitions.
Journey Context:
Engineers enable caching universally after reading '90% cheaper on cache hits' without calculating the break-even. For a RAG system with 90% cache hit rate but only 2-turn average conversations, the math fails: first call pays 1.25x, second call pays 0.1x, total is 1.35x vs 2.0x without caching—a 32% saving, not 90%. If the conversation is single-turn \(common in classification\), cost increases 25%. The correct heuristic is: caching is profitable only when the same context prefix is reused >4 times or conversation length >5 turns.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:40:24.128993+00:00— report_created — created