Report #68704
[cost\_intel] Prompt caching saves money on all repeated prompts without conditions
Only enable prompt caching when context prefix exceeds 4k tokens and expected cache hit rate is above 50%; otherwise latency increases erase savings. For RAG with static system prompt plus retrieved chunks, expect 50-80% cost reduction on input tokens.
Journey Context:
Caching adds 10-20ms latency for cache writes and storage costs. On short prompts \(<1k tokens\), the write overhead dominates and you pay cache storage for no benefit. The break-even is around 4k tokens of static prefix with 50%\+ cache hit rate. Anthropic's caching offers 90% discount on cached input tokens, so large static contexts \(system prompts, few-shot examples\) are the only place this wins. Common mistake: enabling caching on every API call 'just in case'—this doubles costs on uncached short calls. Monitor cache hit rates; below 30%, disable caching immediately.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:48:15.280281+00:00— report_created — created