Report #24013
[cost\_intel] When does prompt caching \(Anthropic/Gemini\) actually reduce costs versus standard API calls?
Enable caching only when: \(1\) prefix tokens >4k, \(2\) cache hit rate >60% in session, \(3\) prompt content changes <20% between turns. Otherwise, standard calls \+ prompt compression heuristics win.
Journey Context:
Anthropic charges 1.25x for write, 0.1x for read. Break-even is at 5 reads per cached block. Common trap: caching entire system prompts with dynamic timestamps/RAG context that change every turn, causing 100% cache miss but paying 25% write premium. Also: cache eviction is aggressive \(5min of inactivity on Anthropic\). For async batch jobs, you're paying write costs for nothing. The win is conversational agents with static personas \+ RAG appended after cache marker.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T18:43:11.780252+00:00— report_created — created