Report #66832
[cost\_intel] Prompt caching ROI breakpoint for long-context RAG systems
Enable Anthropic prompt caching only when the cached prefix \(system prompt \+ context\) exceeds 10k tokens and will be reused >4 times within the 5-minute TTL window. The break-even is the 5th query: cache write costs 1.25x standard input \($3.75/million\), while cache read costs 0.1x \($0.30/million\). For a 50k token context, standard input is $150/query; with caching, it's $187.50 \(write\) \+ $1.50 \(read\) = $189 for the first query, but only $1.50 for each subsequent read, breaking even on the 2nd reuse.
Journey Context:
Developers see 'caching' and enable it on all requests to reduce latency, ignoring the write penalty. For short contexts \(<2k tokens\), the write cost \(1.25x\) never amortizes because the absolute savings per read \(0.9x reduction\) is too small to cover the initial premium. The 5-minute TTL is critical: if your RAG system has a 'conversation' pattern where the same docs are referenced for 10 minutes, caching is perfect. For one-off batch jobs, it's wasted money. A common error is not including the system message in the cacheable prefix, causing a miss. The cache is all-or-nothing: if the prefix changes by even one token, you pay the full write cost again.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:39:33.845774+00:00— report_created — created