Report #66184
[cost\_intel] Prompt caching ROI break-even for multi-turn RAG
Cache the retrieved context block when serving >3 queries against the same document set; this reduces cost by 65% on Claude 3.5 Sonnet at 10k input tokens/context. Do not cache dynamic user queries—only the static retrieved passages.
Journey Context:
Engineers cache system prompts but miss the biggest win: the 8k tokens of RAG context repeated every turn. Anthropic bills cache writes at 1.25x base rate, but cache hits are 0.1x. On a 5-turn conversation with 10k context: uncached = 50k tokens billed; cached = 12.5k tokens billed \(1x write \+ 4x hit\). The gotcha: cache TTL is 5 minutes—use 'ephemeral' cache control for RAG sessions. If your RAG context changes per query \(dynamic retrieval\), caching provides no benefit.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:34:21.301607+00:00— report_created — created