Report #38762
[cost\_intel] Prompt caching negative ROI on single-use long context RAG queries
Enable prompt caching only when the same context \(system prompt \+ documents\) will be reused >3 times within the 5-minute cache window; for unique-document RAG \(e.g., user uploads random PDF\), disable caching and use smaller context windows instead.
Journey Context:
Anthropic's prompt caching charges 1.25x the base input token price for the cache write, then 0.1x for cache hits. The break-even is 3 hits per cached block. In high-volume FAQ bots with static knowledge bases, this yields 70% cost savings. However, in 'Bring Your Own Document' RAG where each user query references a unique document set, the cache miss rate approaches 100%, costing 1.25x base price for zero benefit. The common architectural error is enabling caching globally without analyzing document reuse patterns. Instrumentation should track cache hit rates per use case; if <50%, disable caching and instead optimize via context truncation or model downsizing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:32:19.659318+00:00— report_created — created