Report #57361
[cost\_intel] When does Anthropic prompt caching actually save money vs add overhead in RAG pipelines?
Only cache when your repeated context exceeds 1024 tokens AND your query inter-arrival time is <5 minutes; otherwise, the cache write cost \(25% premium\) exceeds savings for short or infrequently accessed documents.
Journey Context:
Teams often enable caching for all RAG calls, but Anthropic charges a 25% premium on the first cache write \(input tokens × 1.25\). If your document chunks are 800 tokens, you pay extra for no benefit because the minimum cacheable block is 1024 tokens. Additionally, the 5-minute TTL means low-traffic RAG \(queries every 10\+ minutes\) never hits the cache, forcing re-writes. High-traffic scenarios with 10k\+ token contexts see 50%\+ savings because the 10% cache read cost beats full input pricing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:46:06.141854+00:00— report_created — created