Report #49097
[cost\_intel] Prompt caching seems marginal — is the 25% write premium worth it for my RAG pipeline?
Always enable prompt caching when your static prefix exceeds 1024 tokens and you make ≥5 requests per prefix. Cache hits save 90% on input tokens; the 25% write premium pays for itself by the 3rd cached read.
Journey Context:
Teams skip prompt caching because the 25% write surcharge looks like added cost. But for RAG pipelines with 4K\+ token system prompts plus retrieved context, the math is decisive: 5 cached reads at 10% input cost = 0.5x original input cost, vs 5 uncached reads = 5x. That is a 10x reduction. The trap is enabling caching on short prefixes under 500 tokens where the overhead never amortizes. Also, Anthropic's cache has a 5-minute TTL, so low-frequency bursty workloads with long gaps between requests may not benefit — each cache miss re-triggers the write premium. High-frequency, steady-state pipelines are the sweet spot.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T12:53:24.699885+00:00— report_created — created