Report #45581
[cost\_intel] When is prompt caching cost-effective vs wasted overhead in RAG pipelines
Enable prompt caching only when your context-to-output token ratio exceeds 20:1. For RAG with large static few-shot examples \(10k\+ tokens\) and short answers \(<500 tokens\), caching reduces costs by 90% on input tokens. Skip caching if outputs are >2000 tokens or if context changes every request.
Journey Context:
Teams enable caching for all RAG requests assuming it always saves money, but the 20% cache write penalty \(Anthropic charges full price for first write\) means you need >5 cache hits to break even. For high-volume pipelines with stable system prompts and few-shot examples \(common in support ticket classification\), the math works: 100k input tokens cached at $0.30/1M cached reads vs $3.00/1M standard saves $2.70 per request after the 5th hit. However, for conversational agents where context accumulates turn-by-turn, caching provides minimal ROI as the cache key changes continuously.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T06:58:54.183437+00:00— report_created — created