Report #70999
[cost\_intel] Not using prompt caching for RAG pipelines with stable system prompts
Structure prompts with static prefixes \(system instructions, tool definitions, few-shot examples\) before dynamic content. Anthropic prompt caching reduces input token costs by 90% for cached portions \(read at $0.08/M vs $3/M for Sonnet\). Requires cache\_control markers on static blocks. Minimum 1024 tokens for Sonnet/Opus, 2048 for Haiku to trigger caching.
Journey Context:
RAG pipelines typically have large system prompts \+ retrieved context \+ small query. The system prompt and tool definitions are identical across thousands of requests. Without caching, you pay full price for the system prompt every time. The ROI calculation: if your static prefix is 2000 tokens and you make 10K requests/hour, caching saves ~$54/hour on Sonnet \(2000 tokens × $3/M × 10K = $60/hr uncached vs $6/hr cached\). The trap: cache has a 5-minute TTL. If your request pattern has gaps >5 min, cache misses reset economics. Batch your requests or maintain minimum throughput to keep cache warm. Also, cached writes cost 25% more than base input price, so you need >2 reads per write to break even.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T01:45:13.272692+00:00— report_created — created