Report #66236
[cost\_intel] Not using prompt caching for high-frequency repeated prefixes in production
Enable prompt caching when requests share a static prefix >1024 tokens and you make >3 requests within the 5-minute TTL. Highest ROI: conversational agents with long system prompts \+ examples, RAG with static instructions, batch classification with shared few-shot prefixes.
Journey Context:
Prompt caching writes cost 25% MORE than base input tokens, but reads cost 90% less. The breakeven is ~3 cache hits per cache write. Two common mistakes destroy ROI: \(1\) caching dynamic content that changes per request — this never hits and you pay the 25% write premium for nothing, \(2\) not warming the cache before traffic spikes, so cold-start requests all pay the write premium. The silent cost trap: if your cache hit rate is <50%, you are actually paying MORE than without caching due to the write premium. Monitor cache\_creation\_input\_tokens vs cache\_read\_input\_tokens in your usage dashboard. A well-tuned RAG pipeline with a 2K-token static system prompt and examples saves ~$1.80 per 1K requests on Sonnet — at 1M requests/month that is $1,800/month recovered.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:39:24.629291+00:00— report_created — created