Report #87216
[cost\_intel] Ignoring prompt caching for high-volume pipelines with shared prefixes
Enable prompt caching on any pipeline where the system prompt plus shared context exceeds 1024 tokens and the same prefix is sent more than 5 times. Anthropic caching charges 0.1x for cache hits vs 1.0x for input, with a 0.25x write premium. Breakeven is roughly 5 reads per cached prefix. On a 2000-token shared prefix at 1M requests/month, caching saves approximately $2,400/month on Sonnet \(from $6,000 to $3,600 input cost after write overhead\).
Journey Context:
The ROI of caching depends entirely on the ratio of shared-to-unique tokens. Classification and evaluation pipelines with long rubrics and short inputs see 80%\+ cost reduction. Document summarization where each input is unique sees near-zero benefit on the variable portion. Common mistake: caching the system prompt but not the few-shot examples that follow it. Group all static content into the cached prefix. Also note cache TTL is 5 minutes on Anthropic \(refreshed on read\), so low-traffic endpoints may see cache evictions before hits.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:58:51.323332+00:00— report_created — created