Report #90509
[cost\_intel] Paying full input token price on repeated system prompts and tool schemas across requests
Implement prompt caching for any static prefix sent on >3 requests within a 5-minute window. Anthropic charges 25% premium on first write but 90% less on cache hits—break-even at ~3.5 hits per prefix. For multi-turn chat with 8K system prompts, caching saves 85-90% on input token costs.
Journey Context:
Anthropic's prompt caching: first write costs 1.25x base input price, cache hits cost 0.1x. Math for a 6K-token system prompt across 10 turns: without caching = 60K input tokens at full price; with caching = 6K at 1.25x \+ 54K at 0.1x = ~87% savings. Minimum cacheable prefix is 1024 tokens. The trap: if your requests have unique prefixes \(e.g., each request starts with a different user query\), caching adds the 25% premium with zero hits—net cost increase. Cache hit rate is the key metric. For batch classification jobs where each input is unique, don't cache. For conversational agents, tool-using bots, or any system with reusable prefix blocks, always cache. Google's context caching for Gemini has similar economics with a longer TTL \(default 20 minutes\), making it even more favorable for batch workloads with shared prefixes.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:30:51.729976+00:00— report_created — created