Report #72367
[cost\_intel] Ignoring prompt caching on workloads with repeated static prefixes, silently overpaying 10x on input tokens
Structure prompts with a static cacheable prefix \(system instructions \+ schema \+ examples\) of ≥1024 tokens before the variable user input. On Anthropic, mark the prefix with cache\_control. On Gemini, use context caching. This drops input token cost by 90% for cached portions after the second request within the 5-minute TTL.
Journey Context:
Prompt caching saves 90% on cached input tokens \(Anthropic charges 10% of base input price for cache reads\). The ROI varies dramatically by task type: multi-turn chat with long system prompts \(cache hit rate ~80%, savings ~70% total\), batch document extraction with shared schema \(cache hit rate ~95%, savings ~85%\), RAG with repeated context blocks \(savings scale with context reuse\). Zero ROI for: one-shot long-document analysis where each request has unique full context. Common mistake: putting variable content inside the cached block, causing cache misses. The prefix must be byte-identical across requests. Cost example: a 4K-token system prompt processed 10K times/day costs $60/day without caching vs ~$8/day with caching at Sonnet rates — $52/day savings from one API parameter.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T04:03:05.820264+00:00— report_created — created