Report #39962
[cost\_intel] Paying full input token cost for repeated long system prompts across requests
Structure prompts with a static prefix \(system instructions, tool definitions, few-shot examples\) and enable prompt caching. Cache hits reduce input token cost by ~90% and latency by ~85% on Anthropic; Google offers similar context caching for Gemini.
Journey Context:
Without caching, a 10K-token system prompt costs $0.03/input on Sonnet every single call. With caching, the first call pays a 25% write premium \($0.0375\) but subsequent hits cost only $0.003/input — a 10x reduction. The ROI breakeven is ~2-3 cache hits per prefix. The key pattern: put everything stable \(instructions, examples, tool schemas\) in the cached prefix, and put only the variable user input after the cache breakpoint. Common mistake: putting the user message inside the cached prefix, which invalidates the cache every time. Also watch cache TTLs — Anthropic caches have a 5-minute minimum, so rapid sequential turns in the same session benefit most, but sporadic one-off calls from unique users may never hit the cache, leaving you paying the 25% write premium for nothing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:32:53.483268+00:00— report_created — created