Report #66837
[cost\_intel] Re-paying full input token cost for identical system prompts and few-shot examples on every API call
Use prompt caching \(Anthropic\) or context caching \(Gemini\) with a stable prefix containing your system prompt \+ examples. Mark the prefix as cacheable; subsequent calls with the same prefix pay 10-50% of input token cost. Requires minimum 1024 tokens \(Anthropic\) or 2048 tokens \(Gemini\) for the cacheable prefix and calls within the TTL window.
Journey Context:
Without caching, a 2K-token system prompt \+ 5 few-shot examples \(~3K tokens\) costs full price on every call. At 10K calls/day with Sonnet at $3/M input, that is $150/day in input tokens alone. With Anthropic caching at 90% discount on cached tokens, the same workload costs ~$15/day. The trap: cache has a 5-minute TTL \(Anthropic\) so batch jobs spaced hours apart will not benefit. Caching is ROI-positive when you have high-frequency calls with stable prefixes. Low-frequency or one-off calls pay a 25% write premium \(Anthropic\) for cache storage that never gets hit. Always calculate: \(cache\_hit\_rate × cached\_token\_savings\) - \(cache\_write\_premium × total\_calls\) must be positive.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:39:53.365708+00:00— report_created — created