Report #43957
[cost_intel] Re-sending identical system prompts and few-shot examples on every API call
Enable prompt caching for any prompt prefix reused across multiple requests; at a 25% write premium and a 90% read discount the premium is recovered on the first cache hit, and savings reach up to 90% on cached input tokens
Journey Context:
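Anthropic's prompt caching bills the first write of a prefix at a 25% premium over the base input price, then bills cached reads at 10% of that price (a 90% discount). For a 2K-token system prompt used across 1000 requests: without caching = 1000 × 2000 = 2M input tokens billed; with caching = 2.5K (one write at 1.25×) + 999 × 2000 × 0.1 = ~202K token-equivalents billed, assuming every request after the first hits the cache. That is a ~10x reduction. The trap: the cache has a 5-minute TTL that resets on each use, so low-frequency endpoints with <1 request per 5 minutes may never get a hit and instead pay the write premium on every call. At these rates a single hit already repays the premium (1.25 + 0.1 = 1.35× the prefix versus 2× uncached), so the condition that matters is that requests sharing the same prefix arrive less than 5 minutes apart; endpoints with >2-3 such requests per 5 minutes benefit reliably. Google's context caching for Gemini uses a different model with longer TTLs and minimum token counts, which can make it a better fit for very large cached contexts.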
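
The arithmetic and the TTL trap above can be checked with a small cost model. This is a minimal sketch, not a billing calculator: the function name `billed_prefix_tokens` and its parameters are hypothetical, the multipliers (1.25× write, 0.1× read) and the 5-minute TTL are taken from this report, and it assumes a request is a cache hit whenever it arrives within the TTL of the previous request using the same prefix. It counts only the shared prefix, not per-request user input or output tokens.

```python
from typing import Optional, Sequence


def billed_prefix_tokens(
    prefix_tokens: int,
    arrival_times_s: Sequence[float],
    ttl_s: float = 300.0,      # assumed 5-minute TTL, refreshed on every use
    write_mult: float = 1.25,  # assumed: cache write billed at base price + 25%
    read_mult: float = 0.10,   # assumed: cache read billed at 10% of base price
) -> float:
    """Return base-price token-equivalents billed for the shared prefix."""
    writes = 0
    reads = 0
    last_use: Optional[float] = None
    for t in sorted(arrival_times_s):
        if last_use is not None and t - last_use <= ttl_s:
            reads += 1   # cache entry still alive: discounted read
        else:
            writes += 1  # first request, or entry expired: pay the write premium
        last_use = t     # any use (write or read) restarts the TTL
    return prefix_tokens * (writes * write_mult + reads * read_mult)


if __name__ == "__main__":
    prefix = 2000
    uncached = prefix * 1000                   # 2,000,000 tokens without caching
    busy = [float(i) for i in range(1000)]     # one request per second
    sparse = [i * 600.0 for i in range(1000)]  # one request every 10 minutes

    print("uncached:", uncached)
    print("busy endpoint:", round(billed_prefix_tokens(prefix, busy)))      # about 202,300
    print("sparse endpoint:", round(billed_prefix_tokens(prefix, sparse)))  # about 2,500,000
```

Under these assumptions the busy endpoint lands at roughly a tenth of the uncached cost, while the sparse endpoint never gets a hit and ends up 25% more expensive than not caching at all. Feeding real request timestamps from logs into `arrival_times_s` gives a quick estimate of whether a given endpoint is worth caching.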
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:15:12.826077+00:00 - report_created - created