Report #88967
[cost_intel] Prompt caching break-even miscalculated or not enabled for large system prompts
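Enable prompt caching when your system prompt exceeds ~1000 tokens AND you expect at least 2 requests to share the same prefix while the cache entry is still alive. With a 25% write premium and a 90% read discount, the second request already puts you ahead (about 32% cheaper on the cached prefix), and savings on the prefix pass 80% at roughly 12 requests, whatever the prompt size. Do NOT enable it for single-shot calls with no shared prefix: you pay the write premium with no return.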
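The decision rule above can be sketched as a small helper that attaches a cache marker only when caching is likely to pay off. This is a minimal sketch under stated assumptions: the helper name, its parameters, and both threshold constants are hypothetical, and the `cache_control` field follows the shape of Anthropic's Messages API as commonly documented, so check the current API reference (including the minimum cacheable length for your model) before relying on it.

```python
# Hypothetical thresholds, not values taken from any provider's documentation.
MIN_CACHEABLE_TOKENS = 1024   # assumption: roughly the "~1000 tokens" floor
MIN_SHARED_REQUESTS = 2       # assumption: break-even with 1.25x write / 0.1x read


def build_system_blocks(system_prompt, est_prompt_tokens, expected_requests):
    """Return system content blocks, marking the prompt cacheable only when
    the prefix is long enough and will be reused by later requests."""
    block = {"type": "text", "text": system_prompt}
    worth_caching = (
        est_prompt_tokens >= MIN_CACHEABLE_TOKENS
        and expected_requests >= MIN_SHARED_REQUESTS
    )
    if worth_caching:
        # Marks the end of the static prefix that should be cached.
        block["cache_control"] = {"type": "ephemeral"}
    return [block]


# Chat app: long static prompt, many turns -> cache marker attached.
chat_blocks = build_system_blocks("You are a support agent...", 5000, 20)

# Single-shot call: no reuse -> no marker, so no write premium is paid.
one_off_blocks = build_system_blocks("You are a support agent...", 5000, 1)
```

The returned list would be passed as the `system` argument of a request; only the static prefix goes into it, never per-request user content.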
Journey Context:
Anthropic's prompt caching discounts cached reads by 90% but charges a 25% premium on the request that writes the cache. The break-even math does not depend on prompt size: over N requests sharing a prefix of P tokens, caching costs 1.25P + 0.1P(N - 1) token-equivalents versus NP uncached, so caching wins once N exceeds about 1.28, that is, from the second request (the first cache hit). Prompt size changes only the absolute amount saved: a 10K token prefix saves ten times as much per hit as a 1K token prefix, at the same break-even point. The hits must also arrive before the cache entry expires, otherwise you pay the write premium again. The common error is either not caching at all (leaving money on the table for chat apps) or caching everything including unique per-request content (paying premiums for no benefit). Cache boundaries must align with your static prefix: system prompt plus tool definitions are ideal; user messages are not. Google's context caching for Gemini has similar economics with a different pricing structure and configurable, typically longer TTLs (check Google's documentation for the current default).
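The arithmetic can be checked with a short calculation using the 25% write premium and 90% read discount quoted above. This is a sketch, not a billing tool: the function names are made up for illustration, it assumes every request after the first is a cache hit (no expiry in between), and it looks only at the shared prefix, ignoring per-request input and output tokens.

```python
WRITE_MULTIPLIER = 1.25  # first request writes the cache: 25% premium
READ_MULTIPLIER = 0.10   # later requests read the cache: 90% discount


def prefix_cost_ratio(n_requests):
    """Cached prefix cost divided by uncached prefix cost over n_requests.

    The prefix length cancels out, so the ratio is size-independent.
    """
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    cached = WRITE_MULTIPLIER + READ_MULTIPLIER * (n_requests - 1)
    return cached / n_requests


def break_even_requests():
    """Smallest number of requests at which caching is cheaper than not caching."""
    n = 1
    while prefix_cost_ratio(n) >= 1.0:
        n += 1
    return n


def requests_for_savings(target):
    """Smallest number of requests at which prefix savings reach `target` (0 to 1)."""
    # Savings approach but never reach 1 - READ_MULTIPLIER, so reject higher targets.
    if not 0 <= target < 1 - READ_MULTIPLIER:
        raise ValueError("target must be in [0, 0.9) for these multipliers")
    n = 1
    while 1 - prefix_cost_ratio(n) < target:
        n += 1
    return n


print(break_even_requests())       # 2
print(prefix_cost_ratio(2))        # 0.675, i.e. 32.5% saved on the prefix
print(requests_for_savings(0.8))   # 12
```

Because the prefix length cancels in the ratio, the same numbers hold for a 1K and a 10K token prompt; only the absolute savings differ.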
Lifecycle
2026-06-22T07:55:18.857544+00:00— report_created — created