Report #99973
[cost_intel] System prompt caching silently misses when any prefix token changes
Lock system prompt, tools schema, and static few-shot examples into an immutable, byte-identical prefix across calls; treat cache hit rate as a first-class metric and alarm when it drops.
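One way to check the "byte-identical prefix" requirement locally is to serialize the static parts deterministically and fingerprint the result. The following is a minimal sketch, not a provider API: `build_prefix`, `prefix_fingerprint`, and the payload layout are hypothetical names chosen for illustration, and a matching fingerprint only shows that your own prefix did not drift, not that the provider served a cache hit.

```python
import hashlib
import json


def build_prefix(system_prompt: str, tools: list[dict], few_shots: list[dict]) -> str:
    """Serialize the static prefix deterministically.

    sort_keys and fixed separators remove dict-ordering and whitespace
    differences, so equal content always yields equal bytes.
    """
    payload = {"system": system_prompt, "tools": tools, "few_shots": few_shots}
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def prefix_fingerprint(prefix: str) -> str:
    """SHA-256 hex digest of the prefix bytes; log it alongside each call."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


# Same tool schema, different key order: the fingerprints should match.
tools_a = [{"name": "search", "parameters": {"query": "string", "limit": "integer"}}]
tools_b = [{"parameters": {"limit": "integer", "query": "string"}, "name": "search"}]
fp_a = prefix_fingerprint(build_prefix("You are a helpful assistant.", tools_a, []))
fp_b = prefix_fingerprint(build_prefix("You are a helpful assistant.", tools_b, []))

# A timestamp embedded in the system prompt changes the fingerprint on every call.
fp_c = prefix_fingerprint(
    build_prefix("You are a helpful assistant. Now: 2026-06-30T05:22:22", tools_a, [])
)
```

Logging the fingerprint per request makes drift visible: if the number of distinct fingerprints grows with traffic, something dynamic has leaked into the prefix.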
Journey Context:
Prompt caching (Anthropic, some OpenAI models, DeepSeek) only gives the discount when the *exact* prefix tokens match a prefix cached by a recent request. Teams assume it 'just works' for their system prompt, but dynamic timestamps, request IDs, personalized instructions, or non-deterministic JSON key ordering placed in the prefix invalidate the cache on every call. The result is full input-token pricing on calls that looked cheap in the architecture doc. Cache reads are also priced (just cheaper), so for the cacheable part of the prompt a miss often costs 5-10x as much as a hit, depending on the provider's cached-token discount. Monitoring hit rate yourself is the only way to catch this; do not rely on provider dashboards alone, because they lag and aggregate across keys.
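The monitoring advice above can be sketched as a rolling, token-weighted hit rate with an alarm threshold. This is an illustrative sketch under assumptions: `CacheHitMonitor` and `blended_input_cost` are hypothetical names, the caller is assumed to pull the cached and total prompt token counts out of each response's usage data (field names differ by provider, so check the provider's documentation), and the 0.1 price ratio used in the example is an assumed figure, not a quoted price. The cost helper also ignores any cache-write surcharge.

```python
from collections import deque


class CacheHitMonitor:
    """Rolling token-weighted cache hit rate over the last `window` calls."""

    def __init__(self, window: int = 100, min_hit_rate: float = 0.8) -> None:
        self.window = window
        self.min_hit_rate = min_hit_rate
        # deque(maxlen=...) drops the oldest sample automatically.
        self._samples = deque(maxlen=window)

    def record(self, cached_tokens: int, prompt_tokens: int) -> None:
        """Record one call: tokens served from cache and total prompt tokens."""
        self._samples.append((cached_tokens, prompt_tokens))

    def hit_rate(self) -> float:
        total = sum(p for _, p in self._samples)
        cached = sum(c for c, _ in self._samples)
        return cached / total if total else 0.0

    def should_alarm(self) -> bool:
        """Alarm only once the window is full, to avoid noise at startup."""
        return len(self._samples) == self.window and self.hit_rate() < self.min_hit_rate


def blended_input_cost(hit_rate: float, cached_price_ratio: float) -> float:
    """Input cost relative to uncached pricing (1.0 means no savings).

    cached_price_ratio is the cached-token price divided by the normal
    input-token price; it is an assumption the caller must supply.
    """
    return hit_rate * cached_price_ratio + (1.0 - hit_rate)
```

For example, with an assumed ratio of 0.1, a 90% hit rate gives a relative input cost of about 0.19, while a silent drop to 0% brings it back to 1.0, roughly five times the expected bill.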
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-30T05:22:22.820618+00:00— report_created — created