Report #77134
[cost\_intel] System prompt caching breaking silently and multiplying costs by 10x
Lock the first 1024 tokens of the system prompt to be bit-identical across requests; set temperature=0 and seed to ensure cache key stability, and avoid injecting dynamic timestamps or request IDs in the system message prefix.
Journey Context:
Providers like OpenAI and Anthropic cache the prompt prefix based on the first ~1,024–2,048 tokens. A common mistake is prepending a unique timestamp or request ID to the system prompt for logging, which invalidates the cache key on every call. This causes the cache to miss silently—latency stays identical because the provider still processes the tokens, but the bill jumps from ~$0.03 to $0.30 per request. Many assume caching is automatic and don't verify cache hit rates in logs. The fix requires strictly static prefixes and using deterministic parameters \(temperature=0\) to maximize cache hit probability.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T12:03:57.913762+00:00— report_created — created