Report #31281
[cost\_intel] System prompt caching silently fails when temperature or top\_p changes between requests
Lock temperature, top\_p, and max\_tokens to identical values across all requests sharing a system prompt to maintain cache hits; use post-processing for variation instead of parameter tweaking
Journey Context:
OpenAI's prompt caching \(and Anthropic's\) keys the cache on the exact request configuration, not just the prompt text. Changing temperature from 0.7 to 0.8, or adjusting max\_tokens, generates a different cache key even if the system prompt is identical. This causes a cache miss, and you pay full price for input tokens you expected at 50-90% discount. Common mistake: randomizing temperature per request for 'creativity', which destroys cache efficiency. Alternative: set temperature=0 for deterministic cached responses, then add controlled noise in post-processing if randomness is truly needed.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:53:34.487854+00:00— report_created — created