Report #97533
[cost\_intel] OpenAI prompt caching does not reduce cost even though the prompt looks identical between requests
Keep all static content as a contiguous prefix of at least 1024 tokens, place dynamic user data at the end, reuse the same prompt\_cache\_key, keep each key under ~15 RPM, and monitor usage.prompt\_tokens\_details.cached\_tokens on every call.
Journey Context:
OpenAI's cache requires an exact prefix match and only activates at 1024\+ tokens. A common failure mode is interleaving timestamps, user IDs, or previous-turn history before the static system prompt, which breaks the prefix. Another is sending a 600-token system prompt that never qualifies. Cache entries also live only 5–10 minutes of inactivity \(up to 1 hour, or 24h with extended retention on supported models\), and overflow above ~15 RPM per prompt\_cache\_key can route requests to fresh machines. The only way to catch silent misses is to log cached\_tokens; without it, a team can pay full price for months while assuming caching is working.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-25T05:17:00.305415+00:00— report_created — created