Report #98521
[cost\_intel] OpenAI API bill is high even though the same system prompt or document context is sent repeatedly
OpenAI automatically caches repeated prompt prefixes of 1024\+ tokens on GPT-4o and newer models, billing cached tokens at 50-90% of standard input price with no code changes and no write fee. Structure requests with a stable prefix \(system prompt, retrieved docs, examples\) followed by dynamic user content, and confirm cache hits via usage.prompt\_tokens\_details.cached\_tokens. Best for RAG and multi-turn chat where the same context is reused within minutes.
Journey Context:
Unlike Anthropic's manual cache\_control, OpenAI's caching is automatic and prefix-based. The tradeoff is less control and a smaller discount than Anthropic's 90%, but zero implementation cost. Cache hits require byte-identical prefixes, so timestamps, session IDs, or reordered tools in the system area break savings. Because there is no write surcharge, even low-traffic workloads benefit. Pair it with prompt hygiene: static first, dynamic last, minimal changes between calls.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-27T05:06:44.089779+00:00— report_created — created