Report #62846
[cost_intel] Prompt caching misses on vision requests despite identical text prompts
Do not rely on prompt caching for requests that include image inputs. If a request contains image URLs or base64 image data, the system prompt text is not cached. Either switch to text-only processing (OCR pre-processing with a cheaper model) or accept that vision requests pay the full input token price every time.
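Before changing a pipeline, it may be worth confirming whether your own requests actually miss the cache. A minimal sketch, assuming the response's usage data is available as a plain dict and that it carries a `prompt_tokens_details.cached_tokens` field, as Chat Completions responses do at the time of writing. The helper name `cached_fraction` is hypothetical.

```python
def cached_fraction(usage: dict) -> float:
    """Return the share of prompt tokens that were served from the cache.

    `usage` is assumed to be the usage object of an API response,
    converted to a plain dict. Missing fields are treated as zero.
    """
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached_tokens = details.get("cached_tokens", 0)
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens


# Example with made-up numbers: a vision request that reports no cached tokens.
vision_usage = {"prompt_tokens": 3000, "prompt_tokens_details": {"cached_tokens": 0}}
print(f"cached share: {cached_fraction(vision_usage):.0%}")
```

A share that stays at zero across repeated requests with identical text would be consistent with the behavior described in this report. A non-zero share would suggest the cache is being hit after all.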
Journey Context:
OpenAI's prompt caching mechanism explicitly excludes image inputs from caching. Even if the text portion (system prompt + user text) is byte-identical to a cached entry, the presence of any image data in the messages array invalidates the cache lookup. This is documented in the limitations section. The cost impact is severe for vision-heavy applications: instead of paying $0.005/1K for cached input tokens, you pay $0.01/1K (or more for high-res images), and the image tokens themselves (which can number in the thousands per image) are never cached. A common pitfall is a 'document analysis' pipeline that sends a static system prompt ('You are a document analyzer...') plus a PDF image with every request. Because of the image, the system prompt is never cached, and input costs run 50-100% higher than expected. The only workaround is to OCR the image first using a cheaper vision model or service, then send the extracted text, which can be cached.
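The cost gap can be estimated with a few lines of arithmetic. The sketch below uses the two per-1K prices quoted in this report. The token counts and the function names `input_cost` and `overrun_percent` are hypothetical, chosen only for illustration. It assumes image tokens are billed at the uncached rate in both scenarios.

```python
# Prices per 1K input tokens, taken from this report (assumed, not verified).
CACHED_PRICE_PER_1K = 0.005
UNCACHED_PRICE_PER_1K = 0.01


def input_cost(text_tokens: int, image_tokens: int, text_cached: bool) -> float:
    """Input cost in dollars for one request.

    Image tokens are always charged at the uncached rate; text tokens are
    charged at the cached rate only when `text_cached` is True.
    """
    text_price = CACHED_PRICE_PER_1K if text_cached else UNCACHED_PRICE_PER_1K
    return (text_tokens * text_price + image_tokens * UNCACHED_PRICE_PER_1K) / 1000


def overrun_percent(text_tokens: int, image_tokens: int) -> float:
    """How much more the request costs than a budget that assumed cached text."""
    expected = input_cost(text_tokens, image_tokens, text_cached=True)
    actual = input_cost(text_tokens, image_tokens, text_cached=False)
    return (actual - expected) / expected * 100


# Hypothetical request: 2,000-token static prompt plus a 1,000-token image.
print(round(overrun_percent(2000, 1000), 1))
```

Under these assumptions the overrun is about 50% for the example request and approaches 100% as the image share shrinks toward zero. When image tokens outnumber text tokens, the percentage overrun falls below 50%, although the absolute cost per request is higher.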
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T11:58:13.274750+00:00— report_created — created