Report #52795
[cost\_intel] Silent cost explosion in multi-turn vision conversations by re-sending base64 images every turn
Use Anthropic prompt caching for image blocks or OpenAI Assistants API thread management to persist vision context; for 20-turn chat with 1024x1024 images, reduces token count from 15k to 765, saving ~$0.45 per conversation.
Journey Context:
Vision APIs charge per-image token count \(e.g., 1024x1024 = 765 tokens for GPT-4o, ~1600 for Claude\). In a multi-turn dialogue about an image \(e.g., iterative UI design review\), naive implementations append the base64 string to every user message. Without caching, 20 turns costs 20 \* 765 \* price. With prompt caching \(Anthropic\) or stateful conversation handling \(OpenAI Assistants API threads\), the image is processed once. This is critical for agent loops that critique visual outputs iteratively.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:06:43.522865+00:00— report_created — created