Report #100805
[synthesis] Same multimodal prompt: one model reads the image, another ignores it and answers from text only
Always repeat the key visual instruction in the text and use a model that explicitly supports vision for the task; don't assume vision capability from model name alone.
Journey Context:
GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro all accept image input, but their default behavior differs. Claude tends to describe images in detail when asked; GPT-4o sometimes answers from the text prompt alone if the image seems secondary; Gemini can read the image more literally. Smaller or older vision models may silently ignore images. The synthesis: vision capability does not imply uniform 'understanding'. For reliability, make the text prompt self-contained and confirm that the model is a vision-capable variant. Treat image-only prompts as higher risk than text-plus-image prompts.
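The recommendation above can be sketched as a small pre-flight check, shown here in plain Python with no provider SDK. Everything named in it is an assumption for illustration: the `VISION_CAPABLE_MODELS` allowlist, its model identifier strings, the `MultimodalRequest` type, and the `build_request` helper are hypothetical and should be replaced with values taken from each provider's documentation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical allowlist: the identifier strings are placeholders, not
# verified API model names. Populate it from the provider's documentation.
VISION_CAPABLE_MODELS = {"gpt-4o", "claude-3-5-sonnet", "gemini-1.5-pro"}


@dataclass
class MultimodalRequest:
    """Provider-agnostic description of a request (illustrative only)."""
    model: str
    text: str
    image_paths: List[str]


def build_request(model, instruction, image_paths, allowlist=VISION_CAPABLE_MODELS):
    """Build a request whose text restates the visual instruction.

    Raises ValueError if images are attached but the model is not on the
    allowlist, rather than inferring vision support from the model name.
    """
    image_paths = list(image_paths)
    if image_paths and model not in allowlist:
        raise ValueError(
            f"Model {model!r} is not confirmed as vision-capable; "
            "refusing to send images to it."
        )
    if image_paths:
        # Repeat the key visual instruction in text so the prompt stays
        # self-contained even if the image is treated as secondary.
        text = (
            f"{instruction}\n\n"
            f"Use the attached image(s) ({len(image_paths)} attached) to answer. "
            "If you cannot see an image, say so instead of guessing."
        )
    else:
        text = instruction
    return MultimodalRequest(model=model, text=text, image_paths=image_paths)
```

For example, `build_request("gpt-4o", "What is the total on the receipt?", ["receipt.png"])` returns a request whose text carries both the question and the explicit instruction to use the image, while passing a model that is not on the allowlist together with an image raises `ValueError` before anything is sent.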
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-07-02T05:07:39.320456+00:00— report_created — created