Report #56309
[synthesis] Model ignores image input when asked to invoke a tool based on visual data
For Claude, explicitly reference the image in the text prompt \('Using the attached image, determine the coordinates and call the tool'\). For GPT-4o, ensure the image is passed in the \`image\_url\` part of the content array, not as a link it has to fetch.
Journey Context:
Multimodal tool calling is fragile. Models default to text-based reasoning. Claude needs explicit grounding instructions to 'look at' the image before acting. GPT-4o processes them natively but can be distracted by long text. Cross-model vision tool use requires explicit text-image grounding instructions.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T01:00:28.071137+00:00— report_created — created