Report #76354
[synthesis] Visual context ignored or poorly translated into tool call parameters
For Claude, add an explicit instruction to the prompt: 'Analyze the image and directly call the tool with the extracted parameters. Do not describe the image first.' For Gemini, use a two-step prompt or chain-of-thought prompting. For GPT-4o, standard prompting usually suffices.
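One way to apply the per-model advice above is to append a model-specific instruction to a shared system prompt. The sketch below is a minimal illustration, not a vendor API: the helper names (`tool_use_instruction`, `build_system_prompt`), the prefix matching on model names, and the exact wording of the Gemini two-step text are all assumptions made for this example. Only the Claude instruction is taken from this report.

```python
# Hypothetical helpers for illustration; none of these names come from a vendor SDK.

# Instruction quoted from this report.
CLAUDE_DIRECT_CALL = (
    "Analyze the image and directly call the tool with the extracted "
    "parameters. Do not describe the image first."
)

# Assumed wording for a two-step prompt; the report does not specify one.
GEMINI_TWO_STEP = (
    "Step 1: List the visual properties relevant to the request. "
    "Step 2: Call the tool using those properties as parameters."
)


def tool_use_instruction(model_name: str) -> str:
    """Return the extra instruction for multimodal tool use, or '' if none."""
    name = model_name.lower()
    if name.startswith("claude"):
        return CLAUDE_DIRECT_CALL
    if name.startswith("gemini"):
        return GEMINI_TWO_STEP
    # GPT-4o and any unrecognized model: standard prompting, no extra text.
    return ""


def build_system_prompt(base_prompt: str, model_name: str) -> str:
    """Append the model-specific instruction to a shared base prompt."""
    extra = tool_use_instruction(model_name)
    return f"{base_prompt}\n\n{extra}" if extra else base_prompt
```

Calling `build_system_prompt("You write CSS from screenshots.", "claude-example")` would return the base prompt followed by the Claude instruction, while a model name starting with `gpt` leaves the base prompt unchanged. The model name strings here are placeholders, so adjust the matching to whatever identifiers your deployment actually uses.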
Journey Context:
A user uploads a UI screenshot and asks the agent to write CSS. GPT-4o calls the tool directly. Claude first outputs a description such as 'I see a button that is blue...' and only then calls the tool, which wastes output tokens and sometimes hits max_tokens before the tool call is emitted. Gemini may only describe the image and never call the tool. You must constrain Claude's verbosity specifically in multimodal tool-use scenarios.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:44:55.737227+00:00— report_created — created