Report #50704
[synthesis] Model fails to call a tool when the trigger is visual (e.g., an image of a chart) rather than textual
For Gemini and Claude, include text in the user prompt explicitly describing the visual trigger and mapping it to the tool; do not rely on the model to infer tool use from raw image pixels alone.
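A minimal sketch of the fix above, assuming a plain-string user prompt. The `VisualTrigger` class, the `build_trigger_text` helper, and the `analyze_chart` tool name are hypothetical and only illustrate pairing a visual cue with a tool in text; they are not part of any vendor SDK. The image itself would still be attached through whatever API the agent uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualTrigger:
    """Hypothetical mapping from a visual cue to the tool it should trigger."""
    description: str  # what the image shows, e.g. "a line chart"
    tool_name: str    # tool the model should call when it sees that cue


def build_trigger_text(user_request: str, trigger: VisualTrigger) -> str:
    """Return prompt text that names the visual cue and the tool to call."""
    return (
        f"{user_request.strip()}\n\n"
        f"The attached image shows {trigger.description}. "
        f"When an image shows {trigger.description}, "
        f"call the `{trigger.tool_name}` tool before answering."
    )


# Example: the text part of a user turn that accompanies a chart image.
prompt = build_trigger_text(
    "Summarize the trend.",
    VisualTrigger(description="a line chart", tool_name="analyze_chart"),
)
print(prompt)
```

Keeping the cue description and the tool name in one sentence means the model does not have to infer the mapping from the image alone.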
Journey Context:
GPT-4o has strong multimodal reasoning and can often look at an image (like a graph) and autonomously decide to call a data analysis tool. Claude and Gemini often treat images as purely informational and require explicit textual instruction to connect visual features to tool invocations. Cross-model agents must translate visual cues into textual tool triggers to ensure reliable execution across non-GPT models.
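One way a cross-model agent might apply this is sketched below. It assumes the model family can be read from a name prefix; the prefix list reflects only the behavior described in this report and is unverified, and `needs_text_trigger`, `prompt_for_model`, and the model names `claude-x` and `gemini-x` are hypothetical.

```python
# Assumption: families that, per this report, need an explicit textual trigger.
NEEDS_TEXT_TRIGGER = ("claude", "gemini")


def needs_text_trigger(model_name: str) -> bool:
    """True if the model family is listed as needing a textual tool trigger."""
    name = model_name.lower()
    return any(name.startswith(prefix) for prefix in NEEDS_TEXT_TRIGGER)


def prompt_for_model(model_name: str, user_request: str,
                     cue: str, tool_name: str) -> str:
    """Append a textual visual-to-tool mapping only for models that need it."""
    if not needs_text_trigger(model_name):
        return user_request
    return (
        f"{user_request}\n\n"
        f"The image contains {cue}; call the `{tool_name}` tool to handle it."
    )


for model in ("gpt-4o", "claude-x", "gemini-x"):
    print(model, "->", repr(prompt_for_model(
        model, "Summarize.", "a bar chart", "analyze_chart")))
```

A simpler alternative is to add the textual trigger for every model; the per-family switch is shown only to mirror the distinction drawn in the context above.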
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T15:35:35.121212+00:00 - report_created - created