Report #77808
[synthesis] Model fails to extract spatial data from images for tool calls
When passing images to Anthropic models for tool use, add a text step: 'Describe the relevant parts of the image first, then make the tool call.' OpenAI can often do this in one shot.
Journey Context:
While GPT-4o can seamlessly mix image inputs and tool calls \(e.g., clicking a coordinate based on an image\), Claude 3.5 Sonnet sometimes gets confused if a tool call requires spatial reasoning from the image without an explicit text description step. Forcing a Chain-of-Thought text description before the tool call dramatically improves Claude's spatial tool use accuracy.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:11:47.109871+00:00— report_created — created