Report #46637
[frontier] Agent with vision capabilities defaults to text descriptions and outputs wrong coordinates or spatial relationships
Enforce strict modality isolation: create separate 'vision_analyze' and 'text_reason' tools; forbid the LLM from outputting coordinates or spatial judgments in text-only mode. Force tool use for spatial reasoning.
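A minimal sketch of the two-tool split, assuming a Python agent harness. Only the tool names 'vision_analyze' and 'text_reason' come from this report; `BoundingBox`, the `Detector` signature, and `fake_detector` are hypothetical stand-ins for whatever vision backend is actually in use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical structured result for one detected UI element."""
    label: str
    x: int
    y: int
    width: int
    height: int

    def center(self) -> Tuple[int, int]:
        # Integer pixel center of the box
        return (self.x + self.width // 2, self.y + self.height // 2)


# Assumed detector signature: raw image bytes in, list of boxes out.
Detector = Callable[[bytes], List[BoundingBox]]


def vision_analyze(image: bytes, detector: Detector) -> List[Dict]:
    """The only tool allowed to emit coordinates; returns structured data."""
    return [
        {
            "label": box.label,
            "bbox": [box.x, box.y, box.width, box.height],
            "center": list(box.center()),
        }
        for box in detector(image)
    ]


def text_reason(question: str, observations: List[Dict]) -> str:
    """Text-only tool: may refer to labels, never to coordinates."""
    labels = ", ".join(o["label"] for o in observations) or "nothing"
    return f"Question: {question}\nVisible elements: {labels}"


def fake_detector(image: bytes) -> List[BoundingBox]:
    # Stand-in for a real vision model; always reports one fixed element.
    return [BoundingBox("red button", 40, 120, 80, 30)]
```

In this sketch the click target would be read from `vision_analyze(...)[0]["center"]`, while `text_reason` only ever sees and repeats labels, so coordinates never pass through free-form text.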
Journey Context:
VLMs exhibit 'modality inertia': they default to text descriptions even when given images, which loses spatial precision (coordinates, relative positioning). In testing, an agent given a screenshot will describe 'the red button on the left', but when asked for pixel coordinates it hallucinates them, because the text reasoning path is not grounded in visual features. The fix is not better prompting ('think step by step') but architectural separation: vision analysis must be a distinct tool call that returns structured data (coordinates, bounding boxes). The agent cannot 'think' its way to coordinates in text; it must 'see' them. This prevents the common failure mode of hallucinated coordinates in text reasoning chains.
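One way to enforce the "no coordinates in text" rule is a guard on the text-only output path. The sketch below is an assumption, not part of the original report: the regex patterns are a rough heuristic (they will miss some formats and can flag harmless number pairs), and `check_text_output` and `UngroundedCoordinateError` are hypothetical names.

```python
import re

# Heuristic patterns for coordinate-like text. Illustrative, not exhaustive.
_COORD_PATTERNS = [
    re.compile(r"\(\s*\d{1,5}\s*,\s*\d{1,5}\s*\)"),               # (120, 340)
    re.compile(r"\b[xy]\s*[=:]\s*\d{1,5}\b", re.IGNORECASE),      # x=120, y: 340
    re.compile(r"\b\d{1,5}\s*(?:px|pixels?)\b", re.IGNORECASE),   # 120px, 30 pixels
]


class UngroundedCoordinateError(ValueError):
    """Raised when text-only output appears to contain coordinates."""


def check_text_output(text: str) -> str:
    """Return text unchanged, or raise if it looks like it states coordinates.

    The caller is expected to respond to the error by routing the request
    to the vision tool instead of accepting the text answer.
    """
    for pattern in _COORD_PATTERNS:
        match = pattern.search(text)
        if match:
            raise UngroundedCoordinateError(
                f"coordinate-like text {match.group(0)!r} in text-only output; "
                "use the vision tool instead"
            )
    return text
```

A qualitative description such as "the red button on the left" passes through, while "click at (120, 340)" is rejected, which pushes the agent back to a vision tool call for anything spatial.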
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:45:16.191427+00:00— report_created — created