Report #84575
[frontier] Text chain-of-thought in multimodal agents produces hallucinated reasoning about image content that doesn't align with actual pixels
Require visual grounding CoT where each reasoning step must reference explicit coordinates or bounding boxes, forcing the model to 'point' at evidence before drawing conclusions, with validation that referenced regions exist
Journey Context:
Standard CoT encourages step-by-step reasoning, but in multimodal settings the model can spin theories that the pixels do not support, for example claiming 'the button is disabled' from text context when the screenshot shows it active. Visual grounding CoT requires the model to output bounding boxes or coordinates as part of each step: 'Region [0.45,0.30,0.55,0.40] shows gray background, indicating disabled state.' This makes each step falsifiable against the image, since the cited region can be cropped and checked. It also encourages attention alignment: to cite a region, the model has to process the area it claims to analyze. This pattern is emerging in specialized visual reasoning benchmarks but not yet in general agent frameworks, where 'think step by step' remains ungrounded.
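A minimal sketch of the validation step, not a reference implementation. It assumes each step follows the 'Region [x1,y1,x2,y2] claim' shape of the example above, with coordinates normalized to [0, 1] and ordered x1, y1, x2, y2 (the report does not specify the order). The names `parse_step`, `validate_chain`, and `region_looks_gray` are hypothetical, and the gray check is a crude stand-in for whatever visual verifier an agent would actually use.

```python
import re
from dataclasses import dataclass

# Assumed step format, modeled on the report's example:
#   "Region [x1,y1,x2,y2] <claim>", coordinates normalized to [0, 1].
REGION_RE = re.compile(
    r"Region\s*\[\s*([0-9.]+)\s*,\s*([0-9.]+)\s*,\s*([0-9.]+)\s*,\s*([0-9.]+)\s*\]\s*(.*)"
)


@dataclass
class GroundedStep:
    box: tuple  # (x1, y1, x2, y2), normalized
    claim: str


def parse_step(text):
    """Return a GroundedStep, or None if the step cites no usable region."""
    m = REGION_RE.search(text)
    if m is None:
        return None
    try:
        box = tuple(float(m.group(i)) for i in range(1, 5))
    except ValueError:  # e.g. "1.2.3" matches the pattern but is not a number
        return None
    return GroundedStep(box=box, claim=m.group(5).strip())


def box_is_valid(box):
    """A region 'exists' if it lies inside the image and has positive area."""
    x1, y1, x2, y2 = box
    return 0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0


def validate_chain(steps):
    """Return (index, reason) for every step that fails grounding."""
    errors = []
    for i, text in enumerate(steps):
        step = parse_step(text)
        if step is None:
            errors.append((i, "no region referenced"))
        elif not box_is_valid(step.box):
            errors.append((i, "region outside image or degenerate"))
    return errors


def region_looks_gray(image, box, max_spread=20):
    """Crude test of a 'gray background' claim.

    `image` is a list of rows of (r, g, b) tuples; `box` must already be valid.
    A pixel counts as gray when its channels are close to each other.
    """
    h, w = len(image), len(image[0])
    x1, y1, x2, y2 = box
    cols = range(int(x1 * w), max(int(x2 * w), int(x1 * w) + 1))
    rows = range(int(y1 * h), max(int(y2 * h), int(y1 * h) + 1))
    spreads = [max(image[r][c]) - min(image[r][c]) for r in rows for c in cols]
    return sum(spreads) / len(spreads) <= max_spread
```

In use, an agent loop would reject or re-prompt any chain for which `validate_chain` returns errors, and only then run per-claim checks such as `region_looks_gray` on the cited crop. Keeping the structural check separate from the pixel check means an ungrounded step is caught even when no verifier exists for its particular claim.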
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:33:03.093029+00:00— report_created — created