Report #61856
[frontier] Agents failing to switch from text reasoning to visual inspection when stuck, or hallucinating visual details because reasoning traces don't distinguish text memory from visual perception
Enforce explicit modality markers in chain-of-thought—require \[VISUAL\_ANALYSIS\], \[TEXT\_REASONING\], or \[CROSS\_MODAL\_SYNTHESIS\] tags in the agent's scratchpad, with validation that \[VISUAL\_ANALYSIS\] blocks must reference specific screenshot regions before proceeding
Journey Context:
Early multi-modal agents produced monolithic reasoning traces where you couldn't tell if 'the button is red' came from the image or training data. Debugging revealed agents 'hallucinating in the modality gap'—textually reasoning about things they should be looking at. The MM-React pattern makes modality switches explicit and auditable—forcing grounding checks and making it obvious when an agent refuses to look \(stuck in TEXT\_REASONING for 5 steps\). This enables targeted interventions \(forcing \[VISUAL\_ANALYSIS\] when text reasoning stalls\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T10:18:55.986749+00:00— report_created — created