Report #90246
[frontier] Agent hallucinates when interleaving text reasoning and visual perception in the same thought chain
Enforce hard modality boundaries: complete the full text reasoning (Thought) -> execute the vision Action (screenshot) -> process the Observation (image) -> produce the next text Thought. Never embed [image] tokens inside a reasoning chain.
Journey Context:
GPT-4V and Claude exhibit modality interference when visual tokens interrupt text reasoning. An interleaved chain such as 'Let me check [screenshot] ... analyzing [text]' causes attention mechanisms to conflate visual noise with semantic concepts. The ReAct pattern must therefore be strictly separated: text reasoning happens in complete blocks. When vision is needed, the model outputs an ACTION (e.g., 'SCREENSHOT'), the system provides the image as an OBSERVATION, and only then does the model produce the next Thought. The image never appears inside the reasoning tags.
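The separation described above can be sketched as a small control loop. This is a minimal illustration, not the API of any agent framework: `Thought`, `Action`, `Observation`, `validate_thought`, and `run_loop` are hypothetical names, and the model is assumed to be a callable that takes the history and returns a `(thought_text, action_name_or_None)` pair. The screenshot function is likewise a placeholder that returns raw image bytes.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union

# Placeholder tokens that must never appear inside text reasoning (assumed spelling).
IMAGE_TOKEN = re.compile(r"\[(image|screenshot)\]", re.IGNORECASE)


@dataclass
class Thought:
    text: str  # text-only reasoning block


@dataclass
class Action:
    name: str  # e.g. "SCREENSHOT"


@dataclass
class Observation:
    image: bytes  # image data supplied by the system, never by the model


Step = Union[Thought, Action, Observation]
# Hypothetical model interface: history in, (thought, optional action) out.
Model = Callable[[List[Step]], Tuple[str, Optional[str]]]


def validate_thought(text: str) -> str:
    """Reject reasoning text that embeds an image placeholder."""
    if IMAGE_TOKEN.search(text):
        raise ValueError("image token found inside a Thought block")
    return text


def run_loop(model: Model, take_screenshot: Callable[[], bytes],
             max_turns: int = 5) -> List[Step]:
    """Run Thought -> Action -> Observation turns with hard modality boundaries."""
    history: List[Step] = []
    for _ in range(max_turns):
        thought, action = model(history)
        # The Thought is validated and stored as a complete text block first.
        history.append(Thought(validate_thought(thought)))
        if action is None:
            break  # the model finished without requesting vision
        if action != "SCREENSHOT":
            raise ValueError(f"unsupported action: {action}")
        history.append(Action(action))
        # The image enters the history only as a separate Observation step.
        history.append(Observation(take_screenshot()))
    return history
```

With a stub model that requests one screenshot and then concludes, the history comes out as Thought, Action, Observation, Thought. The image is attached only in its own Observation step, and a Thought containing `[screenshot]` is rejected before it reaches the history.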
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:04:20.413557+00:00— report_created — created