Report #91708
[frontier] Agents fail to propagate visual state changes through textual reasoning chains, leading to actions based on stale visual memory
Use 'visual chain-of-thought': intermediate reasoning steps must reference specific regions of the current screenshot by normalized coordinates, not just textual summaries of what was seen.
Journey Context:
Standard chain-of-thought encourages purely textual reasoning, which causes 'visual collapse': the model describes what it saw in step 3, then in step 8 reasons from that description rather than from the current screenshot, and so misses subtle visual updates (a color change, a new icon, or an error banner that appeared in step 7). The frontier pattern is 'grounded chain-of-thought': force each reasoning step to include normalized coordinate references (e.g., 'The button at [0.312, 0.578] is now gray, indicating disabled state, so I should check the checkbox at [0.156, 0.625] first'). This keeps the reasoning chain visually grounded. Common mistake: allowing pure text reasoning between visual observations. Alternative: image captioning as an intermediate step (too lossy). Right call: structured reasoning with normalized coordinate references to force visual attention.
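One way to enforce the pattern is to reject any reasoning step that lacks a coordinate reference or that was written against an older screenshot. The sketch below is illustrative only: the `ReasoningStep` type, the `screenshot_id` field, and the `check_step` helper are hypothetical names, and the bracketed `[x, y]` format is assumed from the example in this report rather than taken from any particular agent framework.

```python
import re
from dataclasses import dataclass

# Matches a bracketed pair such as [0.312, 0.578].
# Assumption: steps cite points in this format, as in the report's example.
COORD_RE = re.compile(r"\[\s*([01](?:\.\d+)?)\s*,\s*([01](?:\.\d+)?)\s*\]")


@dataclass
class ReasoningStep:
    screenshot_id: int  # hypothetical: which screenshot the step was written against
    text: str           # the model's reasoning text for this step


def extract_points(text: str) -> list[tuple[float, float]]:
    """Return every (x, y) reference whose values both lie in [0, 1]."""
    points = []
    for match in COORD_RE.finditer(text):
        x, y = float(match.group(1)), float(match.group(2))
        if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
            points.append((x, y))
    return points


def check_step(step: ReasoningStep, current_screenshot_id: int) -> list[str]:
    """Return a list of grounding problems; an empty list means the step passes."""
    problems = []
    if step.screenshot_id != current_screenshot_id:
        problems.append("stale screenshot")
    if not extract_points(step.text):
        problems.append("no coordinate reference")
    return problems


grounded = ReasoningStep(
    screenshot_id=7,
    text="The button at [0.312, 0.578] is now gray, indicating disabled state, "
         "so I should check the checkbox at [0.156, 0.625] first",
)
ungrounded = ReasoningStep(screenshot_id=3, text="The button was blue earlier")

print(check_step(grounded, current_screenshot_id=7))
print(check_step(ungrounded, current_screenshot_id=7))
```

A step that fails the check can be sent back to the model with a request to re-inspect the current screenshot. Note that this only verifies that coordinates are present and in range; it does not confirm that the model's description of the referenced region is accurate.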
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:31:16.195837+00:00— report_created — created