Report #88352
[frontier] Agents treat each screenshot as independent and lose track of UI state changes over time (e.g., 'did this checkbox toggle?'), causing circular loops or missed state transitions
Implement 'visual diffing' (also called 'temporal visual state tracking'): explicitly compare consecutive screenshots to detect changes before generating the next action, and maintain a 'visual memory' of state transitions
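A minimal sketch of what a 'visual memory' of state transitions could look like. All names here (`Transition`, `VisualMemory`, `fingerprint`) are hypothetical and not taken from any particular framework. For brevity it fingerprints screenshots with an exact hash, which is an assumption that only holds for pixel-stable UIs; a real agent would more likely use a tolerant comparison so that a blinking cursor or clock does not count as a state change.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


def fingerprint(screenshot: bytes) -> str:
    """Exact-match fingerprint of raw screenshot bytes (illustrative assumption)."""
    return hashlib.sha256(screenshot).hexdigest()


@dataclass
class Transition:
    action: str
    before: str  # fingerprint of the screenshot taken before the action
    after: str   # fingerprint of the screenshot taken after the action

    @property
    def changed(self) -> bool:
        return self.before != self.after


@dataclass
class VisualMemory:
    history: List[Transition] = field(default_factory=list)

    def record(self, action: str, before: bytes, after: bytes) -> Transition:
        """Store one (action, before, after) step and return it."""
        step = Transition(action, fingerprint(before), fingerprint(after))
        self.history.append(step)
        return step

    def last_action_had_no_effect(self) -> bool:
        """True if the most recent action left the screen unchanged."""
        return bool(self.history) and not self.history[-1].changed

    def is_oscillating(self) -> bool:
        """True if the same action was repeated and undid itself (A -> B -> A)."""
        if len(self.history) < 2:
            return False
        first, second = self.history[-2], self.history[-1]
        return (
            first.changed
            and first.action == second.action
            and second.after == first.before
        )
```

Before choosing its next action, the agent can consult `is_oscillating()` to avoid toggling the same control again, and `last_action_had_no_effect()` to decide whether to wait and re-capture instead of assuming the action succeeded.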
Journey Context:
In text-only agents, state changes are explicit in the DOM. In vision-based agents, the model sees a series of static images with no explicit linkage between them. A naive agent sees screenshot A (checkbox unchecked), takes an action (click), and sees screenshot B (checkbox checked), but treats B as a fresh scene without confirming that the state transition happened. This leads to 'oscillation' (toggling repeatedly because the agent cannot confirm that the previous action worked) or 'blind progression' (assuming an action succeeded even though the screenshot shows no change, for example because it was captured before an animation finished). The solution is 'visual diffing': before each decision, compute a difference score between the previous and current screenshot, such as a pixel diff or 1 - SSIM (SSIM itself is a similarity measure, so higher means more alike). If the score is below a threshold, treat the state as static; if it is above the threshold, analyze what changed. A description of the change then becomes part of the context ('The page changed: a loading spinner appeared'). Advanced implementations use optical flow to track specific UI elements across frames, maintaining object permanence ('The button moved from (100,100) to (150,100) due to responsive layout shift'). This pattern is emerging in robust computer-use frameworks as the difference between 'stateless screenshot chains' and 'stateful visual environments'.
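The thresholded pixel diff described above can be sketched as follows. This is an illustrative example, not a reference implementation: it assumes screenshots have already been decoded into same-sized grayscale grids (lists of rows of 0-255 values), and the function names, the per-pixel tolerance of 16, and the 0.1% change threshold are all hypothetical choices that would need tuning. It uses a plain pixel diff rather than SSIM or optical flow.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]          # grayscale screenshot: rows of 0-255 values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def diff_frames(prev: Frame, curr: Frame, pixel_tol: int = 16) -> Tuple[float, Optional[Box]]:
    """Return the fraction of changed pixels and the bounding box of the change."""
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have identical dimensions")
    changed = total = 0
    left = top = right = bottom = None
    for y, (row_a, row_b) in enumerate(zip(prev, curr)):
        for x, (a, b) in enumerate(zip(row_a, row_b)):
            total += 1
            if abs(a - b) > pixel_tol:  # ignore small rendering noise
                changed += 1
                left = x if left is None else min(left, x)
                right = x if right is None else max(right, x)
                top = y if top is None else top  # first changed row
                bottom = y                       # last changed row so far
    box = None if changed == 0 else (left, top, right, bottom)
    return (changed / total if total else 0.0), box


def describe_change(prev: Frame, curr: Frame, threshold: float = 0.001) -> str:
    """Produce a one-line summary suitable for adding to the agent's context."""
    fraction, box = diff_frames(prev, curr)
    if fraction <= threshold:
        return "No visible change since the previous screenshot."
    return f"The page changed: {fraction:.1%} of pixels differ, within region {box}."
```

The bounding box tells the agent where to look, so a follow-up step can crop that region and ask the model what changed there instead of re-reading the whole screen. A 'no visible change' result after an action is a cue to wait and re-capture before retrying.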
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:52:52.461295+00:00— report_created — created