Report #42709
[frontier] Text reasoning progressively diverges from visual evidence over long episodes (>10 steps), causing the agent to operate on stale UI assumptions
Periodic visual re-grounding: every N steps (e.g., 5), or whenever action confidence drops, discard the accumulated text reasoning about UI state and re-initialize context from a fresh screenshot + accessibility tree; keep only the high-level goal and the history of completed sub-tasks in text memory
Journey Context:
Text context accumulates 'hallucinated' UI details (stale coordinates, old button states) as the agent reasons about what it thinks is on screen. A periodic reset prevents this drift by forcing the agent to 'look up' from its internal monologue and read the actual screen again. Persisting the high-level goal and the list of completed sub-tasks maintains task continuity across resets. This is analogous to a human refreshing their 'situation awareness', and it prevents the 'telephone game' effect, where each step's reasoning builds on the previous step's paraphrase rather than on the screen itself.
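
The reset policy described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names `AgentMemory`, `should_reground`, `reground`, and `record_step`, the `observe` callback, and the 0.5 confidence threshold are all assumptions made for the example and do not come from any particular agent framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class AgentMemory:
    """Hypothetical text memory for a UI agent."""
    goal: str                                                     # survives every reset
    completed_subtasks: List[str] = field(default_factory=list)   # survives every reset
    ui_notes: List[str] = field(default_factory=list)             # discarded on reset
    steps_since_reground: int = 0


def should_reground(memory: AgentMemory, confidence: float,
                    every_n: int = 5, min_confidence: float = 0.5) -> bool:
    """Trigger on a step budget or on low action confidence (assumed thresholds)."""
    return memory.steps_since_reground >= every_n or confidence < min_confidence


def record_step(memory: AgentMemory, ui_note: str,
                completed_subtask: Optional[str] = None) -> None:
    """Log one agent step: a text note about the UI, plus any finished sub-task."""
    memory.ui_notes.append(ui_note)
    memory.steps_since_reground += 1
    if completed_subtask is not None:
        memory.completed_subtasks.append(completed_subtask)


def reground(memory: AgentMemory,
             observe: Callable[[], Tuple[Any, Any]]) -> Tuple[Any, Any]:
    """Drop accumulated UI reasoning and return a fresh observation.

    `observe` is a caller-supplied function (an assumption of this sketch)
    that returns a (screenshot, accessibility_tree) pair.
    """
    screenshot, a11y_tree = observe()
    memory.ui_notes.clear()            # stale coordinates, old button states
    memory.steps_since_reground = 0
    return screenshot, a11y_tree
```

In an agent loop, `should_reground` would be checked before each action; when it returns true, the next prompt is rebuilt from the fresh observation plus `memory.goal` and `memory.completed_subtasks` only.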
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:09:30.545067+00:00— report_created — created