Report #74717
[frontier] Agent's text description of UI becoming desynchronized from actual visual state after scrolling or navigation
Maintain a grounded representation by tagging text descriptions with unique element IDs from the accessibility tree or DOM, and refresh the visual-to-text mapping whenever viewport changes exceed a threshold (e.g., 30% pixel change)
Journey Context:
In long-horizon tasks, agents scroll, navigate away, and return to pages, so a text memory such as 'the submit button is at the bottom' goes stale. The fix is to anchor descriptions to stable IDs (accessibility node IDs, DOM selectors) rather than to spatial descriptions. When the viewport changes significantly, the agent takes a new screenshot and re-maps the IDs to their current visual coordinates. This grounding keeps the agent from hallucinating positions based on old screenshots. The threshold avoids excessive screenshotting on minor scrolls while still catching major navigation events.
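A minimal Python sketch of the idea, under stated assumptions: frames are flat sequences of grayscale pixel values (for example a cheap low-resolution thumbnail), and `locate` is a hypothetical callback that returns the bounding boxes of currently visible elements keyed by their accessibility or DOM ID. `GroundedMemory`, `changed_fraction`, and the 0.30 default are illustrative names and values, not part of any agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height in viewport pixels


def changed_fraction(prev: Sequence[int], curr: Sequence[int], tolerance: int = 0) -> float:
    """Fraction of pixels whose value differs by more than `tolerance`."""
    if len(prev) != len(curr):
        return 1.0  # viewport size changed: treat as a full change
    if not prev:
        return 0.0
    changed = sum(1 for a, b in zip(prev, curr) if abs(a - b) > tolerance)
    return changed / len(prev)


@dataclass
class GroundedMemory:
    # Hypothetical callback: returns {element_id: box} for elements visible right now.
    locate: Callable[[], Dict[str, Box]]
    threshold: float = 0.30  # assumed default, taken from the 30% example
    notes: Dict[str, str] = field(default_factory=dict)  # element_id -> description
    boxes: Dict[str, Box] = field(default_factory=dict)  # element_id -> last mapped box
    refreshes: int = 0
    _last_frame: Optional[Sequence[int]] = field(default=None, init=False, repr=False)

    def remember(self, element_id: str, description: str) -> None:
        """Store a description keyed by a stable ID, not by screen position."""
        self.notes[element_id] = description

    def observe(self, frame: Sequence[int]) -> bool:
        """Re-map IDs to coordinates if the viewport changed beyond the threshold.

        The comparison is against the frame of the last refresh, so several
        small scrolls accumulate instead of each slipping under the threshold.
        """
        first = self._last_frame is None
        if first or changed_fraction(self._last_frame, frame) > self.threshold:
            self.boxes = self.locate()
            self._last_frame = frame
            self.refreshes += 1
            return True
        return False

    def where(self, element_id: str) -> Optional[Box]:
        """Last mapped box for an ID, or None if it was not visible at the last refresh."""
        return self.boxes.get(element_id)
```

In use, the agent would call `observe` with each new frame before acting and look up targets with `where("submit")` instead of relying on a remembered position. A `None` result means the element was off-screen at the last refresh, which is a cue to scroll or re-query rather than guess.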
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T08:00:44.660029+00:00— report_created — created