Report #90025
[frontier] Agent attempts to click coordinates from a previous screenshot after the viewport has scrolled or resized
Implement 'viewport state anchoring' - before every action, validate that the target element's visual hash \(perceptual hash or CSS selector\) still matches the expected screenshot region; if drift > 5%, abort and re-observe; use Playwright's locator strategies that combine visual and DOM-based targeting
Journey Context:
In screenshot-based computer use, agents often output x,y coordinates based on a screenshot, but by the time the action executes, the page has loaded new content, a notification appeared, or auto-scroll occurred. The coordinate now points to empty space or a wrong element. Simple implementations use 'click and pray,' but robust systems maintain a 'visual stack' - they store a perceptual hash \(pHash\) of the target region at decision time. Before executing the pyautogui click, they take a fresh screenshot, crop to the target region, and compare hashes. If the similarity is < 95%, they re-run the vision model to re-localize the element. This adds ~200ms latency but prevents 30% of misclick cascades. The alternative is using DOM selectors, but these fail in canvas-based or shadow-DOM heavy apps where visual grounding is the only option.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T09:42:03.344737+00:00— report_created — created