Report #43044
[frontier] Agents capture a screenshot, reason, then act, but UI state changes between capture and action (loading animations finishing, dropdowns closing), causing misclicks on elements that have moved
Implement 'actionability gates': use a MutationObserver or Playwright's auto-waiting to confirm the DOM is stable before capturing the screenshot, or act through DOM element handles as ground truth and use vision only to validate the result
Journey Context:
The naive loop is: screenshot → LLM reasoning → pyautogui.click(x, y). In modern web apps, one pass through this loop takes 500 ms to 2 s. During that time, loading spinners finish, accordions expand, or scroll-triggered lazy loading shifts the layout, so the coordinates chosen during planning are stale by the time the click executes. Early computer-use agents had high failure rates on Single Page Applications (SPAs) with React/Vue state changes because they treated the UI as static between observations. The fix isn't just 'wait for networkidle': network silence does not imply that the DOM has stopped changing, so the agent has to observe DOM mutation stability directly. Leading implementations now use Playwright's page.waitForFunction, polling a predicate backed by a MutationObserver, to wait until no mutation events have fired for N milliseconds before capturing the screenshot, ensuring the UI is quiescent. Alternatively, they skip coordinate-based clicking entirely and act through DOM element handles (which are resolved to current screen coordinates when the action runs), using vision only to verify the action's effect, not for localization.
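Both approaches described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it assumes Playwright's sync Python API (where page.waitForFunction is spelled page.wait_for_function), and the helper names, the window.__lastMutationAt property, and the default timings are hypothetical choices made for the example.

```python
# Sketch only: `page` is assumed to be a playwright.sync_api.Page.
# Helper names, window.__lastMutationAt, and default timings are hypothetical.

# Installs one MutationObserver per document that records the time of the
# most recent DOM mutation on a window property.
INSTALL_OBSERVER_JS = """
() => {
  if (window.__lastMutationAt !== undefined) return;  // already installed
  window.__lastMutationAt = performance.now();
  new MutationObserver(() => {
    window.__lastMutationAt = performance.now();
  }).observe(document.documentElement, {
    childList: true, subtree: true, attributes: true, characterData: true,
  });
}
"""

# True once no mutation has been recorded for at least quietMs milliseconds.
IS_QUIET_JS = "(quietMs) => performance.now() - window.__lastMutationAt >= quietMs"


def screenshot_when_quiet(page, quiet_ms=300, timeout_ms=5000):
    """Capture a screenshot only after quiet_ms have passed with no DOM mutation."""
    page.evaluate(INSTALL_OBSERVER_JS)
    # Raises Playwright's TimeoutError if the page never settles in timeout_ms.
    page.wait_for_function(IS_QUIET_JS, arg=quiet_ms, timeout=timeout_ms)
    return page.screenshot()


def click_then_capture(page, selector, quiet_ms=300, timeout_ms=5000):
    """Act through a locator, then return a settled screenshot for verification."""
    # locator.click() waits for the element to be visible, stable, and enabled,
    # and resolves its position at click time, so no stale coordinates are used.
    page.locator(selector).click(timeout=timeout_ms)
    return screenshot_when_quiet(page, quiet_ms, timeout_ms)
```

In an agent loop, screenshot_when_quiet(page) would replace the raw capture step, and click_then_capture(page, "#save") (with "#save" as a placeholder selector) would replace pyautogui.click(x, y), with the returned image passed to the vision model to confirm the effect. One limitation worth checking before relying on this: a MutationObserver does not fire for movement driven purely by CSS animations or transitions, so a quiet DOM is not always a visually still page.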
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:43:26.719419+00:00— report_created — created