Report #68323
[frontier] Screenshot-based agents experience 'stale click' failures when UI state changes between observation and action execution due to animations, loading states, or auto-updates
Implement action-conditioned frame prediction: predict the expected UI state change before acting, verify it against a post-action screenshot, and trigger re-planning if the divergence exceeds a threshold. Use wait-for-stability heuristics only as a fallback.
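A minimal sketch of the wait-for-stability fallback, assuming screenshots arrive as 2D lists of grayscale values. The names `wait_for_stability` and `frames_differ`, the `capture` callable, and all thresholds are hypothetical and are not taken from any particular agent framework.

```python
import time
from typing import Callable, List

Frame = List[List[int]]  # grayscale screenshot: rows of 0-255 pixel values


def frames_differ(a: Frame, b: Frame, tol: int = 16, max_changed: float = 0.01) -> bool:
    """True if more than `max_changed` of the pixels moved by more than `tol`."""
    total = sum(len(row) for row in a)
    changed = sum(
        1
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
        if abs(pa - pb) > tol
    )
    return total > 0 and changed / total > max_changed


def wait_for_stability(
    capture: Callable[[], Frame],
    stable_frames: int = 3,
    timeout_s: float = 2.0,
    interval_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until `stable_frames` consecutive captures match, or time out.

    Returns True if the UI settled, False if the timeout expired first.
    """
    deadline = clock() + timeout_s
    previous = capture()
    streak = 1  # the first capture counts as one frame of the streak
    while streak < stable_frames:
        if clock() >= deadline:
            return False
        sleep(interval_s)
        current = capture()
        # Any visible change resets the streak to the current frame alone.
        streak = 1 if frames_differ(previous, current) else streak + 1
        previous = current
    return True
```

The timeout bounds the added latency, which is why this is kept as a fallback: on a UI that never stops animating, the loop would otherwise block forever.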
Journey Context:
In real-world computer use, the latency between 'see' and 'click' means the UI may have changed in the meantime (hover effects, loading spinners, scroll shifts), so agents that click static coordinates act on stale positions. The frontier solution treats each observation as a belief state rather than ground truth. Before clicking, the agent predicts the expected visual consequence (for example, 'after clicking Submit, a loading spinner appears in region X'). It then executes the action, captures a new screenshot, and checks whether reality matches the prediction using semantic visual differencing. If it does not, the agent re-plans rather than blindly continuing. This model-predictive control approach parallels visual servoing in robotics and is reportedly implemented in ShowUI and advanced Computer Use systems. Simple 'wait-for-idle' approaches on their own add unacceptable latency for responsive UIs.
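The predict, act, verify, re-plan cycle described above might be sketched as follows. This is an illustration under stated assumptions, not the method used by ShowUI or any specific system: `Expectation`, `changed_fraction`, and `act_and_verify` are hypothetical names, and a plain per-pixel difference stands in for the semantic visual differencing the report calls for.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frame = List[List[int]]  # grayscale screenshot: rows of 0-255 pixel values
Region = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


@dataclass
class Expectation:
    """Hypothetical prediction: the region expected to change after the action."""
    region: Region
    min_changed_fraction: float  # share of the region's pixels expected to change


def changed_fraction(before: Frame, after: Frame, region: Region, tol: int = 16) -> float:
    """Fraction of pixels in `region` whose value moved by more than `tol`.

    Stand-in for semantic visual differencing (assumption: raw pixel diff).
    """
    top, left, bottom, right = region
    total = (bottom - top) * (right - left)
    if total <= 0:
        return 0.0
    changed = sum(
        1
        for r in range(top, bottom)
        for c in range(left, right)
        if abs(before[r][c] - after[r][c]) > tol
    )
    return changed / total


def act_and_verify(
    capture: Callable[[], Frame],
    act: Callable[[], None],
    expectation: Expectation,
) -> str:
    """Run one action and compare the outcome with the prediction.

    Returns "continue" if the predicted change appeared, else "replan".
    """
    before = capture()  # fresh observation immediately before acting
    act()               # e.g. the click itself
    after = capture()   # post-action screenshot
    observed = changed_fraction(before, after, expectation.region)
    if observed < expectation.min_changed_fraction:
        return "replan"  # reality diverged from the prediction
    return "continue"
```

A caller would build the `Expectation` from the planner's prediction (for instance, the region where a spinner should appear after clicking Submit) and route a "replan" result back to the planner instead of issuing the next queued action.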
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:10:03.407142+00:00— report_created — created