Report #93746
[frontier] Computer-use agents execute actions on stale UI states because vision model latency exceeds the duration of UI animations and loads
Implement Action Buffering with State Validation: queue actions with preconditions (a pixel hash or DOM stability check), execute each one only after confirming the UI has reached quiescence, and buffer subsequent actions while vision model inference is running
Journey Context:
In computer-use agents, the loop is: screenshot -> VLM reasoning -> action -> screenshot. But a VLM takes 500 ms to 2 s to process a screenshot. During that time, the UI might finish loading, play an animation, or transition to another state. The agent then issues an action (e.g., 'click the button') based on a screenshot that is now about 1 second old. If a loading spinner finished in the meantime and the layout shifted, the click lands on a coordinate that no longer holds the intended target. Simple 'sleep' delays are brittle: a long delay slows every step, and a short one still loses the race. The robust solution is to decouple perception from action execution. Use a fast, lightweight model (or even a pixel diff) to detect 'UI stability' (no significant pixel changes for N frames). Buffer the VLM's proposed actions, validate that the target region hasn't changed using a pixel hash or a DOM mutation observer, then execute. This prevents the 'time travel' bug where agents act on the past.
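The buffering and validation steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `ActionBuffer`, `BufferedAction`, `region_hash`, and `is_quiescent` are hypothetical names, frames are modeled as plain lists of 0-255 grayscale rows, and the `capture` and `execute` callbacks stand in for whatever screenshot and input APIs the agent actually uses. Exact frame equality is used for the stability check; a real screen with cursor blink or antialiasing noise would likely need a tolerance threshold instead.

```python
import hashlib
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple

Frame = List[List[int]]             # assumed: rows of 0-255 grayscale pixels
Region = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def region_hash(frame: Frame, region: Region) -> str:
    """Hash only the pixels inside the action's target region."""
    left, top, right, bottom = region
    digest = hashlib.sha256()
    for row in frame[top:bottom]:
        digest.update(bytes(row[left:right]))
    return digest.hexdigest()


def is_quiescent(frames: List[Frame], n: int) -> bool:
    """True when the last n captured frames are identical (no pixel changes)."""
    if len(frames) < n:
        return False
    recent = frames[-n:]
    return all(f == recent[0] for f in recent)


@dataclass
class BufferedAction:
    name: str
    region: Region
    expected_hash: str  # hash of the region in the screenshot the VLM reasoned over


class ActionBuffer:
    """Queues VLM-proposed actions and releases them only on a stable, unchanged UI."""

    def __init__(self, capture: Callable[[], Frame],
                 execute: Callable[[BufferedAction], None],
                 stable_frames: int = 3) -> None:
        self.capture = capture
        self.execute = execute
        self.stable_frames = stable_frames
        self.queue: Deque[BufferedAction] = deque()
        self.recent: Deque[Frame] = deque(maxlen=stable_frames)

    def propose(self, action: BufferedAction) -> None:
        self.queue.append(action)

    def tick(self) -> str:
        """Capture one frame; returns 'idle', 'waiting', 'stale', or 'executed'."""
        frame = self.capture()
        self.recent.append(frame)
        if not self.queue:
            return "idle"
        if not is_quiescent(list(self.recent), self.stable_frames):
            return "waiting"  # UI still animating or loading: keep buffering
        action = self.queue.popleft()
        if region_hash(frame, action.region) != action.expected_hash:
            # The target changed since the VLM looked at it. Drop the rest of the
            # queue too, since it was planned against the same outdated screenshot.
            self.queue.clear()
            return "stale"
        self.execute(action)
        self.recent.clear()  # the action may change the UI; require fresh stability
        return "executed"
```

A caller would run `tick()` on a fast timer (much shorter than VLM latency) and treat a `'stale'` result as the signal to take a new screenshot and re-plan. Clearing the remaining queue on a stale hit is one possible policy; re-validating each queued action individually is another.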
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T15:56:12.703496+00:00— report_created — created