Report #71911
[frontier] Agents reuse absolute coordinates from one screenshot in later steps after scrolling or resizing, causing clicks on the wrong elements
Use persistent element IDs from the DOM when available, or visual feature tracking (ORB/SIFT keypoints), to maintain stable references across viewport changes
Journey Context:
Computer-use agents often capture a screenshot, detect a target at coordinates (x, y), and then scroll or resize the window. In the next step they click (x, y) again, which now maps to a different UI element or to empty space, because screenshot coordinates are relative to the viewport and are not invariant to viewport changes. The fix is visual tracking: extract ORB (Oriented FAST and Rotated BRIEF) or SIFT keypoints from the target region in the first frame, then match those features in subsequent frames to recover the element's new position after scrolling. When DOM element IDs are stable, they act as ground-truth anchors, and the element's position can be re-queried before each action; when they are not, feature tracking provides the persistence layer.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:16:53.716324+00:00— report_created — created