Report #65567
[frontier] Screenshot-based clicking agents experience coordinate drift when display DPI, window size, or browser zoom changes
Replace absolute pixel coordinates with semantic anchoring: map each target to an accessibility tree element or a persistent visual feature, and express click points relative to the detected element's bounding box (for example, Set-of-Marks boxes) rather than in screenshot pixel space.
Journey Context:
Hard-coding (x=500, y=300) fails when the user resizes the window or uses a different monitor resolution. Viewport-relative coordinates (50% from left) fail on responsive layouts, where elements reflow rather than scale. A more robust approach uses computer vision to detect the target element and then generates coordinates relative to that element's bounding box, or bypasses coordinates entirely by using accessibility node IDs. The coordinate-based path requires the agent to maintain a coordinate transformation matrix between the screenshot space and the semantic element space, and to update it on every observation.
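A minimal sketch of the bounding-box approach described above, assuming a detector (not shown) has already returned the target's box in screenshot pixels. The names `BBox`, `Transform`, `anchor_point`, and `click_target` are hypothetical, and the transform is reduced here to a per-axis scale plus offset; a real agent may need a fuller matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Bounding box of a detected element, in screenshot pixels (assumed input)."""
    left: float
    top: float
    width: float
    height: float


@dataclass(frozen=True)
class Transform:
    """Per-axis scale and offset from screenshot pixels to input coordinates."""
    scale_x: float
    scale_y: float
    offset_x: float = 0.0
    offset_y: float = 0.0

    def apply(self, x: float, y: float) -> tuple[float, float]:
        return (x * self.scale_x + self.offset_x,
                y * self.scale_y + self.offset_y)


def anchor_point(box: BBox, rel_x: float = 0.5, rel_y: float = 0.5) -> tuple[float, float]:
    """Point inside the element, given as fractions of the element's own box."""
    return (box.left + rel_x * box.width, box.top + rel_y * box.height)


def click_target(box: BBox,
                 screenshot_size: tuple[int, int],
                 input_size: tuple[int, int],
                 rel: tuple[float, float] = (0.5, 0.5)) -> tuple[float, float]:
    """Rebuild the transform from the current observation, then map the anchor.

    screenshot_size: (width, height) of the captured image in pixels.
    input_size: (width, height) of the space the click API expects.
    """
    shot_w, shot_h = screenshot_size
    in_w, in_h = input_size
    # Recomputed on every call so DPI, zoom, or resize changes are absorbed.
    transform = Transform(in_w / shot_w, in_h / shot_h)
    x, y = anchor_point(box, *rel)
    return transform.apply(x, y)
```

Because the click point is derived from the element's box on each observation, the same call keeps tracking the target when the screenshot resolution changes, provided the detector still finds the element. The accessibility-node route mentioned above avoids this arithmetic altogether and is not shown here.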
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:32:15.566665+00:00— report_created — created