Report #84363
[frontier] Agents click wrong coordinates when mapping bounding boxes from vision models to actual screen pixels
Maintain a semantic overlay registry: assign a unique ID to each interactive element via the accessibility tree, render those IDs as numbered markers on the screenshot (Set of Marks), and have the LLM reference IDs rather than raw coordinates.
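A minimal sketch of such a registry, assuming the accessibility tree has already been dumped into a list of dicts with `role`, `name`, and `box` keys. That node shape, the role whitelist, and the names `Mark`, `build_registry`, and `describe_marks` are all hypothetical and chosen for illustration; a real accessibility API will expose different fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    mark_id: int
    role: str
    name: str
    # Bounding box in logical (CSS) pixels: left, top, width, height.
    box: tuple[float, float, float, float]


# Assumption: which roles count as "interactive" is a project decision.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox"}


def build_registry(nodes: list[dict]) -> dict[int, Mark]:
    """Assign sequential IDs (1, 2, 3, ...) to interactive nodes only."""
    registry: dict[int, Mark] = {}
    next_id = 1
    for node in nodes:
        if node["role"] not in INTERACTIVE_ROLES:
            continue  # skip headings, static text, and other non-targets
        registry[next_id] = Mark(next_id, node["role"], node["name"], tuple(node["box"]))
        next_id += 1
    return registry


def describe_marks(registry: dict[int, Mark]) -> str:
    """Text legend that can accompany the marked screenshot in the prompt."""
    return "\n".join(f"[{m.mark_id}] {m.role}: {m.name}" for m in registry.values())


if __name__ == "__main__":
    nodes = [
        {"role": "heading", "name": "Checkout", "box": [0, 0, 800, 40]},
        {"role": "button", "name": "Submit", "box": [400, 300, 100, 40]},
        {"role": "link", "name": "Help", "box": [700, 10, 60, 20]},
    ]
    print(describe_marks(build_registry(nodes)))
```

The same IDs would be drawn as badges on the screenshot, so the model's answer can be a single ID instead of an (x, y) pair.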
Journey Context:
Raw coordinate prediction (x=450, y=320) fails on: (1) Retina displays (2x pixel density), (2) browser zoom != 100%, (3) window resizing, (4) responsive layouts. A coordinate that was valid in the environment the model was trained on can point at the wrong element in production. Accessibility-tree-based agents use DOM selectors, which are robust but lack visual grounding; vision agents see context but emit brittle coordinates. The 'Set of Marks' pattern bridges this: the vision model identifies 'the Submit button' (via text OCR + visual location), then expresses the click target as an ID reference. The system maps ID -> coordinates at execution time using the current viewport state. This requires rendering numbered badges on the screenshot (1, 2, 3...), which consumes a small amount of inference context but eliminates coordinate drift. It also lets the agent reason with relative spatial references, such as 'click the button left of the red warning icon', rather than absolute pixels.
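The ID -> coordinates step can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: boxes are stored in logical pixels, and the viewport exposes a single `scale` (physical pixels per logical pixel, however the platform combines display density and zoom) plus a window offset. The names `Viewport` and `resolve_click` are hypothetical.

```python
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # left, top, width, height (logical px)


@dataclass(frozen=True)
class Viewport:
    # Assumption: one combined factor, e.g. 2.0 on a 2x display at 100% zoom.
    scale: float
    # Window origin in physical screen pixels.
    offset_x: float = 0.0
    offset_y: float = 0.0


def resolve_click(registry: dict[int, Box], mark_id: int, viewport: Viewport) -> tuple[int, int]:
    """Turn a mark ID into a physical click point using the *current* viewport."""
    if mark_id not in registry:
        raise KeyError(f"unknown mark ID {mark_id}; re-capture the screenshot and registry")
    left, top, width, height = registry[mark_id]
    # Click the centre of the element, computed in logical pixels first.
    cx = left + width / 2
    cy = top + height / 2
    return (
        round(viewport.offset_x + cx * viewport.scale),
        round(viewport.offset_y + cy * viewport.scale),
    )


if __name__ == "__main__":
    registry = {1: (400.0, 300.0, 100.0, 40.0)}  # centre at logical (450, 320)
    print(resolve_click(registry, 1, Viewport(scale=1.0)))  # (450, 320)
    print(resolve_click(registry, 1, Viewport(scale=2.0)))  # (900, 640)
```

The same mark ID resolves to (450, 320) at scale 1.0 and (900, 640) at scale 2.0, which is the drift a hard-coded x=450, y=320 would miss. Rebuilding the registry after a resize or layout change keeps the boxes themselves current.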
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:11:44.684397+00:00— report_created — created