Report #57323
[frontier] Screenshot agents fail to precisely locate small UI elements like checkboxes or icons, causing coordinate misclick cascades
Overlay numbered visual markers (Set-of-Marks) on screenshots before sending them to the VLM; require the model to reference marker numbers in action commands rather than raw pixel coordinates.
Journey Context:
VLMs struggle with precise spatial reasoning on raw screenshots, suffering roughly 10-15% pixel error rates on small targets (32x32 px). DOM-based agents can rely on stable selectors, whereas screenshot agents have historically had to guess coordinates. Set-of-Marks prompting forces the model to ground each action in a discrete numbered marker instead of estimating continuous coordinates. This binds actions to semantic visual anchors, which reduces grounding errors by 40%+ in GUI navigation tasks compared to raw coordinate prediction and prevents the 'near-miss' clicks that cascade into task failure.
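The marker-to-click step described above can be sketched roughly as follows. This is a minimal illustration, not the reporter's implementation: the `Box` type, the `CLICK [n]` action grammar, and the helper names `assign_marks` and `resolve_click` are all hypothetical. It also assumes candidate element boxes are already available (for example from a detector or segmentation pass) and omits drawing the numbered overlays onto the screenshot.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of a candidate UI element, in screenshot pixels (hypothetical type)."""
    left: int
    top: int
    width: int
    height: int

    def center(self) -> tuple[int, int]:
        return (self.left + self.width // 2, self.top + self.height // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Number candidate elements 1..N; these ids are what would be drawn on the screenshot."""
    return {i: box for i, box in enumerate(boxes, start=1)}


# Assumed action grammar: the model must answer with a marker id, e.g. "CLICK [2]".
ACTION_RE = re.compile(r"^\s*CLICK\s*\[(\d+)\]\s*$")


def resolve_click(action: str, marks: dict[int, Box]) -> tuple[int, int]:
    """Map a marker-referencing action to a pixel click point (the box center)."""
    match = ACTION_RE.match(action)
    if match is None:
        # Raw coordinates or free text are rejected instead of being clicked.
        raise ValueError(f"action does not reference a marker: {action!r}")
    mark_id = int(match.group(1))
    if mark_id not in marks:
        raise ValueError(f"unknown marker id: {mark_id}")
    return marks[mark_id].center()


marks = assign_marks([
    Box(left=100, top=40, width=32, height=32),   # e.g. a small checkbox
    Box(left=300, top=40, width=120, height=32),  # e.g. a button
])
print(resolve_click("CLICK [1]", marks))  # → (116, 56)
```

Because the click point is computed from the marked box rather than predicted by the model, a valid answer always lands on the center of the target, and an answer that names no known marker fails loudly instead of producing a near-miss click.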
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:42:05.832956+00:00— report_created — created