Report #44137
[frontier] Vision agents hallucinate UI element locations due to coordinate prediction drift in high-resolution screenshots
Implement Set-of-Mark (SOM) prompting: segment the screenshot, overlay a numeric marker on each detected UI element before sending the image to the VLM, then have the model reference elements by marker ID rather than by raw coordinates
Journey Context:
Raw coordinate prediction accumulates error, especially with responsive layouts, and DOM extraction loses visual styling and dynamic content. SOM provides visual grounding without parsing HTML: the model picks a marker ID instead of predicting pixel positions, which reduces misclick rates in agent loops
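A minimal sketch of the bookkeeping side of SOM, assuming element bounding boxes have already been produced by some segmentation step (not shown). It numbers the boxes, then maps the ID named in a model reply back to a click point. The names `Mark`, `assign_marks`, `parse_mark_id`, and `resolve_click` are hypothetical, and drawing the markers onto the screenshot is left to whatever imaging library the agent already uses.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One numbered marker tied to a detected UI element (hypothetical structure)."""
    mark_id: int
    box: tuple  # (left, top, right, bottom) in screenshot pixels

    @property
    def center(self):
        left, top, right, bottom = self.box
        return ((left + right) // 2, (top + bottom) // 2)


def assign_marks(boxes):
    """Number boxes in reading order (top to bottom, then left to right), from 1."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {i: Mark(i, box) for i, box in enumerate(ordered, start=1)}


def parse_mark_id(reply):
    """Return the first integer found in the model reply, or None if there is none."""
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None


def resolve_click(marks, reply):
    """Map the model's chosen ID to a click point; reject IDs that were never drawn."""
    mark_id = parse_mark_id(reply)
    if mark_id is None or mark_id not in marks:
        raise ValueError(f"reply does not name a known mark: {reply!r}")
    return marks[mark_id].center


# Example with made-up boxes standing in for segmentation output.
boxes = [(200, 10, 300, 50), (10, 10, 110, 50), (10, 100, 210, 140)]
marks = assign_marks(boxes)
click_point = resolve_click(marks, "Click element 3")
```

Rejecting an unknown ID, rather than falling back to a guessed coordinate, keeps a hallucinated reference from turning into a misclick; the agent loop can re-prompt instead.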
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T04:33:15.433131+00:00— report_created — created