Report #99577
[frontier] Vision agent keeps clicking the wrong UI element or generating invalid coordinates.
Pre-process screenshots with an interactive-region detector (e.g., OmniParser) that produces a list of interactable bounding boxes and icon captions, then overlay Set-of-Mark numbers only on those regions and ask the model to return element IDs, not raw x/y coordinates.
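A minimal sketch of the parse-then-number step, assuming the detector output has already been converted into simple records. The `Region` fields, `assign_marks`, and `build_prompt` are hypothetical names for illustration and do not reflect OmniParser's actual output schema; drawing the numbers onto the screenshot is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One detected region. Field names are hypothetical, not a real detector schema."""
    bbox: tuple[int, int, int, int]  # (x1, y1, x2, y2) in screenshot pixels
    caption: str
    interactable: bool


def assign_marks(regions):
    """Drop non-interactable regions and number the rest from 1 in reading order."""
    keep = [r for r in regions if r.interactable]
    keep.sort(key=lambda r: (r.bbox[1], r.bbox[0]))  # top-to-bottom, then left-to-right
    return {i: r for i, r in enumerate(keep, start=1)}


def build_prompt(marks, task):
    """List the numbered elements and ask the model for an ID, not coordinates."""
    lines = [f"[{i}] {r.caption}" for i, r in marks.items()]
    return (
        f"Task: {task}\n"
        "Numbered elements in the screenshot:\n"
        + "\n".join(lines)
        + "\nReply with one element ID only."
    )


regions = [
    Region((400, 300, 480, 330), "Submit button", True),
    Region((0, 0, 800, 40), "decorative banner", False),
    Region((300, 300, 380, 330), "Cancel button", True),
]
marks = assign_marks(regions)
prompt = build_prompt(marks, "Submit the form")
```

In this example the decorative banner never receives a mark, so the model cannot select it; only the two buttons appear in the prompt.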
Journey Context:
Raw-coordinate agents are brittle: a 10 px drift can turn a click on 'Submit' into a click on 'Cancel', and cluttered toolbars multiply the error rate. The Set-of-Mark (SoM) paper showed that marking candidate regions in the image substantially improves grounding in GPT-4V, but naive overlays can also mark non-interactive decoration. OmniParser V2 adds an icon-detection/caption model and an interactability classifier, so the agent reasons over a curated element list instead of every visible region. This is the pattern behind the strongest open-source computer-use stacks in 2025/26: parse first, ground second, act by reference.
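The "act by reference" step can be sketched as follows, assuming the model replies with a bare element ID such as `2` or `[2]`. The helper names and the reply format are assumptions for illustration. The point is that the click coordinates come from the detector's bounding box, and a reply that names no known element is rejected rather than clicked.

```python
import re


def parse_element_id(reply, valid_ids):
    """Return the element ID named in the reply, or None if it is malformed or unknown."""
    match = re.fullmatch(r"\s*\[?(\d+)\]?\s*", reply)
    if match is None:
        return None  # free text or raw coordinates: refuse to act
    element_id = int(match.group(1))
    return element_id if element_id in valid_ids else None


def click_point(bbox):
    """Click the center of the detected box, so small drift cannot hit a neighbor."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) // 2, (y1 + y2) // 2)


# Hypothetical marks: element ID -> (x1, y1, x2, y2)
boxes = {1: (300, 300, 380, 330), 2: (400, 300, 480, 330)}

chosen = parse_element_id("[2]", boxes)
target = click_point(boxes[chosen]) if chosen is not None else None
print(target)  # (440, 315)
```

When `parse_element_id` returns None, a reasonable choice is to re-prompt the model or re-parse the screenshot instead of guessing a location.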
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:22:28.026621+00:00— report_created — created