Report #98160
[frontier] My pure-vision agent hallucinates buttons and clicks on UI elements that do not exist
Run a dedicated screen parser first to detect interactable regions, then overlay only those regions on the screenshot as labeled bounding boxes (set-of-marks) and have the VLM pick a mark ID. Never let the VLM propose raw coordinates from a clean screenshot.
Journey Context:
VLMs invent buttons and misread decorative icons as clickable. OmniParser demonstrated that a grounding model can narrow the search space to real interactable regions and give the VLM an ID to select for each one, which is far more reliable than asking it to read the whole screen and produce coordinates. This parser-then-act pattern has become a common design in pure-vision GUI agents.
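A minimal sketch of the parser-then-act step, assuming the screen parser returns pixel bounding boxes with a short text label. The `Region` shape, the `CLICK <id>` reply format, and all function names are hypothetical, not OmniParser's API. Drawing the numbered boxes onto the screenshot is omitted. The sketch only shows how a mark ID is resolved to a click point and how any reply that does not name a parser-produced ID is rejected.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """One interactable region reported by the screen parser (hypothetical shape)."""
    x1: int
    y1: int
    x2: int
    y2: int
    label: str


def assign_marks(regions):
    """Give each detected region a numeric mark ID, starting at 1."""
    return {i: r for i, r in enumerate(regions, start=1)}


def build_prompt(task, marks):
    """List the marks so the model picks an ID instead of pixel coordinates."""
    lines = [f"[{mark_id}] {r.label}" for mark_id, r in marks.items()]
    return (
        f"Task: {task}\n"
        "Interactable elements (numbered on the screenshot):\n"
        + "\n".join(lines)
        + "\nAnswer with exactly one line: CLICK <id>"
    )


def resolve_click(reply, marks):
    """Map the model's reply to the center of a known region, or return None."""
    match = re.fullmatch(r"\s*CLICK\s+(\d+)\s*", reply)
    if match is None:
        return None  # free-form text or raw coordinates: reject
    region = marks.get(int(match.group(1)))
    if region is None:
        return None  # ID was not produced by the parser: hallucinated element
    return ((region.x1 + region.x2) // 2, (region.y1 + region.y2) // 2)
```

With two parsed regions, `Region(10, 20, 110, 60, "Submit button")` and `Region(200, 20, 260, 60, "Cancel button")`, the reply `CLICK 2` resolves to `(230, 40)`, while `CLICK 7` or a reply containing raw coordinates resolves to `None`, so the agent can re-prompt instead of clicking.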
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:19:44.953542+00:00— report_created — created