Report #82382
[frontier] Phantom UI element hallucinations in Set-of-Mark cause irrecoverable click errors
Implement grounding verification: before executing a click, cross-reference the predicted element ID against the DOM accessibility tree, or check pixel-color consistency at the target location
Journey Context:
Vision models can hallucinate bounding boxes for buttons that do not exist (e.g., predicting a 'Cancel' button when only 'Submit' is on screen). Set-of-Mark reduces coordinate errors but does not eliminate semantic hallucinations: the model may still assign an ID to visual noise. The frontier fix is a 'grounding verification' layer. After the VLM predicts an element ID or coordinates, a secondary check either queries the browser's accessibility tree (ARIA roles) for a matching element, or samples the pixel color at the predicted center to verify that it matches the expected properties of the UI element (e.g., the button's fill color). If verification fails, the agent re-queries the VLM instead of clicking. This 'verify-then-act' pattern is critical for production agents, where a single phantom click can crash the session. Simple retry loops without verification waste API calls and do not catch hallucinations.
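A minimal sketch of the verify-then-act loop, covering only the accessibility-tree check (the pixel-color check is omitted). Everything here is hypothetical and for illustration: `AxNode`, `Prediction`, `verify_grounding`, `verify_then_act`, and the `CLICKABLE_ROLES` set are not from any particular library. It assumes the accessibility tree has already been flattened into nodes with screen-space bounds, and that the caller supplies its own `query_vlm` and `click` callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Roles treated as clickable in this sketch (an assumption; extend as needed).
CLICKABLE_ROLES = {"button", "link", "checkbox", "menuitem", "tab"}


@dataclass(frozen=True)
class AxNode:
    """One accessibility-tree node, flattened, with screen-space bounds."""
    role: str
    name: str
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


@dataclass(frozen=True)
class Prediction:
    """What the VLM claims: an element label and a click point."""
    label: str
    x: float
    y: float


def verify_grounding(pred: Prediction, nodes: Sequence[AxNode]) -> Optional[AxNode]:
    """Return the node that backs the prediction, or None if it looks like a phantom."""
    wanted = pred.label.strip().lower()
    if not wanted:
        return None  # an empty label cannot be grounded
    for node in nodes:
        if (node.role in CLICKABLE_ROLES
                and node.contains(pred.x, pred.y)
                and wanted in node.name.strip().lower()):
            return node
    return None


def verify_then_act(
    query_vlm: Callable[[int], Prediction],
    nodes: Sequence[AxNode],
    click: Callable[[Prediction], None],
    max_attempts: int = 3,
) -> bool:
    """Re-query the VLM until a prediction verifies; click only after verification."""
    for attempt in range(max_attempts):
        pred = query_vlm(attempt)
        if verify_grounding(pred, nodes) is not None:
            click(pred)
            return True
    return False  # never clicked: every prediction failed verification
```

The loop differs from a plain retry in that a click is only issued once a prediction matches a real node, so a phantom 'Cancel' prediction is rejected before it reaches the page. When every attempt fails, the function returns `False` without clicking, leaving the caller to decide how to recover.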
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:52:16.209379+00:00— report_created — created