Report #46297
[frontier] Vision models hallucinating UI elements that don't exist or misinterpreting visual layout (e.g., clicking decorative images instead of buttons)
Implement cross-modal grounding: use the vision model to propose element locations, then verify each proposal against the browser's accessibility tree (AXTree) or DOM to confirm that the element exists and is actually interactive (correct role, enabled, has click handlers). Reject any proposal that has no matching element in the DOM.
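The verification step can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `AXNode`, `INTERACTIVE_ROLES`, and `verify_proposal` are hypothetical names, and the sketch assumes the accessibility tree has already been flattened into a list of nodes with screen-space bounding boxes in the same coordinate system as the screenshot.

```python
from dataclasses import dataclass
from typing import List, Optional

# Roles treated as interactive in this sketch (an assumption, not an exhaustive list).
INTERACTIVE_ROLES = {"button", "link", "checkbox", "radio", "textbox", "combobox", "menuitem", "tab"}


@dataclass(frozen=True)
class AXNode:
    """Hypothetical flattened accessibility-tree node with screen-space bounds."""
    role: str
    name: str
    x: float
    y: float
    width: float
    height: float
    disabled: bool = False

    def contains(self, px: float, py: float) -> bool:
        """True if the point lies inside this node's bounding box."""
        return self.x <= px <= self.x + self.width and self.y <= py <= self.y + self.height


def verify_proposal(px: float, py: float, claimed_role: str, nodes: List[AXNode]) -> Optional[AXNode]:
    """Return the node that confirms the vision model's claim, or None to reject it."""
    hits = [
        n for n in nodes
        if n.contains(px, py)
        and n.role == claimed_role
        and n.role in INTERACTIVE_ROLES
        and not n.disabled
    ]
    if not hits:
        return None
    # If several matching nodes overlap the point, prefer the smallest (most specific) one.
    return min(hits, key=lambda n: n.width * n.height)


nodes = [
    AXNode("img", "Hero banner", 0, 0, 800, 300),
    AXNode("button", "Submit", 100, 400, 120, 40),
    AXNode("button", "Delete", 300, 400, 120, 40, disabled=True),
]

confirmed = verify_proposal(150, 420, "button", nodes)     # point inside the real button
hallucinated = verify_proposal(400, 150, "button", nodes)  # point inside a decorative image
disabled_hit = verify_proposal(350, 420, "button", nodes)  # point inside a disabled button
```

In this sketch a proposal is accepted only when the role matches, the node is enabled, and the proposed point falls inside its bounds. A real agent would also need to handle scroll offsets and device pixel ratio when mapping screenshot coordinates to node bounds.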
Journey Context:
Pure vision approaches (e.g., GPT-4V clicking on screenshots) sometimes 'see' buttons that are actually images, or miss disabled states that are indicated only by CSS. Pure DOM approaches miss visual context. The cross-modal chain treats the DOM/accessibility tree as 'ground truth' for verifying visual hypotheses. If the vision model claims 'there's a submit button at (x,y)' but the accessibility tree shows no button element there, the agent takes a new screenshot or asks the user. This is critical for production computer-use agents.
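The fallback behavior described here (re-screenshot, then ask the user) might look like the loop below. It is a sketch under stated assumptions: `ground_with_retries`, `propose`, and `verify` are hypothetical names, `propose` is assumed to capture a fresh screenshot on every call, and `verify` is assumed to check the proposal against the accessibility tree or DOM.

```python
from typing import Callable, Optional, Tuple

# A visual hypothesis: (x, y, claimed role). Hypothetical shape for this sketch.
Proposal = Tuple[float, float, str]


def ground_with_retries(
    propose: Callable[[], Proposal],
    verify: Callable[[Proposal], bool],
    max_attempts: int = 3,
) -> Optional[Proposal]:
    """Return the first proposal confirmed by the tree, or None after max_attempts.

    A None result means the agent should stop guessing and ask the user.
    """
    for _ in range(max_attempts):
        proposal = propose()  # assumed to take a new screenshot each time
        if verify(proposal):  # the accessibility tree / DOM is treated as ground truth
            return proposal
    return None


# Simulated run: the first proposal lands on an image, the second on a real button.
scripted = iter([(400.0, 150.0, "button"), (150.0, 420.0, "button")])
real_buttons = {(150.0, 420.0, "button")}

accepted = ground_with_retries(lambda: next(scripted), lambda p: p in real_buttons)

# Simulated run where nothing verifies: the caller escalates to the user.
gave_up = ground_with_retries(lambda: (1.0, 1.0, "button"), lambda p: False, max_attempts=2)
```

Capping the number of attempts keeps the agent from looping indefinitely on a page where the vision model and the tree never agree; the cap of 3 is an arbitrary choice for illustration.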
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T08:10:57.710850+00:00— report_created — created