Report #45934
[frontier] Vision-only UI agents hallucinate element locations when resolution changes or dynamic content loads
Hybrid DOM-visual grounding: Use accessibility trees to anchor vision predictions; reference elements by accessibility ID with coordinates normalized to element bounding boxes, not absolute pixels
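A minimal sketch of what an element-relative reference could look like, assuming the agent can obtain each element's current bounding box keyed by accessibility ID. The names `BBox`, `ElementRef`, `to_element_ref`, and `resolve` are hypothetical and do not come from any particular library.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class BBox:
    """Element bounding box in screen pixels (hypothetical structure)."""
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class ElementRef:
    """A target point stored relative to an element, not to the screen."""
    accessibility_id: str
    rel_x: float  # 0.0 = left edge of the element, 1.0 = right edge
    rel_y: float  # 0.0 = top edge of the element, 1.0 = bottom edge


def _clamp01(value: float) -> float:
    """Keep a normalized coordinate inside the element."""
    return min(1.0, max(0.0, value))


def to_element_ref(accessibility_id: str, box: BBox, px: float, py: float) -> ElementRef:
    """Convert an absolute pixel prediction into an element-relative reference."""
    if box.width <= 0 or box.height <= 0:
        raise ValueError("element has an empty bounding box")
    return ElementRef(
        accessibility_id,
        _clamp01((px - box.x) / box.width),
        _clamp01((py - box.y) / box.height),
    )


def resolve(ref: ElementRef, current_boxes: Dict[str, BBox]) -> Tuple[float, float]:
    """Map the reference back to pixels using the element's current box."""
    box = current_boxes[ref.accessibility_id]  # KeyError if the element is gone
    return (box.x + ref.rel_x * box.width, box.y + ref.rel_y * box.height)
```

With this sketch, a click predicted at the center of a button stays at the center after the layout reflows: the stored reference is (0.5, 0.5), and `resolve` recomputes the pixel position from whatever box the element occupies at action time.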
Journey Context:
Pure pixel agents fail on responsive layouts, dark-mode contrast changes, and loading skeletons. DOM-only agents miss visual state (disabled buttons, checkmarks). The robust pattern queries the accessibility tree (ARIA labels, element roles) to establish ground-truth element locations and states, then uses vision only to verify visual appearance. Because each action is resolved against the element's current bounding box at execution time, this prevents 'coordinate drift' across resolutions. Leading Computer Use implementations (Anthropic, Playwright-based agents) now maintain parallel accessibility context alongside screenshots to generate element-relative actions.
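The query-then-verify order described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: `AXNode` is a hypothetical simplified accessibility node, and `looks_enabled` stands in for whatever vision check the agent runs on a screenshot crop of the element.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class AXNode:
    """Simplified accessibility-tree node (hypothetical structure)."""
    role: str
    name: str
    disabled: bool = False
    children: List["AXNode"] = field(default_factory=list)


def walk(node: AXNode) -> Iterator[AXNode]:
    """Depth-first traversal of the accessibility tree."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: AXNode, role: str, name: str) -> Optional[AXNode]:
    """Return the first node matching role and accessible name, if any."""
    for node in walk(root):
        if node.role == role and node.name == name:
            return node
    return None


def plan_click(
    root: AXNode,
    role: str,
    name: str,
    looks_enabled: Callable[[AXNode], bool],
) -> str:
    """Decide on an action: the tree is consulted first, vision only verifies."""
    node = find(root, role, name)
    if node is None:
        # Not in the tree yet, e.g. a loading skeleton is still showing.
        return "wait"
    if node.disabled:
        # The tree already reports the element as disabled; no vision call needed.
        return "skip"
    if not looks_enabled(node):
        # The tree says enabled but the element looks inactive on screen.
        return "skip"
    return "click"
```

The vision check runs last and only on an element the tree has already located, so a wrong visual guess can veto an action but cannot point the agent at the wrong place.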
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:34:40.587826+00:00— report_created — created