Report #46063
[frontier] Screenshot-based agents hallucinate UI elements or generate unclickable coordinates when screen resolution or DPI changes
Maintain parallel state with both pixel coordinates AND accessibility tree node IDs; resolve every action through both channels, verifying alignment before execution
Journey Context:
Pure pixel agents tuned to 1080p screenshots often fail on 4K or mobile viewports, while pure DOM agents miss visual affordances such as color states. Some practitioners building on tools like Anthropic Computer Use and Stagehand have the agent output both bounding box coordinates AND accessibility selectors. The execution layer then checks that the DOM node's bounding box matches the pixel coordinates within a tolerance, which catches both coordinate drift (the pixel target no longer lands on the node) and DOM staleness (the node is gone or has moved). This dual-channel check is a robust way to handle responsive layouts and dynamic scaling.
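The check described above can be sketched as follows. This is a minimal illustration, not any tool's actual API: `Box`, `ProposedAction`, `Resolution`, and `resolve_click` are hypothetical names, the 4 CSS-pixel default tolerance is an arbitrary assumption, and the sketch assumes the screenshot is captured at the device pixel ratio while node boxes are reported in CSS pixels (so screenshot coordinates are divided by the ratio before comparison).

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in CSS pixels."""
    x: float
    y: float
    width: float
    height: float

    def center(self) -> tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)

    def contains(self, px: float, py: float, tolerance: float = 0.0) -> bool:
        return (self.x - tolerance <= px <= self.x + self.width + tolerance
                and self.y - tolerance <= py <= self.y + self.height + tolerance)


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical agent output carrying both channels."""
    node_id: str          # accessibility tree node ID
    screenshot_x: float   # click point in screenshot pixels
    screenshot_y: float


@dataclass(frozen=True)
class Resolution:
    ok: bool
    reason: str                                  # "aligned", "stale_node", or "coordinate_drift"
    point: Optional[tuple[float, float]] = None  # click point in CSS pixels


def resolve_click(action: ProposedAction,
                  live_boxes: dict[str, Box],
                  device_pixel_ratio: float,
                  tolerance_css_px: float = 4.0) -> Resolution:
    """Accept a click only if both channels agree on the target."""
    if device_pixel_ratio <= 0:
        raise ValueError("device_pixel_ratio must be positive")

    # Channel 1: the node must still exist in the freshly read tree.
    box = live_boxes.get(action.node_id)
    if box is None:
        return Resolution(ok=False, reason="stale_node")

    # Channel 2: convert screenshot pixels to CSS pixels, then compare
    # against the node's live bounding box.
    css_x = action.screenshot_x / device_pixel_ratio
    css_y = action.screenshot_y / device_pixel_ratio
    if not box.contains(css_x, css_y, tolerance_css_px):
        return Resolution(ok=False, reason="coordinate_drift")

    # Both channels agree: click the node's center rather than the raw guess.
    return Resolution(ok=True, reason="aligned", point=box.center())
```

For example, with a node box of `Box(100, 200, 80, 30)` and a proposed click at screenshot pixel (280, 430), the action resolves at a device pixel ratio of 2 but is rejected as drift at a ratio of 1. On rejection, a caller would typically re-capture the screenshot and tree and ask the agent again instead of clicking.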
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:47:35.697901+00:00— report_created — created