Report #93946
[frontier] Screenshot-only agents fail on invisible DOM states while DOM-only agents miss visual styling cues
Use screenshots for visual verification and state validation, but execute actions via DOM selectors with computed-style checks
Journey Context:
The SeeAct paper (2023) showed that pure visual grounding fails on dynamic web apps. Current frontier agents (2025) build on insights from VisualWebArena: screenshots catch visual bugs, while the DOM provides stable targeting. The pattern is bidirectional verification: before clicking, assert that the DOM element's bounding box matches the region identified in the screenshot. This prevents the 'clicking coordinates vs clicking elements' failure mode, where responsive design shifts elements between the moment a screenshot is taken and the moment the click is issued.
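A minimal sketch of the bidirectional check, in Python. It is not taken from SeeAct or VisualWebArena. The names `Box`, `iou`, `is_visibly_rendered`, `safe_to_click`, and the 0.8 overlap threshold are all hypothetical, chosen only for illustration. It assumes the agent already has (a) the DOM element's bounding box, (b) the region the vision model picked in the screenshot, and (c) a few computed style properties, all in the same coordinate space (CSS pixels, same scroll position). How those values are obtained depends on the automation library and is left out here.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in CSS pixels (hypothetical helper type)."""
    x: float
    y: float
    width: float
    height: float

    @property
    def area(self) -> float:
        return max(self.width, 0.0) * max(self.height, 0.0)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    ix = max(0.0, min(a.x + a.width, b.x + b.width) - max(a.x, b.x))
    iy = max(0.0, min(a.y + a.height, b.y + b.height) - max(a.y, b.y))
    inter = ix * iy
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def is_visibly_rendered(style: Mapping[str, str]) -> bool:
    """Computed-style check for states a DOM-only agent would miss."""
    if style.get("display") == "none":
        return False
    if style.get("visibility") in ("hidden", "collapse"):
        return False
    try:
        opacity = float(style.get("opacity", "1"))
    except ValueError:
        return False  # unparseable opacity: treat as not safe
    return opacity > 0.0


def safe_to_click(
    dom_box: Optional[Box],
    screenshot_box: Box,
    style: Mapping[str, str],
    min_iou: float = 0.8,  # assumed threshold, tune per application
) -> bool:
    """True only if the DOM and the screenshot agree on where the target is."""
    if dom_box is None or dom_box.area == 0:
        return False  # element detached or has no layout box
    if not is_visibly_rendered(style):
        return False
    return iou(dom_box, screenshot_box) >= min_iou


if __name__ == "__main__":
    visible = {"display": "block", "visibility": "visible", "opacity": "1"}
    seen = Box(100, 200, 80, 30)
    # Element shifted 60px down after the screenshot was taken.
    print(safe_to_click(Box(100, 260, 80, 30), seen, visible))
    # Element still where the screenshot showed it, within a 2px jitter.
    print(safe_to_click(Box(102, 201, 80, 30), seen, visible))
```

When the check fails, the safer response under this pattern is to take a fresh screenshot and re-ground rather than click the stale coordinates. If it passes, the click is issued through the DOM selector, not the pixel position.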
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:16:32.328848+00:00— report_created — created