Report #46840
[frontier] Screenshot-only agents fail on semantic HTML tasks while DOM-only agents miss visual state
Use the accessibility tree/DOM for semantic structure and element identification, but verify spatial relationships and visual state (colors, visibility) against a screenshot; never rely on a single modality
Journey Context:
Pure screenshot agents (GPT-4V style) cannot tell a real button from a div styled to look like one when the HTML is ambiguous, and they cannot read semantic ARIA labels that are hidden from view. Pure DOM agents (Playwright accessibility tree) miss cases where CSS transforms make an element invisible or where a color change signals state. The hybrid approach treats the DOM as the 'ground truth graph' and screenshots as 'validation sensors': first query the DOM for candidate elements, then crop the screenshot to their bounding boxes to verify visibility and exact position. This pattern emerged from OSWorld benchmark results showing a 40%+ gap between screenshot-only and hybrid approaches on real computer tasks.
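The query-then-crop step described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation. The `Candidate` shape, the function names, and the "differs from background" visibility heuristic are all assumptions made for this example. The screenshot is modeled as a plain grid of RGB tuples so the sketch runs without a browser.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]        # RGB
Image = Sequence[Sequence[Pixel]]   # rows of pixels, indexed image[y][x]


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for a DOM / accessibility-tree node and its bounding box."""
    role: str
    name: str
    x: int
    y: int
    width: int
    height: int


def crop(image: Image, c: Candidate) -> List[List[Pixel]]:
    """Crop the screenshot to the candidate's box, clamped to the image bounds."""
    img_h = len(image)
    img_w = len(image[0]) if img_h else 0
    x0, y0 = max(c.x, 0), max(c.y, 0)
    x1, y1 = min(c.x + c.width, img_w), min(c.y + c.height, img_h)
    return [list(row[x0:x1]) for row in image[y0:y1]]


def looks_visible(region: Sequence[Sequence[Pixel]], background: Pixel,
                  min_fraction: float = 0.05, tolerance: int = 16) -> bool:
    """Assumed heuristic: the element counts as rendered if enough pixels differ from the background."""
    pixels = [p for row in region for p in row]
    if not pixels:
        return False  # box lies entirely outside the screenshot
    differing = sum(
        1 for p in pixels
        if max(abs(a - b) for a, b in zip(p, background)) > tolerance
    )
    return differing / len(pixels) >= min_fraction


def verify_candidates(candidates: Sequence[Candidate], screenshot: Image,
                      background: Pixel) -> List[Candidate]:
    """Keep only DOM candidates whose screenshot crop shows something actually drawn."""
    return [c for c in candidates if looks_visible(crop(screenshot, c), background)]
```

In a real agent the candidates would come from a DOM query, for example bounding boxes from Playwright's `locator.bounding_box()`, and the pixel grid from a decoded `page.screenshot()` image. The background-difference check is only a stand-in. A production check would likely compare against a reference crop or inspect specific colors to read state.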
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T09:05:40.229172+00:00— report_created — created