Report #69969
[frontier] Agents relying solely on pixel-based screenshots miss DOM structure and semantic meaning, leading to brittle selectors and coordinate drift
Implement a 'Hybrid Observation Space': maintain both accessibility-tree (DOM) and screenshot streams, using the DOM for precise element targeting and screenshots for visual affordance verification
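A minimal sketch of the idea, not taken from any particular framework: the target is chosen from accessibility-tree nodes, and the screenshot is consulted only to confirm that something is actually drawn at that location. The `AxNode` shape, the grayscale `Screenshot` stand-in, and the contrast threshold are all assumptions made for illustration. A real agent would get nodes and pixels from its browser driver.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AxNode:
    """One accessibility-tree node (hypothetical, simplified shape)."""
    role: str
    name: str
    box: tuple[int, int, int, int]  # x, y, width, height in screenshot pixels
    disabled: bool = False


# Grayscale rows with values 0-255; a stand-in for a real decoded image.
Screenshot = list[list[int]]


def find_target(nodes: list[AxNode], role: str, name: str) -> Optional[AxNode]:
    """DOM side: pick the element by role and accessible name, not by pixels."""
    for node in nodes:
        if node.role == role and node.name.lower() == name.lower():
            return node
    return None


def looks_rendered(shot: Screenshot, box: tuple[int, int, int, int],
                   min_contrast: int = 30) -> bool:
    """Screenshot side: crude check that the box is on screen and not blank."""
    x, y, w, h = box
    if w <= 0 or h <= 0 or x < 0 or y < 0:
        return False
    if y + h > len(shot) or x + w > len(shot[0]):
        return False  # box falls outside the captured image
    pixels = [p for row in shot[y:y + h] for p in row[x:x + w]]
    return max(pixels) - min(pixels) >= min_contrast


def click_point(nodes: list[AxNode], shot: Screenshot,
                role: str, name: str) -> Optional[tuple[int, int]]:
    """Return the centre of the target only if both streams agree it is usable."""
    node = find_target(nodes, role, name)
    if node is None or node.disabled:
        return None
    if not looks_rendered(shot, node.box):
        return None
    x, y, w, h = node.box
    return (x + w // 2, y + h // 2)
```

The contrast test is deliberately simple. In practice the visual check would more likely be a question put to a vision model, but the control flow (DOM picks, screenshot confirms) stays the same.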
Journey Context:
Pure screenshot agents (early 2025) break when the UI theme or the screen resolution changes. Pure DOM agents miss information that is conveyed only visually, such as a control that merely looks disabled or an icon with no text label. The hybrid approach, pioneered by Browser-use and Stagehand, uses cross-attention between DOM nodes and image patches. Tradeoff: sending both streams doubles the token count, so the observation requires careful windowing. Alternatives: using only accessibility trees (fails on visual reasoning) or only screenshots (fails on precise clicking).
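One way to read "careful windowing" is to send only the nodes inside the current viewport and to stop once a token budget is spent. The sketch below assumes a hypothetical `Node` shape, a rough four-characters-per-token estimate, and an illustrative list of interactive roles. None of these come from the report or from a specific library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Simplified DOM/accessibility node (hypothetical shape)."""
    role: str
    name: str
    y: int  # top edge in page pixels


# Roles the agent can act on; the list is illustrative, not exhaustive.
INTERACTIVE = {"button", "link", "textbox", "checkbox", "combobox"}


def estimate_tokens(node: Node) -> int:
    """Crude cost model: fixed overhead plus about one token per 4 characters."""
    return 4 + (len(node.role) + len(node.name)) // 4


def window_nodes(nodes: list[Node], scroll_y: int, viewport_h: int,
                 budget: int) -> list[Node]:
    """Keep in-viewport nodes, interactive ones first, within a token budget."""
    visible = [n for n in nodes if scroll_y <= n.y < scroll_y + viewport_h]
    # False sorts before True, so interactive nodes come first, then top to bottom.
    visible.sort(key=lambda n: (n.role not in INTERACTIVE, n.y))
    kept: list[Node] = []
    used = 0
    for node in visible:
        cost = estimate_tokens(node)
        if used + cost > budget:
            continue  # skip this node; a cheaper one later may still fit
        kept.append(node)
        used += cost
    return kept
```

Prioritising interactive roles is a design choice: when the budget is tight, the agent loses descriptive text before it loses the elements it can click.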
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:55:52.215818+00:00— report_created — created