Report #60722
[frontier] Hybrid DOM-Vision State Conflicts: Agents that mix DOM-based element handles with screenshot analysis build contradictory world models in which the DOM says an element exists but vision shows a modal overlay blocking it
Choose a single source of truth: either pure vision with structured parsing (OmniParser) or pure DOM with accessibility trees; if mixing is required, implement a reconciliation layer that masks 'invisible' DOM elements via vision verification before acting
Journey Context:
Browser-use and Stagehand popularized hybrid approaches: DOM for precise element handles, vision for semantic understanding. However, DOM state and visual state diverge in several situations: during CSS animations, when modals or dialogs exist in the DOM but visually obscure other elements, across iframe boundaries, and inside Shadow DOM. If the agent fetches an element's coordinates from the DOM but does not verify its visibility against a screenshot, it clicks 'invisible' elements behind modals. Conversely, pure vision misses semantic structure. The reconciliation pattern uses vision as the 'visibility mask': generate candidate elements from the DOM, render them to a mask layer, and compare that layer with the screenshot to detect occlusion. Only unoccluded elements are actionable. This adds latency but prevents the common 'click through modal' failure mode in hybrid agents.
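The filtering step of the reconciliation pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Browser-use or Stagehand implement it. The names Box, Candidate, visible_fraction and actionable are hypothetical, and so is the 0.9 visibility threshold. The sketch also assumes a separate vision step has already turned the screenshot into a list of occluder rectangles (for example a detected modal), and it replaces true mask rendering with a coarse grid of sample points.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in screenshot pixel coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


@dataclass(frozen=True)
class Candidate:
    """A DOM-derived element: a selector plus its reported bounding box."""
    selector: str
    box: Box


def visible_fraction(box: Box, occluders: Sequence[Box], samples: int = 5) -> float:
    """Estimate the share of `box` not covered by any occluder.

    Samples the centres of a samples x samples grid, so overlapping
    occluders are not double counted.
    """
    if box.right <= box.left or box.bottom <= box.top:
        return 0.0  # zero-area boxes are treated as not visible
    width = box.right - box.left
    height = box.bottom - box.top
    visible = 0
    for i in range(samples):
        for j in range(samples):
            x = box.left + (i + 0.5) * width / samples
            y = box.top + (j + 0.5) * height / samples
            if not any(o.contains(x, y) for o in occluders):
                visible += 1
    return visible / (samples * samples)


def actionable(
    candidates: Sequence[Candidate],
    occluders: Sequence[Box],
    min_visible: float = 0.9,  # assumed threshold, tune per application
) -> list:
    """Keep only candidates whose box is (almost) entirely unoccluded."""
    return [c for c in candidates if visible_fraction(c.box, occluders) >= min_visible]


# Hypothetical page: a modal reported by the vision step covers the "#buy" button.
modal = Box(100, 100, 500, 400)
candidates = [
    Candidate("#buy", Box(200, 200, 300, 240)),   # inside the modal's area
    Candidate("#close", Box(520, 20, 560, 60)),   # outside the modal's area
]
print([c.selector for c in actionable(candidates, [modal])])  # ['#close']
```

One limitation of this sketch: controls that belong to the overlay itself (such as the modal's own buttons) fall inside the occluder rectangle and would be filtered out too, so a real reconciliation layer would need to exempt elements that are descendants of the overlay.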
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:24:37.521825+00:00— report_created — created