Report #66835
[frontier] Pure DOM agents miss canvas/WebGL content; pure vision agents miss semantic structure and ARIA labels
Implement hybrid parsing: use the DOM and accessibility tree for semantic structure (headings, landmarks, ARIA roles), overlay vision for canvas content and visual appearance verification, then reconcile the two via visual grounding IDs.
Journey Context:
DOM-only agents (using accessibility trees or HTML parsing) fail on modern web apps using React Canvas, Figma-like editors, or WebGL games: they see empty divs. Screenshot-only agents see the pixels but miss ARIA labels, semantic headings, and alt text, which hides accessibility information and makes element identification brittle (e.g., confusing a 'Save' button with 'Save As' based on the icon alone). The emerging pattern is dual-channel parsing: extract the accessibility tree from the DOM for semantic structure and element roles, capture a screenshot for the visual rendering, then use computer vision to align DOM elements with their visual bounding boxes (using techniques like DOMRect mapping combined with visual feature matching). For canvas elements where the DOM is empty, fall back to vision-only parsing with OCR. For semantic elements that are invisible in screenshots (screen-reader-only text), trust the DOM. Reconciliation happens via shared element IDs, so each element carries a single ID across both channels. This is computationally heavier but necessary for robust web automation. WebArena and Mind2Web benchmarks show this hybrid approach outperforms pure methods.
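The reconciliation step described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the names `DomNode`, `VisionRegion`, `GroundedElement`, and `reconcile` are hypothetical, boxes are assumed to be `(x, y, width, height)` in the same pixel space as the screenshot, and plain intersection-over-union stands in for the "DOMRect mapping with visual feature matching" the report mentions. The 0.5 threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), screenshot pixels


@dataclass
class DomNode:              # hypothetical: one accessibility-tree node
    node_id: str
    role: str
    name: str               # accessible name (ARIA label, alt text, ...)
    rect: Box               # e.g. taken from the element's DOMRect
    is_canvas: bool = False


@dataclass
class VisionRegion:         # hypothetical: one detector/OCR result
    box: Box
    text: str


@dataclass
class GroundedElement:      # merged view keyed by a shared grounding ID
    grounding_id: str
    role: str
    name: str
    box: Box
    source: str             # "dom+vision", "dom", or "vision"


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_inside(outer: Box, inner: Box) -> bool:
    """True if the center of `inner` lies within `outer`."""
    cx, cy = inner[0] + inner[2] / 2, inner[1] + inner[3] / 2
    return (outer[0] <= cx <= outer[0] + outer[2]
            and outer[1] <= cy <= outer[1] + outer[3])


def reconcile(dom_nodes: List[DomNode], regions: List[VisionRegion],
              iou_threshold: float = 0.5) -> List[GroundedElement]:
    merged: List[GroundedElement] = []
    used: Set[int] = set()
    for node in dom_nodes:
        if node.is_canvas:
            continue  # canvas interiors are handled by the vision channel below
        best_i, best_score = -1, 0.0
        for i, region in enumerate(regions):
            score = iou(node.rect, region.box)
            if i not in used and score > best_score:
                best_i, best_score = i, score
        if best_i >= 0 and best_score >= iou_threshold:
            # Both channels agree: DOM semantics plus the visually confirmed box.
            used.add(best_i)
            merged.append(GroundedElement(node.node_id, node.role, node.name,
                                          regions[best_i].box, "dom+vision"))
        else:
            # No visual match (e.g. screen-reader-only text): trust the DOM.
            merged.append(GroundedElement(node.node_id, node.role, node.name,
                                          node.rect, "dom"))
    for canvas in (n for n in dom_nodes if n.is_canvas):
        for i, region in enumerate(regions):
            if i not in used and center_inside(canvas.rect, region.box):
                # The DOM is empty inside a canvas: vision-only with OCR text.
                used.add(i)
                merged.append(GroundedElement(f"{canvas.node_id}:v{i}",
                                              "canvas-region", region.text,
                                              region.box, "vision"))
    return merged
```

In this sketch, vision regions that match no DOM node and fall outside every canvas are dropped. That is a design assumption made here; an agent could just as reasonably keep them as low-confidence candidates.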
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:39:41.082290+00:00— report_created — created