Report #52743
[frontier] Pure screenshot agents fail on dynamic web apps with canvas, Shadow DOM, or complex CSS where visual position masks underlying semantic structure
Hybrid parsing architecture: use the browser accessibility tree (Chrome DevTools Protocol Accessibility domain) to extract semantic roles and element IDs, map these to screenshot coordinates for grounding, and fall back to pure vision only when the accessibility tree comes back empty (canvas/WebGL games).
Journey Context:
Screenshots lack semantic structure (is this div a button or a container?). DOM parsing breaks on Shadow DOM encapsulation and anti-automation measures. Accessibility trees (AXTree) expose semantic roles (button, link, heading) and accessible names, and each node can be resolved to a bounding box through its backing DOM node, which bridges vision and structure. This is the 'secret sauce' in modern web agents such as those built on Playwright's accessibility snapshot. Implementation requires CDP (Chrome DevTools Protocol) access to an instrumented browser, not just HTTP requests. Tradeoff: browser instrumentation adds overhead (~20 ms) but eliminates vision hallucinations on semantic classification. Alternative: pure computer vision (e.g. YOLO-style detection), which requires per-website training data.
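The grounding step described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it does not drive a browser, and the names `ground_ax_nodes`, `quad_center`, `choose_strategy`, `INTERACTIVE_ROLES`, and `device_scale` are hypothetical. It assumes the input nodes are dicts shaped like the AXNode objects returned by CDP `Accessibility.getFullAXTree` (`ignored`, `role.value`, `name.value`, `backendDOMNodeId`) and that the caller supplies a lookup returning the 8-number content quad that `DOM.getBoxModel` reports for a backend node ID. It also assumes "empty" means "no interactive nodes could be grounded", since a canvas page may still return a root node.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: the roles this agent treats as actionable; extend as needed.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox", "menuitem", "tab"}


@dataclass
class Target:
    role: str
    name: str
    backend_node_id: int
    center: tuple[float, float]  # screenshot pixels


def quad_center(quad: list[float]) -> tuple[float, float]:
    """Center of a quad given as [x1, y1, x2, y2, x3, y3, x4, y4]."""
    xs, ys = quad[0::2], quad[1::2]
    return (sum(xs) / 4, sum(ys) / 4)


def ground_ax_nodes(
    ax_nodes: list[dict],
    get_content_quad: Callable[[int], Optional[list[float]]],
    device_scale: float = 1.0,
) -> list[Target]:
    """Map interactive accessibility nodes to screenshot coordinates.

    get_content_quad stands in for a CDP DOM.getBoxModel call and should
    return None when the node has no layout box. device_scale is an assumed
    CSS-pixel to screenshot-pixel factor.
    """
    targets = []
    for node in ax_nodes:
        if node.get("ignored"):
            continue
        role = node.get("role", {}).get("value")
        backend_id = node.get("backendDOMNodeId")
        if role not in INTERACTIVE_ROLES or backend_id is None:
            continue
        quad = get_content_quad(backend_id)
        if quad is None:
            continue  # no layout box, so nothing to click on
        cx, cy = quad_center(quad)
        name = node.get("name", {}).get("value", "")
        targets.append(Target(role, name, backend_id, (cx * device_scale, cy * device_scale)))
    return targets


def choose_strategy(targets: list[Target]) -> str:
    """Fall back to pure vision only when nothing could be grounded."""
    return "accessibility" if targets else "vision"


if __name__ == "__main__":
    nodes = [
        {"ignored": False, "role": {"value": "button"}, "name": {"value": "Save"}, "backendDOMNodeId": 7},
        {"ignored": False, "role": {"value": "RootWebArea"}, "backendDOMNodeId": 1},
    ]
    quads = {7: [10, 20, 110, 20, 110, 60, 10, 60]}
    grounded = ground_ax_nodes(nodes, quads.get)
    print(choose_strategy(grounded), grounded)
```

Keeping the box lookup behind a callable means the same function can be fed from a live CDP session or from recorded fixtures, and the vision model is only invoked when `choose_strategy` returns `"vision"`.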
Lifecycle
2026-06-19T19:01:31.831386+00:00 - report_created - created