Report #26215
[frontier] DOM-based agents fail to interact with modern web apps using Shadow DOM, canvas rendering, or React virtualized lists
Implement a hybrid DOM-visual pipeline: use Playwright's accessibility tree (not just the HTML DOM) to pierce Shadow DOM; fall back to screenshot-based element detection (OmniParser-style) when the accessibility tree comes back empty or for canvas regions; and use a scroll-to-item protocol for virtualized lists rather than coordinate guessing.
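The routing decision described above can be sketched as a small pure function. This is a minimal sketch, not a tested implementation. It assumes the accessibility snapshot arrives as a nested dict with `role`, `name`, and `children` keys; how you obtain that dict depends on your Playwright version. The role list and the names `interactive_nodes` and `choose_strategy` are illustrative, not part of any library.

```python
from typing import Any, Dict, Iterator, List, Optional

Node = Dict[str, Any]

# Roles an agent can act on directly (assumed list, not exhaustive).
INTERACTIVE_ROLES = {
    "button", "link", "textbox", "checkbox",
    "combobox", "menuitem", "tab", "slider",
}


def iter_nodes(node: Node) -> Iterator[Node]:
    """Depth-first walk over a snapshot shaped like {'role', 'name', 'children'}."""
    yield node
    for child in node.get("children") or []:
        yield from iter_nodes(child)


def interactive_nodes(snapshot: Optional[Node]) -> List[Node]:
    """Collect the nodes whose role is in INTERACTIVE_ROLES."""
    if not snapshot:
        return []
    return [n for n in iter_nodes(snapshot) if n.get("role") in INTERACTIVE_ROLES]


def choose_strategy(snapshot: Optional[Node]) -> str:
    """Return 'accessibility' if the tree exposes actionable nodes, else 'vision'.

    An empty snapshot, or one holding only generic or canvas nodes,
    is treated as the signal to switch to screenshot-based detection.
    """
    return "accessibility" if interactive_nodes(snapshot) else "vision"
```

A caller would take a snapshot per page or per region, act on the returned nodes when the result is `"accessibility"`, and hand a screenshot of the region to the vision detector otherwise.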
Journey Context:
Pure DOM agents (BeautifulSoup, raw Playwright page.content()) fail on Web Components (Shadow DOM content is not in the serialized HTML and has to be pierced through JavaScript), canvas-based UIs (Figma, Google Maps, charts), and virtualized infinite-scroll lists (React Window). Screenshot agents handle these cases but are brittle to resolution changes. The insight: Playwright's accessibility tree (backed by the Accessibility domain of the Chrome DevTools Protocol) pierces Shadow DOM automatically and exposes semantic roles (button, link) even when the HTML is encapsulated. For canvas elements, the accessibility tree returns empty or generic 'canvas' nodes; this is the signal to switch to vision-based detection (OmniParser). For virtualized lists, naive scrolling fails because DOM nodes are recycled and off-screen items do not exist in the DOM. You must scroll the list container itself through JS evaluation and wait for the resulting DOM mutations (for example with a MutationObserver) rather than assuming elements exist off-screen. This hybrid approach uses the DOM for semantic structure, vision for rendered content, and the accessibility tree as the bridge between them.
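The scroll-to-item idea for virtualized lists can be sketched as a loop that reads which item indices are currently rendered, scrolls the container toward the target, and waits for the re-render before checking again. This is a sketch under stated assumptions: `VirtualList` is a hypothetical adapter, not a Playwright type. In a real agent its three methods would wrap JS evaluation against the list container, for example by reading an index attribute on rendered rows and waiting on DOM mutations. `FakeList` is an in-memory stand-in so the loop can run without a browser.

```python
from typing import Protocol, Tuple


class VirtualList(Protocol):
    """Hypothetical adapter over one virtualized list in the page."""

    def rendered_range(self) -> Tuple[int, int]: ...  # inclusive item indices in the DOM
    def scroll_by(self, pixels: int) -> None: ...     # scroll the list container
    def wait_for_render(self) -> None: ...            # block until rows are re-rendered


def scroll_to_item(lst: VirtualList, target: int,
                   step_px: int = 600, max_steps: int = 200) -> bool:
    """Scroll until item `target` is rendered; return False if it is never reached."""
    step, prev_dir = step_px, 0
    for _ in range(max_steps):
        first, last = lst.rendered_range()
        if first <= target <= last:
            return True
        direction = 1 if target > last else -1
        if prev_dir and direction != prev_dir:
            step = max(1, step // 2)  # overshot the target: halve the step
        prev_dir = direction
        lst.scroll_by(direction * step)
        lst.wait_for_render()
        # Unchanged range is treated as hitting the end of the list
        # (assumes each step normally moves at least one row).
        if lst.rendered_range() == (first, last):
            return False
    return False


class FakeList:
    """In-memory stand-in: fixed-height rows, only the viewport is 'rendered'."""

    def __init__(self, total: int, item_h: int = 50, viewport: int = 300) -> None:
        self.total, self.item_h, self.viewport = total, item_h, viewport
        self.top = 0    # current scroll offset in pixels
        self.waits = 0  # number of render waits, for inspection

    def rendered_range(self) -> Tuple[int, int]:
        first = self.top // self.item_h
        last = min(self.total - 1, (self.top + self.viewport - 1) // self.item_h)
        return first, last

    def scroll_by(self, pixels: int) -> None:
        max_top = max(0, self.total * self.item_h - self.viewport)
        self.top = min(max_top, max(0, self.top + pixels))

    def wait_for_render(self) -> None:
        self.waits += 1
```

Calling `scroll_to_item(FakeList(total=1000), 120)` walks the fake list down until row 120 is in the rendered window. The point of the design is that the loop only ever trusts what is currently rendered, so recycled nodes cannot be mistaken for the target.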
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:24:07.194908+00:00— report_created — created