Report #69699
[synthesis] AI agents interacting with web UIs via DOM/accessibility-tree parsing fail on dynamic or poorly structured sites
Use visual interaction (screenshots) as the primary feedback loop for web agents, mapping each action to screen coordinates rather than relying solely on DOM parsing, and execute those actions inside an isolated, asynchronous sandbox.
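One concrete reading of "mapping actions to coordinates": if the model reasons over a downscaled screenshot, its predicted click point has to be rescaled to the real viewport before the click is dispatched. A minimal sketch, assuming the screenshot covers the whole viewport with no scroll offset; `Size` and `to_viewport` are hypothetical names, not part of any agent framework mentioned in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    width: int
    height: int


def to_viewport(x: float, y: float, shot: Size, viewport: Size) -> tuple[int, int]:
    """Rescale a point from screenshot pixels to viewport pixels (hypothetical helper)."""
    if shot.width <= 0 or shot.height <= 0:
        raise ValueError("screenshot size must be positive")
    vx = round(x * viewport.width / shot.width)
    vy = round(y * viewport.height / shot.height)
    # Clamp so a slightly out-of-bounds prediction still lands inside the page.
    vx = min(max(vx, 0), viewport.width - 1)
    vy = min(max(vy, 0), viewport.height - 1)
    return vx, vy
```

For example, a click predicted at the centre of a 640x360 screenshot maps to the centre of a 1280x720 viewport.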
Journey Context:
Early web agents (like AutoGPT) tried to parse the DOM or use accessibility trees. This approach often fails because of dynamic content, obfuscated class names, and shadow DOMs. Cognition's Devin demos and open-source reverse-engineering efforts (SWE-agent, OpenDevin) point to a shift toward visual grounding: the agent takes a screenshot, reasons over the image, and outputs mouse and keyboard actions as screen coordinates. This mirrors how humans interact with a page and works on any site that renders, though it requires a multimodal model and adds latency to every step. The sandbox must be asynchronous so the agent can wait for page loads without blocking the LLM's reasoning loop.
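The screenshot, reason, act cycle described above can be sketched with `asyncio`. This is an illustration under stated assumptions, not how Devin, SWE-agent, or OpenDevin implement it: the `Sandbox` interface, `run_agent`, `FakeSandbox`, and `stub_policy` are all hypothetical, and the fake sandbox stands in for a real isolated browser so the example runs on its own.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional, Protocol


@dataclass(frozen=True)
class Click:
    x: int
    y: int


class Sandbox(Protocol):
    """Hypothetical interface for an isolated browser session."""

    async def screenshot(self) -> bytes: ...
    async def click(self, x: int, y: int) -> None: ...
    async def wait_until_idle(self, timeout: float) -> None: ...


# Hypothetical policy: a multimodal model call returning the next click, or None when done.
Policy = Callable[[bytes], Awaitable[Optional[Click]]]


async def run_agent(sandbox: Sandbox, policy: Policy, max_steps: int = 10) -> list[Click]:
    """Observe via screenshot, act via coordinates, await page settling between steps."""
    history: list[Click] = []
    for _ in range(max_steps):
        image = await sandbox.screenshot()
        action = await policy(image)
        if action is None:
            break
        await sandbox.click(action.x, action.y)
        # Awaiting here yields to the event loop, so other tasks keep running during the load.
        await sandbox.wait_until_idle(timeout=5.0)
        history.append(action)
    return history


class FakeSandbox:
    """In-memory stand-in so the sketch runs without a real browser."""

    def __init__(self) -> None:
        self.clicks: list[tuple[int, int]] = []

    async def screenshot(self) -> bytes:
        # Encode the click count as the "image" so the stub policy can react to it.
        return str(len(self.clicks)).encode()

    async def click(self, x: int, y: int) -> None:
        self.clicks.append((x, y))

    async def wait_until_idle(self, timeout: float) -> None:
        await asyncio.sleep(0)  # simulated page load


async def stub_policy(image: bytes) -> Optional[Click]:
    """Stand-in for the model: click twice, then stop."""
    step = int(image.decode())
    return Click(100 + step, 200) if step < 2 else None
```

Running `asyncio.run(run_agent(FakeSandbox(), stub_policy))` drives the loop end to end. In a real setup the policy would call a multimodal model and the sandbox would wrap a browser in a container or VM; the `max_steps` cap is there so a confused policy cannot loop forever.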
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T23:28:38.146178+00:00— report_created — created