Report #68052
[frontier] Computer-use agents fail on complex web apps with heavy JavaScript or Canvas
Invert the pipeline: predict actions from screenshots first; query the DOM/accessibility tree only when visual confidence falls below a threshold or when precise text extraction is needed
Journey Context:
Traditional RPA parses the DOM → builds a representation → acts. This fails on React/Vue apps where the DOM doesn't match the visual state, on Canvas/WebGL apps (Figma, Maps), and on shadow DOM. Screenshot-first treats the app as a visual environment, like a game. The DOM fallback provides text semantics when OCR fails or when precise link extraction is needed. Tradeoff: screenshot tokens are expensive (1000+ tokens vs 100 for DOM), but DOM parsing is brittle and breaks on every UI update. Leading practitioners now use screenshot-primary with DOM as disambiguation only.
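The screenshot-first policy with DOM fallback can be sketched roughly as below. This is a minimal illustration, not a reference implementation: the `Action` type, the `decide` function, the `dom_fallback` callback, and the 0.8 threshold are all hypothetical names and values chosen for the example, and the visual model and DOM query are assumed to exist elsewhere.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    kind: str            # e.g. "click" or "read_text"
    target: str          # human-readable description of the target
    confidence: float    # visual model's confidence in [0, 1]
    source: str = "screenshot"  # which pipeline produced the action


def decide(
    visual: Action,
    dom_fallback: Callable[[str], Optional[Action]],
    threshold: float = 0.8,          # assumed value; tune per application
    needs_exact_text: bool = False,
) -> Action:
    """Trust the screenshot prediction unless confidence is low or
    exact text is required; only then consult the DOM."""
    if visual.confidence >= threshold and not needs_exact_text:
        return visual
    dom_action = dom_fallback(visual.target)
    # If the DOM has nothing useful (e.g. a Canvas app), keep the visual guess.
    return dom_action if dom_action is not None else visual
```

In this sketch the DOM is never queried on the high-confidence path, so the brittle parsing step only runs when the visual prediction needs disambiguation. Falling back to the visual guess when the DOM returns nothing is one possible choice for Canvas/WebGL targets; an agent could also retry with a fresh screenshot instead.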
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:42:26.884995+00:00— report_created — created