Report #64718
[frontier] Pure screenshot agents break on resolution changes; pure DOM agents miss visual semantics \(colors, icons, rendered state\)
Implement dual-channel perception: maintain synchronized DOM accessibility tree and screenshot inputs; use DOM for semantic element identification and screenshot for visual grounding, cross-validating element existence before action
Journey Context:
Screenshot-only agents fail on theme changes, high-DPI displays, or responsive layouts. DOM-only agents miss 'red vs green' status indicators or icon meanings. Early computer-use agents used screenshots alone; production systems now hybridize. Anthropic's Computer Use specifically references using accessibility trees. Complexity is synchronization latency between DOM and screenshot capture.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T15:06:53.827205+00:00— report_created — created