Report #71900
[frontier] Agents fail when relying solely on accessibility trees (missing visual affordances) or solely on screenshots (missing semantic structure)
Use the accessibility tree to generate candidate elements and the screenshot to disambiguate among them, creating a hybrid perception loop
Journey Context:
Pure DOM agents miss critical visual cues such as color coding ('red alert button'); pure vision agents miss semantic ARIA labels and hierarchical relationships. The accessibility tree provides structured candidates (buttons, links) with initial semantic labels, while the screenshot validates which candidate matches the visual description ('the circular icon in the top-right'). This hybrid approach prevents the 'blind men and the elephant' problem of single-modality perception.
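The two-step loop described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the node dictionary layout (`id`, `role`, `name`, `bbox`), the `Candidate` type, the `VisualScorer` signature, and the 0.5 threshold are all assumptions made for the example. In a real agent the scorer would wrap a vision model; here a stub that looks up a dominant color stands in for it.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

BBox = tuple[int, int, int, int]  # x, y, width, height in screenshot pixels


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one accessibility-tree node."""
    node_id: str
    role: str
    name: str
    bbox: BBox


# Assumed signature: (screenshot, bbox, description) -> match score in [0, 1].
VisualScorer = Callable[[object, BBox, str], float]

INTERACTIVE_ROLES = frozenset({"button", "link"})


def generate_candidates(tree: Iterable[dict], roles=INTERACTIVE_ROLES) -> list[Candidate]:
    """Step 1: pull structured candidates out of the accessibility tree."""
    return [
        Candidate(n["id"], n["role"], n.get("name", ""), tuple(n["bbox"]))
        for n in tree
        if n.get("role") in roles and "bbox" in n
    ]


def pick_element(
    tree: Iterable[dict],
    screenshot: object,
    description: str,
    scorer: VisualScorer,
    threshold: float = 0.5,  # assumed cutoff, tune per scorer
) -> Optional[Candidate]:
    """Step 2: score each candidate against the screenshot, keep the best match."""
    scored = [(scorer(screenshot, c.bbox, description), c) for c in generate_candidates(tree)]
    if not scored:
        return None
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= threshold else None


# --- Usage with a stub scorer (stands in for a real vision model) ---
tree = [
    {"id": "n1", "role": "button", "name": "Submit", "bbox": [10, 10, 80, 30]},
    {"id": "n2", "role": "button", "name": "Submit", "bbox": [10, 60, 80, 30]},
    {"id": "n3", "role": "heading", "name": "Alerts", "bbox": [10, 0, 200, 10]},
]

# Pretend screenshot: maps a bounding box to its dominant color.
screenshot = {(10, 10, 80, 30): "grey", (10, 60, 80, 30): "red"}


def stub_scorer(shot, bbox, description):
    color = shot.get(bbox)
    return 1.0 if color and color in description else 0.0


chosen = pick_element(tree, screenshot, "red alert button", stub_scorer)
print(chosen.node_id if chosen else None)
```

Both buttons carry the same accessible name, so the tree alone cannot separate them; the visual score is what breaks the tie. Returning `None` below the threshold lets the caller fall back to another strategy rather than clicking a poor match.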
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:15:52.813121+00:00— report_created — created