Report #66206
[frontier] Screenshot-only agents cannot determine semantic UI state \(checked/unchecked, expanded/collapsed, disabled/enabled\) or navigate complex ARIA patterns
Combine the platform accessibility tree \(AXTree\) with screenshots: use the AXTree for semantic structure and element coordinates, and vision to verify visual appearance. Map AX node IDs to screenshot coordinates via platform APIs.
Journey Context:
Screenshots lack semantic metadata \(an agent often can't tell whether a checkbox is checked when the two states look visually similar\), and AXTrees lack visual styling information. The hybrid approach lets the agent 'read' structure via the AX tree and 'see' appearance via the screenshot. Critical implementation detail: AX node IDs must be mapped to screenshot coordinates using platform-specific APIs \(macOS Accessibility API, Windows UI Automation, Linux AT-SPI\). Tradeoff: AX tree extraction adds significant latency.
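The coordinate-mapping step can be sketched roughly as follows. This is a minimal illustration, not a tested integration with any platform API. The `AXNode` and `Capture` types, their fields, and the helper names are hypothetical. The sketch assumes the accessibility layer reports element bounds in screen coordinates \(logical points\), and that the screenshot's screen origin and pixels-per-point scale are known. Real platform APIs differ in coordinate origin and units, so check those details before relying on this.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AXNode:
    """Hypothetical, platform-neutral view of one accessibility node."""
    node_id: str
    role: str
    checked: Optional[bool]  # None when the role has no checked state
    enabled: bool
    # Bounds in screen coordinates (logical points); an assumption of this sketch.
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class Capture:
    """Hypothetical description of how a screenshot relates to the screen."""
    origin_x: float  # screen position of the screenshot's top-left corner
    origin_y: float
    scale: float     # screenshot pixels per logical point (e.g. 2.0 on HiDPI)
    width_px: int
    height_px: int


def to_pixel_box(node: AXNode, cap: Capture) -> Optional[Tuple[int, int, int, int]]:
    """Map AX bounds to a (left, top, right, bottom) box in screenshot pixels.

    Returns None when the node lies entirely outside the captured region.
    """
    left = (node.x - cap.origin_x) * cap.scale
    top = (node.y - cap.origin_y) * cap.scale
    right = left + node.width * cap.scale
    bottom = top + node.height * cap.scale

    # Clip to the screenshot so partially visible elements stay usable.
    left, top = max(0.0, left), max(0.0, top)
    right, bottom = min(float(cap.width_px), right), min(float(cap.height_px), bottom)
    if right <= left or bottom <= top:
        return None
    return (round(left), round(top), round(right), round(bottom))


def describe(node: AXNode, cap: Capture) -> dict:
    """Combine semantic state (from the AX node) with a pixel region to inspect visually."""
    box = to_pixel_box(node, cap)
    center = None if box is None else ((box[0] + box[2]) // 2, (box[1] + box[3]) // 2)
    return {
        "id": node.node_id,
        "role": node.role,
        "checked": node.checked,
        "enabled": node.enabled,
        "pixel_box": box,       # region to crop for visual verification
        "click_point": center,  # where a pointer action would land
    }


capture = Capture(origin_x=100, origin_y=200, scale=2.0, width_px=800, height_px=600)
checkbox = AXNode("n1", "checkbox", checked=True, enabled=True,
                  x=110, y=220, width=20, height=20)
offscreen = AXNode("n2", "button", checked=None, enabled=True,
                   x=0, y=0, width=10, height=10)

print(describe(checkbox, capture))
print(describe(offscreen, capture))
```

In this sketch the semantic fields \(`checked`, `enabled`\) come only from the AX node, while `pixel_box` tells the vision side which region of the screenshot to crop and verify. A node that falls outside the captured region yields no box, so the agent can skip visual verification for it rather than act on a wrong location.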
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T17:36:24.232753+00:00— report_created — created