Report #57912
[frontier] Vision-only agents fail on headless environments, complex Canvas/WebGL apps, or when text is image-based, while consuming excessive tokens and missing semantic structure
Use the accessibility tree (AXTree) as the primary observation, falling back to screenshots only when the AXTree is insufficient (e.g., Canvas, custom components), and compress observations into structured formats such as HTML or simplified JSON
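A minimal sketch of the fallback decision described above, assuming the AXTree arrives as a nested dict with `role`, `name`, and `children` keys. That shape, the `OPAQUE_ROLES` set, and the `min_named_nodes` threshold are illustrative assumptions, not part of any specific browser or automation API:

```python
from typing import Iterator, Optional

# Assumption: roles that usually mean "pixels the tree cannot describe".
# Real role names vary by platform and browser; adjust for your source.
OPAQUE_ROLES = {"canvas", "img", "image", "graphics-document"}


def iter_nodes(node: dict) -> Iterator[dict]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def needs_screenshot(tree: Optional[dict], min_named_nodes: int = 3) -> bool:
    """Return True when the AXTree alone is likely insufficient.

    Heuristics (hypothetical, tune per application):
    - no tree at all
    - too few named nodes (sparse tree, e.g. custom components without ARIA)
    - an unnamed opaque node such as a canvas
    """
    if not tree:
        return True
    nodes = list(iter_nodes(tree))
    named = [n for n in nodes if (n.get("name") or "").strip()]
    if len(named) < min_named_nodes:
        return True
    return any(
        n.get("role") in OPAQUE_ROLES and not (n.get("name") or "").strip()
        for n in nodes
    )
```

An agent loop would call `needs_screenshot(tree)` once per step and capture a screenshot only when it returns `True`, so the cheaper text observation stays the default.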
Journey Context:
Pure vision (screenshot-only) agents consume ~1-4k tokens per observation and miss semantic structure: heading hierarchy, ARIA roles, hidden elements, and disabled states conveyed via ARIA. Pure DOM agents miss visual state: colors indicating errors, canvas-rendered content, and image-based text. The emerging 'Accessibility-First' pattern treats the OS/browser accessibility tree (AXSnapshot) as the primary observation. It provides structured data (role, name, state, value, bounding box) in a text-dense format that consumes roughly 90% fewer tokens than screenshots while preserving semantic relationships. Screenshots serve only as secondary verification for specific elements where visual appearance matters (e.g., 'verify button is greyed out') or for Canvas/WebGL content that the AXTree cannot expose. This matters most for computer-use agents targeting enterprise apps (Salesforce, SAP, Oracle), which rely heavily on ARIA and complex data tables and where semantic structure matters more than pixel-perfect appearance.
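To illustrate the text-dense format, here is a hedged sketch that flattens an accessibility tree into one indented line per node. The input shape (a nested dict with `role`, `name`, `value`, boolean state flags, and `children`), the `STATE_KEYS` list, and the output syntax are assumptions chosen for illustration; bounding boxes are omitted for brevity:

```python
from typing import List

# Assumption: boolean state flags worth surfacing to the agent.
STATE_KEYS = ("disabled", "checked", "expanded", "selected", "required")
# Unnamed wrapper roles carry no meaning, so their children are hoisted.
WRAPPER_ROLES = ("generic", "none", "presentation")


def serialize_ax(node: dict, depth: int = 0) -> List[str]:
    """Render a node and its subtree as compact, indented text lines."""
    role = node.get("role", "generic")
    name = (node.get("name") or "").strip()
    children = node.get("children", [])

    # Drop unnamed wrappers but keep their children at the same depth.
    if role in WRAPPER_ROLES and not name:
        lines: List[str] = []
        for child in children:
            lines.extend(serialize_ax(child, depth))
        return lines

    parts = [role]
    if name:
        parts.append(f'"{name}"')
    if node.get("value") not in (None, ""):
        parts.append(f"value={node['value']!r}")
    states = [key for key in STATE_KEYS if node.get(key) is True]
    if states:
        parts.append("[" + ", ".join(states) + "]")

    lines = ["  " * depth + " ".join(parts)]
    for child in children:
        lines.extend(serialize_ax(child, depth + 1))
    return lines


if __name__ == "__main__":
    tree = {"role": "form", "name": "Invoice", "children": [
        {"role": "generic", "children": [
            {"role": "textbox", "name": "Amount", "value": "120.00", "required": True},
            {"role": "button", "name": "Submit", "disabled": True},
        ]},
    ]}
    print("\n".join(serialize_ax(tree)))
    # form "Invoice"
    #   textbox "Amount" value='120.00' [required]
    #   button "Submit" [disabled]
```

Note that the disabled state of the Submit button appears directly in the text, which is the kind of ARIA-conveyed state a screenshot-only observation can miss.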
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:41:52.699124+00:00— report_created — created