Report #67698
[frontier] Pure vision agents miss semantic roles (radio vs. checkbox) and state (selected, disabled); pure accessibility-tree agents miss visual layout and styling cues
Merge Chrome accessibility tree nodes (role, state, name) with their bounding boxes in the screenshot; feed the LLM either a "semantically annotated image" or structured JSON that includes visual coordinates
Journey Context:
Screen readers rely on the accessibility tree (A11y tree), which exposes semantic roles (button vs. link) and states (checked, expanded) that are invisible to pure screenshot models. Conversely, A11y trees lack spatial information (where is the button visually?) and fail on custom widgets without ARIA labels. The fusion pattern (used when pairing Playwright's accessibility snapshots with vision models, and emerging in agents like "Agent S") extracts the A11y tree via CDP (Chrome DevTools Protocol), maps each node to its bounding box on the rendered page, then presents the LLM with either a marked-up image (SoM, Set-of-Mark, style) or a structured representation such as "Button[14]: 'Submit', bbox=(120,300), state=enabled". This is more robust than either modality alone for complex web apps.
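The merge step can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular agent: the input shapes (`node_id`, `role`, `name`, a `states` dict of booleans) and the names `FusedElement`, `fuse`, and `to_prompt_lines` are hypothetical. In a real pipeline the nodes and boxes might come from CDP calls such as `Accessibility.getFullAXTree` and `DOM.getBoxModel`, but that wiring is an assumption and is left out here.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # x, y, width, height in screenshot pixels


@dataclass
class FusedElement:
    node_id: int
    role: str
    name: str
    bbox: BBox
    states: List[str]


def fuse(ax_nodes: List[dict], boxes: Dict[int, BBox]) -> List[FusedElement]:
    """Join accessibility nodes to bounding boxes by a shared node id."""
    fused = []
    for node in ax_nodes:
        bbox = boxes.get(node["node_id"])
        if bbox is None:
            continue  # no box (off-screen or not rendered): nothing to point at
        flags = node.get("states", {})
        # Always state enabled/disabled explicitly, then any other true flags.
        states = ["disabled" if flags.get("disabled") else "enabled"]
        states += sorted(k for k, v in flags.items() if v and k != "disabled")
        fused.append(
            FusedElement(node["node_id"], node["role"], node.get("name", ""), bbox, states)
        )
    return fused


def to_prompt_lines(elements: List[FusedElement]) -> List[str]:
    """Render one compact text line per element for the LLM prompt."""
    return [
        f"{e.role}[{e.node_id}]: '{e.name}', bbox={e.bbox}, state={','.join(e.states)}"
        for e in elements
    ]


# Hypothetical sample data standing in for a real page.
AX_NODES = [
    {"node_id": 14, "role": "button", "name": "Submit", "states": {"disabled": False}},
    {"node_id": 15, "role": "checkbox", "name": "Remember me",
     "states": {"checked": True, "disabled": False}},
    {"node_id": 16, "role": "generic", "name": "", "states": {}},  # has no box
]
BOXES = {14: (120, 300, 80, 32), 15: (120, 260, 16, 16)}

elements = fuse(AX_NODES, BOXES)
lines = to_prompt_lines(elements)
payload = json.dumps([asdict(e) for e in elements])  # structured JSON variant

for line in lines:
    print(line)
```

The same `FusedElement` list can drive either output: the text lines or JSON go straight into the prompt, while the `bbox` and `node_id` fields are what a Set-of-Mark overlay would draw on the screenshot. Dropping nodes without a box is a design choice in this sketch; an agent that needs to scroll to off-screen elements would likely keep them and mark them as not visible.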
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T20:06:50.752978+00:00— report_created — created