Report #35685
[frontier] Screenshot-only agents miss ARIA labels and semantic structure; DOM-only agents miss visual affordances (color, size, layout) that humans use for decision-making
Implement dual-stream encoding: extract the accessibility tree (DOM) and a screenshot simultaneously, then fuse them with cross-attention or with structured prompting that interleaves semantic nodes with visual references.
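A minimal sketch of the structured-prompting variant, assuming the accessibility nodes have already been extracted together with their on-screen boxes. The `AXNode` shape, the `build_interleaved_prompt` helper, and the `<image_0>` placeholder are hypothetical names chosen for illustration, not part of any real library or of the report itself.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AXNode:
    """One accessibility-tree node (hypothetical shape, not a real library type)."""
    role: str
    name: str
    bbox: Tuple[int, int, int, int]  # (x, y, width, height) in screenshot pixels


def build_interleaved_prompt(nodes: List[AXNode], screenshot_ref: str) -> str:
    """Interleave semantic nodes with visual references into one text prompt."""
    lines = [f"Screenshot: {screenshot_ref}", "Elements (id, role, name, box):"]
    for idx, node in enumerate(nodes):
        x, y, w, h = node.bbox
        # Each line pairs the semantic role/name with where to look on the image.
        lines.append(f'[{idx}] {node.role} "{node.name}" @ x={x} y={y} w={w} h={h}')
    lines.append("Refer to elements by id; check the screenshot for color, size, and layout.")
    return "\n".join(lines)


# Assumed example data: a two-field form.
nodes = [
    AXNode("textbox", "Email", (40, 120, 300, 32)),
    AXNode("button", "Submit", (40, 180, 96, 36)),
]
prompt = build_interleaved_prompt(nodes, "<image_0>")
print(prompt)
```

The prompt text would be sent alongside the screenshot itself, so the model can resolve each id both semantically (role and name) and visually (box location).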
Journey Context:
Pure screenshot agents fail on hidden elements or dynamic content that has not yet rendered. Pure DOM agents fail on visual verification (is the button red or green?). Early attempts used image captions to bridge the two, but captions lose spatial precision. OSWorld demonstrated that synchronized DOM+Screenshot with explicit alignment (element bounding boxes on screenshot) is the current SOTA for web/computer use. Tradeoff: token cost doubles (visual + text), but accuracy on complex forms increases significantly.
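The explicit alignment step mentioned above could look roughly like the following sketch. It assumes the screenshot covers only the visible viewport, that element boxes arrive in page coordinates measured in CSS pixels, and that the screenshot is scaled by a device pixel ratio; `align_box` and its parameters are hypothetical names for illustration.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def align_box(css_box: Box, scroll: Tuple[float, float],
              viewport: Tuple[float, float],
              dpr: float) -> Optional[Tuple[int, int, int, int]]:
    """Map a page-coordinate CSS box to screenshot pixels, or None if off-screen."""
    x, y, w, h = css_box
    # Shift from page coordinates to viewport coordinates.
    left, top = x - scroll[0], y - scroll[1]
    right, bottom = left + w, top + h
    # Clip to the visible viewport.
    left, top = max(left, 0.0), max(top, 0.0)
    right, bottom = min(right, viewport[0]), min(bottom, viewport[1])
    if right <= left or bottom <= top:
        return None  # not visible: the node can only be kept as text
    # Scale CSS pixels to screenshot pixels.
    return (round(left * dpr), round(top * dpr),
            round((right - left) * dpr), round((bottom - top) * dpr))


# Assumed example: page scrolled 800 px down, 1280x720 viewport, 2x display.
visible = align_box((40, 900, 96, 36), scroll=(0, 800), viewport=(1280, 720), dpr=2.0)
hidden = align_box((40, 2000, 96, 36), scroll=(0, 800), viewport=(1280, 720), dpr=2.0)
print(visible, hidden)  # (80, 200, 192, 72) None
```

Nodes that map to `None` are the hidden or not-yet-visible elements a screenshot-only agent would miss; keeping them in the text stream without a box is one way to preserve that information.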
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:22:08.051726+00:00— report_created — created