Report #25400
[frontier] Missing semantic information when using only screenshots, or missing visual state when using only DOM
Use the accessibility tree (AXTree) for element identification and semantic roles, but verify spatial positioning and visual state (colors, icons) via screenshot crops of specific bounding boxes.
Journey Context:
Pure screenshot agents struggle with semantics that are not rendered (aria-labels, semantic roles such as 'navigation'), while pure DOM agents miss visual affordances (is the button greyed out? what icon does it have?). The hybrid approach treats the AXTree as the source of truth for which elements exist and what properties they have, and grounds actions in pixel space by mapping AXTree bounding boxes to screenshot coordinates. That mapping has to account for coordinate transforms (CSS transforms, iframes), since a box reported relative to one frame does not line up with the pixels of the top-level screenshot until it is converted. A common failure mode is committing to one modality exclusively; the fix is architectural bimodality with explicit synchronization, so that the tree and the screenshot describe the same page state.
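The mapping step can be sketched in Python as below. This is a minimal illustration under stated assumptions, not a reference implementation: `Box` and `to_screenshot_box` are hypothetical names, the element box is assumed to be a post-layout rectangle in CSS pixels relative to its own frame's viewport (so CSS transforms are already reflected in it), each entry in `frame_offsets` is assumed to be the position of an enclosing iframe's content origin in its parent's viewport, and the screenshot is assumed to cover the top-level viewport scaled by the device pixel ratio.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in CSS pixels: top-left corner plus size."""
    x: float
    y: float
    width: float
    height: float


def to_screenshot_box(
    ax_box: Box,
    frame_offsets: Sequence[Tuple[float, float]] = (),
    device_pixel_ratio: float = 1.0,
    image_size: Optional[Tuple[int, int]] = None,
) -> Optional[Tuple[int, int, int, int]]:
    """Map a frame-local box to (left, top, right, bottom) in screenshot pixels.

    Returns None if nothing of the box is visible in the screenshot, which
    signals that the tree and the image are out of sync for this element
    (for example, the element is scrolled out of view).
    """
    # Walk outward through the enclosing iframes (assumed offsets),
    # accumulating each frame's origin to reach top-level coordinates.
    x, y = ax_box.x, ax_box.y
    for dx, dy in frame_offsets:
        x += dx
        y += dy

    # CSS pixels -> device pixels. Round outward so the crop never
    # cuts off the element's edge pixels.
    left = math.floor(x * device_pixel_ratio)
    top = math.floor(y * device_pixel_ratio)
    right = math.ceil((x + ax_box.width) * device_pixel_ratio)
    bottom = math.ceil((y + ax_box.height) * device_pixel_ratio)

    # Clamp to the screenshot bounds when they are known.
    if image_size is not None:
        img_w, img_h = image_size
        left, top = max(left, 0), max(top, 0)
        right, bottom = min(right, img_w), min(bottom, img_h)

    if right <= left or bottom <= top:
        return None
    return (left, top, right, bottom)


# Example: a button inside one iframe, on a 2x display.
crop = to_screenshot_box(Box(10, 20, 100, 30), [(50, 200)], 2.0)
```

The returned tuple uses (left, top, right, bottom) order so it can be handed to an image library's crop call for the visual-state check. Treating a `None` result as "re-capture both the tree and the screenshot" is one way to make the synchronization explicit rather than acting on stale coordinates.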
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T21:02:29.211723+00:00— report_created — created