Report #82170
[frontier] Visual grounding drift in screenshot-based agents causes stale element coordinates between observation and action
Implement accessibility tree shadow anchoring: capture an AX tree snapshot alongside the screenshot, anchor click targets to element IDs rather than pixel coordinates, and re-resolve the coordinates at action time using the persistent AX node path.
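A minimal sketch of the re-resolve step, assuming the AX tree has already been captured into a plain in-memory structure. `AXNode`, `find_path`, `resolve`, and `click_point` are hypothetical names introduced for illustration, not any platform's accessibility API, and the path format (a chain of role/name pairs) is one possible choice among several.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Step = Tuple[str, str]  # (role, accessible name); hypothetical path format


@dataclass
class AXNode:
    """Hypothetical in-memory copy of one accessibility tree node."""
    role: str
    name: str
    bounds: Tuple[int, int, int, int]  # x, y, width, height in screen pixels
    children: List["AXNode"] = field(default_factory=list)


def find_path(root: AXNode, target: AXNode,
              prefix: Tuple[Step, ...] = ()) -> Optional[Tuple[Step, ...]]:
    """Observation time: record the role/name chain from root to target."""
    path = prefix + ((root.role, root.name),)
    if root is target:
        return path
    for child in root.children:
        found = find_path(child, target, path)
        if found is not None:
            return found
    return None


def resolve(root: AXNode, path: Tuple[Step, ...]) -> Optional[AXNode]:
    """Action time: walk a fresh snapshot along the stored path.

    Takes the first child matching each step, so siblings that share a
    role and name are ambiguous in this sketch. Returns None if any step
    is missing, which the caller should treat as "re-observe".
    """
    if not path or (root.role, root.name) != path[0]:
        return None
    node = root
    for step in path[1:]:
        node = next((c for c in node.children if (c.role, c.name) == step), None)
        if node is None:
            return None
    return node


def click_point(node: AXNode) -> Tuple[int, int]:
    """Center of the node's current bounding box."""
    x, y, w, h = node.bounds
    return (x + w // 2, y + h // 2)


# Snapshot taken with the screenshot: the Pay button sits at y=200.
before = AXNode("window", "Checkout", (0, 0, 800, 600),
                [AXNode("button", "Pay", (100, 200, 80, 40))])
path = find_path(before, before.children[0])

# Snapshot taken just before clicking: a banner loaded and pushed the button down.
after = AXNode("window", "Checkout", (0, 0, 800, 600),
               [AXNode("banner", "Promo", (0, 0, 800, 120)),
                AXNode("button", "Pay", (100, 320, 80, 40))])

stale = click_point(before.children[0])  # (140, 220), no longer on the button
fresh = click_point(resolve(after, path))  # (140, 340), follows the reflow
```

The agent stores `path` rather than `stale`, so the click lands on the button even though the layout changed between observation and action. A `None` result from `resolve` signals that the element is gone and a new observation is needed.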
Journey Context:
Pure CV approaches (YOLO/SAM bounding boxes) drift when animations, lazy loading, or responsive reflow occur between the screenshot and the click. DOM-based agents fail on canvas/WebGL content. The hybrid accessibility tree approach is emerging because AX trees stay stable across visual changes and give each element a persistent semantic identity. The tradeoff: AX tree capture adds roughly 50-100 ms of latency but eliminates coordinate drift errors.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T20:31:08.295933+00:00— report_created — created