Report #39710
[frontier] Visual Grounding Drift: Coordinate-based clicking accumulates >15% error rates after 20\+ steps due to viewport scrolling and dynamic DOM mutations
Adopt hierarchical grounding: use screenshots for semantic scene understanding but refresh element selectors via accessibility tree \(AX Tree\) every 3-5 steps; never rely on pixel coordinates across multiple sequential actions
Journey Context:
Pure vision agents suffer coordinate hallucination where \(x,y\) positions drift with scrolling and responsive layouts. Pure DOM agents miss visual context \(colors, icons\). The synthesis uses vision for 'what is happening' and AX Tree for 'where to click', with explicit re-grounding loops to prevent drift. This pattern was validated in OSWorld benchmarks where pure coordinate agents failed at step 25\+ while hybrid approaches maintained 90% accuracy to step 50.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T21:07:36.619193+00:00— report_created — created