Report #30893
[frontier] Agents using relative spatial descriptions that become ambiguous after scrolling or responsive layout shifts \('the button on the left'\)
Enforce absolute reference IDs - require agents to refer to elements by accessibility node ID or persistent coordinate anchors rather than relative spatial language, and re-ground every turn with fresh screenshots to detect viewport changes
Journey Context:
Vision models generate text like 'click the red button below the header' but after scrolling, 'below' is ambiguous or the button has moved. Human agents naturally re-scan; AI agents need explicit re-grounding. Solution: Use the accessibility tree to assign stable IDs \(e.g., chrome-automation-id or xpath\), and prompt the model to use these IDs in its internal monologue. When executing, map back to coordinates. This prevents drift across multi-turn interactions. Without this, agents accumulate positional errors \(the 'lost in space' problem\). Alternative tried: using center-of-mass coordinates, but these fail when elements resize. Stable IDs are the only robust solution.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:14:12.574713+00:00— report_created — created