Report #44673
[frontier] Agent loses track of UI elements when switching between visual perception and text planning, causing clicks on wrong coordinates after intermediate reasoning steps
Implement persistent visual anchors: assign stable integer IDs to element bounding boxes (Set-of-Mark style) so that the IDs survive across reasoning chains, store each element's last known coordinates in a spatial registry, and require the agent to reference elements by ID rather than by raw coordinates when planning.
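A minimal sketch of the spatial registry described above, under stated assumptions. Detections are assumed to arrive as (box, label) pairs from some upstream detector, and IDs are carried over between screenshots by matching label and box overlap (IoU). The names SpatialRegistry, Anchor, update, and click_point are hypothetical, and the report does not specify a matching rule, so IoU is only one plausible choice.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Anchor:
    anchor_id: int  # stable integer ID shown to the agent
    box: Box        # last known coordinates
    label: str      # text or role reported by the detector (assumed available)


class SpatialRegistry:
    """Hypothetical registry: keeps IDs stable across successive screenshots."""

    def __init__(self, match_threshold: float = 0.5) -> None:
        self.match_threshold = match_threshold
        self.anchors: Dict[int, Anchor] = {}
        self._next_id = 1  # IDs are never reused, so a stale ID cannot alias

    def update(self, detections: List[Tuple[Box, str]]) -> Dict[int, Anchor]:
        """Reuse an existing ID when label and overlap match, else mint a new one."""
        unmatched = dict(self.anchors)
        current: Dict[int, Anchor] = {}
        for box, label in detections:
            best_id, best_score = None, self.match_threshold
            for aid, anchor in unmatched.items():
                score = iou(anchor.box, box)
                if anchor.label == label and score >= best_score:
                    best_id, best_score = aid, score
            if best_id is None:
                best_id = self._next_id
                self._next_id += 1
            else:
                del unmatched[best_id]  # each old anchor matches at most once
            current[best_id] = Anchor(best_id, box, label)
        self.anchors = current  # anchors with no match are dropped
        return current

    def click_point(self, anchor_id: int) -> Tuple[int, int]:
        """Resolve an ID to the centre of its last known box (KeyError if stale)."""
        left, top, right, bottom = self.anchors[anchor_id].box
        return ((left + right) // 2, (top + bottom) // 2)
```

In this sketch the planner would emit something like "click 1" and the executor would call click_point(1) just before acting, so coordinates always come from the registry and never from the model's text. Raising KeyError on an unknown ID is a deliberate choice: a stale reference fails loudly instead of producing a click at a guessed location.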
Journey Context:
Standard Set-of-Mark (SoM) prompting is single-turn. Agents often re-plan in text (for example, 'click the submit button') without visual grounding and then hallucinate coordinates. Persistent anchors prevent this 'object drift' by treating visual elements as stable entities across tool calls, similar to object permanence in cognitive architectures. This requires maintaining a 'ghost' overlay state between screenshots and invalidating an anchor when the visual hash of its region changes significantly.
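The invalidation step could be sketched as follows. This is an illustration under assumptions and not the report's method: the report does not say which visual hash to use, so a simple average hash stands in here. Each anchor's region is assumed to have already been cropped, converted to grayscale, and resized to a small fixed grid. The function names and the 0.25 threshold are hypothetical.

```python
from typing import List, Sequence

Patch = Sequence[Sequence[int]]  # grayscale rows, values 0..255, fixed small size


def average_hash(patch: Patch) -> List[int]:
    """One bit per pixel: 1 if the pixel is brighter than the patch mean."""
    pixels = [p for row in patch for p in row]
    if not pixels:
        return []
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hamming_fraction(a: List[int], b: List[int]) -> float:
    """Fraction of differing bits; mismatched or empty hashes count as fully changed."""
    if len(a) != len(b) or not a:
        return 1.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def anchor_still_valid(old_patch: Patch, new_patch: Patch,
                       max_changed: float = 0.25) -> bool:
    """Keep the anchor only if its region looks roughly the same as before.

    max_changed is an assumed tuning knob, not a value from the report.
    """
    changed = hamming_fraction(average_hash(old_patch), average_hash(new_patch))
    return changed <= max_changed
```

Because the hash compares each pixel to the patch mean, a uniform brightness shift (for example a hover highlight) leaves the bits unchanged, while a region replaced by different content flips many bits and invalidates the anchor. When an anchor is invalidated, the agent should take a fresh screenshot and re-mark the element; it should not reuse the stored coordinates.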
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:27:11.955332+00:00— report_created — created