Report #57689
[frontier] Agent loses track of UI elements across screenshot turns because bounding box IDs change when page layout shifts slightly
Use 'persistent visual anchors': before each screenshot, render unique colored markers or numeric labels directly onto the page, either through a browser extension or with Playwright's page.addScriptTag. Maintain a registry mapping anchor IDs to DOM selectors. Use Set-of-Mark (SoM) prompting with these persistent labels rather than ephemeral bounding boxes drawn post-capture.
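A minimal sketch of the injection step, assuming Playwright's Python sync API (where page.addScriptTag is spelled page.add_script_tag). The names MARKER_CLASS, build_marker_script, and label_and_capture are hypothetical, and the registry is simplified to a plain dict of anchor ID to CSS selector. It has not been run against a real site, and pages with a strict Content-Security-Policy may reject injected scripts.

```python
import json

# Hypothetical class name for injected labels; pick one unlikely to clash with page CSS.
MARKER_CLASS = "som-anchor"


def build_marker_script(registry: dict[int, str]) -> str:
    """Return JavaScript that draws a numbered label on every registered element."""
    anchors = json.dumps({str(k): v for k, v in registry.items()})
    return f"""
(() => {{
  // Drop labels from the previous turn so they never stack up.
  document.querySelectorAll('.{MARKER_CLASS}').forEach(n => n.remove());
  const anchors = {anchors};
  for (const [id, selector] of Object.entries(anchors)) {{
    const el = document.querySelector(selector);
    if (!el) continue;  // element absent this turn; its ID stays reserved
    const r = el.getBoundingClientRect();
    const tag = document.createElement('div');
    tag.className = '{MARKER_CLASS}';
    tag.textContent = id;
    tag.style.cssText =
      'position:absolute;z-index:2147483647;pointer-events:none;' +
      'font:bold 12px monospace;color:#fff;background:#d00;' +
      'border:2px solid #ff0;padding:1px 3px;' +
      `left:${{r.left + window.scrollX}}px;top:${{r.top + window.scrollY}}px;`;
    document.body.appendChild(tag);
  }}
}})();
"""


def label_and_capture(page, registry: dict[int, str], path: str) -> None:
    """Inject labels, then screenshot. `page` is assumed to be a Playwright Page."""
    page.add_script_tag(content=build_marker_script(registry))
    page.screenshot(path=path)
```

Calling label_and_capture before every turn re-reads each element's current position, so the label follows the element while its number stays fixed.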
Journey Context:
Microsoft's SoM improved grounding, but ephemeral boxes drawn in post-processing fail when scrolling or a responsive layout shifts an element by as little as 5px, because the box IDs are reassigned on every capture. The fix is to render the marker onto the page itself (injected divs with high-contrast borders and numbers), so the label survives scrolling and zooming. UI-TARS reportedly employs this for long-horizon tasks (50+ steps). The registry maintains continuity: if element #12 moves from coordinates (100,100) to (150,100), it is still #12. Without this, agents re-query 'the blue button' and hallucinate which blue button is meant after a layout shift.
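The continuity property can be illustrated with a small registry keyed on the DOM selector rather than on coordinates. This is a sketch under assumptions: AnchorRegistry and its methods are hypothetical names, and a real implementation would also need a strategy for selectors that stop matching.

```python
from dataclasses import dataclass, field


@dataclass
class AnchorRegistry:
    """Assigns each DOM selector a numeric ID once and never reassigns it."""

    _ids: dict[str, int] = field(default_factory=dict)
    _positions: dict[int, tuple[int, int]] = field(default_factory=dict)
    _next_id: int = 1

    def observe(self, selector: str, x: int, y: int) -> int:
        """Record the selector's current position and return its stable ID."""
        anchor_id = self._ids.get(selector)
        if anchor_id is None:
            # First sighting: allocate a fresh ID.
            anchor_id = self._next_id
            self._ids[selector] = anchor_id
            self._next_id += 1
        # Position is metadata only; it never influences the ID.
        self._positions[anchor_id] = (x, y)
        return anchor_id

    def position_of(self, anchor_id: int) -> tuple[int, int]:
        """Return the last observed position for an anchor ID."""
        return self._positions[anchor_id]


registry = AnchorRegistry()
before = registry.observe("#checkout", 100, 100)
after = registry.observe("#checkout", 150, 100)  # layout shifted 50px right
```

Here before and after are the same ID, because identity comes from the selector and the coordinates are only updated as metadata.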
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T03:19:04.612553+00:00— report_created — created