Report #35687
[frontier] Absolute coordinate prediction (x: 450, y: 300) drifts catastrophically across multi-step tasks because small prediction errors compound and UI layouts shift
Predict actions as relative offsets from visual anchors (e.g., "click 50px right of the 'Submit' button" or "scroll down from the current mouse position") rather than as absolute screen coordinates.
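A minimal sketch of the idea, not taken from any particular agent: the `Box` type and the `click_right_of` helper are hypothetical names introduced for illustration, and the bounding boxes are assumed to come from some element detector that is not shown. The point is that the action stores only the anchor and the offset, and the screen point is resolved at execution time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of a detected UI element, in screen pixels (hypothetical type)."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def center_y(self) -> float:
        return self.top + self.height / 2


def click_right_of(anchor: Box, offset_px: float) -> tuple[float, float]:
    """Resolve 'click offset_px right of the anchor' to a screen point."""
    return (anchor.right + offset_px, anchor.center_y)


# Assumed detector output for the 'Submit' button before and after a layout
# shift (for example, a banner that pushes the button down by 60px).
submit_before = Box(left=400, top=280, width=100, height=40)
submit_after = Box(left=400, top=340, width=100, height=40)

# The same relative action follows the button to its new position.
target_before = click_right_of(submit_before, 50)
target_after = click_right_of(submit_after, 50)
print(target_before, target_after)
```

An absolute prediction made before the shift would still point at the old location, whereas the relative action is re-resolved against wherever the anchor is detected now.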
Journey Context:
Early GUI agents relied on element IDs or text matching, but these fail on canvas-based or highly dynamic UIs. Pure coordinate agents suffer from 'coordinate drift': if step 1 is off by 20px, step 2 builds on the wrong position, and the accumulated error can cause catastrophic failure by step 5. UI-TARS introduced 'anchored' or 'relative' coordinate systems, in which the model predicts deltas from detected UI elements or from the previous cursor position. This is more robust to resolution changes, window resizing, and layout shifts. Tradeoff: the approach requires accurate element detection first, but it fails more gracefully than absolute coordinates.
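The compounding described above can be illustrated with a toy model. This is a deliberately simplified sketch under loud assumptions: every step is assumed to miss by the same amount in the same direction (a worst case), and re-detecting an anchor is assumed to discard all error carried over from earlier steps. The function names are hypothetical and the numbers are illustrative, not measurements.

```python
def chained_errors(per_step_error_px: float, steps: int) -> list[float]:
    """Error per step when each step builds on the previous (wrong) position.

    Assumption: errors share a direction, so they add up linearly.
    """
    errors = []
    total = 0.0
    for _ in range(steps):
        total += per_step_error_px
        errors.append(total)
    return errors


def anchored_errors(per_step_error_px: float, steps: int) -> list[float]:
    """Error per step when each step is measured from a freshly detected anchor.

    Assumption: re-anchoring resets inherited error, so only the current
    step's own error remains.
    """
    return [per_step_error_px for _ in range(steps)]


# With a 20px miss per step, the chained error grows every step, while the
# anchored error stays at the per-step miss.
print(chained_errors(20, 5))
print(anchored_errors(20, 5))
```

Note that deltas taken from the previous cursor position, rather than from a detected element, still chain in this model; only re-anchoring on a detected element bounds the error.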
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T14:22:56.605737+00:00— report_created — created