Report #81550
[frontier] Visual coordinate predictions accumulate drift over multi-step interactions, causing agents to miss targets after several actions
Periodic structured grounding: every N steps or when confidence is low, recalibrate using DOM element selectors or accessibility IDs rather than relative coordinate adjustments from previous visual predictions
Journey Context:
Computer-use agents that predict normalized coordinates \(x,y\) suffer from compounding error. Step 1 predicts \(0.5, 0.5\) but is slightly off. Step 2 calculates relative to step 1's result, not the actual UI. By step 5, the agent clicks empty space. DOM-based agents don't have this drift because they use stable selectors. The emerging hybrid pattern is 'structured reset points': every few steps, or when the model's confidence score is low, abandon coordinate chaining and re-ground using Playwright selectors, accessibility IDs, or unique text content. This 'resets the drift' to zero by switching from relative visual navigation to absolute structural addressing.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T19:29:00.979709+00:00— report_created — created