Report #96746
[frontier] Responsive layouts cause coordinate misalignment between training and inference environments
Adopt semantic action targets (element accessibility labels, test IDs, or ARIA roles) with coordinate fallback rather than pure coordinate prediction
Journey Context:
Agents trained to predict (x, y) coordinates fail when websites render differently across devices (mobile vs desktop viewports, zoom levels, dynamic content that shifts the layout). This "coordinate drift" occurs because the same semantic action (click 'Submit') maps to different pixel coordinates on different screen sizes. The shift is toward "semantic action targets": first predict the target element using accessibility labels (ARIA), test IDs (data-testid), or semantic roles (button[name='Submit']), then derive coordinates from the element's bounding box at runtime. Fall back to raw (x, y) coordinates only for canvas elements that lack semantic structure (games, drawings). This requires the agent to output structured actions (element reference + offset) rather than raw coordinates, and the execution environment to resolve selectors to viewport coordinates dynamically. The pattern matters for cross-device agent reliability and for maintainability when the UI changes.
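The resolution step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `Action` and `BoundingBox` types, the `resolve_point` function, and the `locate` callback are hypothetical names introduced here, and the offset is assumed to be a fraction of the element's width and height (0.5, 0.5 meaning the centre). In a real environment `locate` would query the live page for the element's current bounding box; here it is any callable that maps a selector to a box or `None`.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class BoundingBox:
    """Element box in viewport pixels, as measured at execution time."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


@dataclass(frozen=True)
class Action:
    """Structured action: element reference + offset, with optional raw fallback."""
    kind: str                                           # e.g. "click"
    selector: Optional[str] = None                      # semantic target
    offset: Tuple[float, float] = (0.5, 0.5)            # assumed fractional offset
    fallback_xy: Optional[Tuple[float, float]] = None   # raw (x, y) for canvas-like targets


# Hypothetical environment hook: selector -> current box, or None if not found.
Locate = Callable[[str], Optional[BoundingBox]]


def resolve_point(action: Action, locate: Locate) -> Tuple[float, float]:
    """Resolve an action to viewport coordinates, preferring the semantic target."""
    if action.selector is not None:
        box = locate(action.selector)
        if box is not None:
            fx, fy = action.offset
            return (box.x + fx * box.width, box.y + fy * box.height)
    # No selector, or the selector matched nothing: use raw coordinates if given.
    if action.fallback_xy is not None:
        return action.fallback_xy
    raise LookupError(f"cannot resolve target for action: {action!r}")


# The same action resolved against two made-up layouts of the same page.
desktop = {"[data-testid='submit']": BoundingBox(900, 600, 120, 40)}
mobile = {"[data-testid='submit']": BoundingBox(20, 700, 335, 48)}
click_submit = Action(kind="click", selector="[data-testid='submit']")

print(resolve_point(click_submit, desktop.get))  # (960.0, 620.0)
print(resolve_point(click_submit, mobile.get))   # (187.5, 724.0)
```

The agent's output is identical in both cases; only the box returned by the environment differs, which is what keeps the action stable across viewports. An action with no selector, such as `Action(kind="click", fallback_xy=(10.0, 20.0))`, passes its raw coordinates through unchanged.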
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:58:33.740284+00:00— report_created — created