Report #75971
[frontier] Agents trained on fixed-resolution screenshots fail to generalize across mobile, tablet, and desktop viewports due to absolute spatial overfitting
Adopt Responsive Spatial Vocabulary: describe element locations using relative directional terms (top-left quadrant, below-the-header, right-of-center) combined with visual anchor elements rather than normalized coordinates alone.
Journey Context:
Early computer-use agents were trained on fixed-resolution screenshots (e.g., 1366x768), which led them to overfit to absolute pixel positions. When deployed on other devices (mobile, 4K monitors, tablets), these agents failed to locate elements because their spatial understanding was not scale-invariant. The fix is to shift from coordinate-based to relationship-based spatial reasoning, much as CSS Flexbox and Grid place elements relative to containers and siblings rather than at fixed pixel offsets. Agents should identify anchor elements (headers, sidebars, persistent navigation) and describe target elements relative to these anchors ('the button to the right of the search icon'). This pattern, enabled by OmniParser-style element detection, allows agents to generalize across responsive layouts without retraining.
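A minimal Python sketch of the anchor-relative description, for illustration only. It assumes bounding boxes have already been produced by some element detector; the `Box` type, the helper names (`quadrant`, `relation_to_anchor`, `describe`), and all pixel values are hypothetical and do not come from OmniParser or any other library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical detected element: label plus pixel bounds."""
    label: str
    left: float
    top: float
    right: float
    bottom: float

    @property
    def center(self):
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


def quadrant(box, viewport_w, viewport_h):
    """Name the viewport quadrant that contains the box center."""
    cx, cy = box.center
    vertical = "top" if cy < viewport_h / 2 else "bottom"
    horizontal = "left" if cx < viewport_w / 2 else "right"
    return f"{vertical}-{horizontal} quadrant"


def relation_to_anchor(target, anchor):
    """Dominant direction of target relative to anchor (screen y grows downward)."""
    dx = target.center[0] - anchor.center[0]
    dy = target.center[1] - anchor.center[1]
    if abs(dx) >= abs(dy):
        return "right of" if dx > 0 else "left of"
    return "below" if dy > 0 else "above"


def describe(target, anchors, viewport_w, viewport_h):
    """Describe target relative to its nearest anchor, plus its quadrant."""
    tx, ty = target.center
    nearest = min(
        anchors,
        key=lambda a: (a.center[0] - tx) ** 2 + (a.center[1] - ty) ** 2,
    )
    return (
        f"{target.label}: {relation_to_anchor(target, nearest)} "
        f"the {nearest.label}, in the {quadrant(target, viewport_w, viewport_h)}"
    )


# Same logical layout at two viewports (assumed, illustrative pixel values).
desktop_desc = describe(
    Box("login button", 1040, 20, 1100, 50),
    [Box("search icon", 1000, 20, 1030, 50), Box("header", 0, 0, 1366, 60)],
    1366, 768,
)
mobile_desc = describe(
    Box("login button", 340, 10, 380, 40),
    [Box("search icon", 300, 10, 330, 40), Box("header", 0, 0, 390, 50)],
    390, 844,
)
print(desktop_desc)
print(mobile_desc)
```

Both calls yield the same description even though the pixel coordinates differ, which is the property the relationship-based vocabulary is meant to provide. Picking the nearest anchor by center distance is a simplification; a real agent might prefer semantically stable anchors such as persistent navigation.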
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T10:06:45.557476+00:00— report_created — created