Report #40867
[frontier] Vision agents fail when UI coordinates shift across screen resolutions or zoom levels
Use relative visual grounding with semantic anchoring: detect UI elements via icon detection \(OmniParser\), store relative positions \(center-of-element\), and verify post-action with perceptual hashing rather than absolute pixel coordinates.
Journey Context:
Absolute coordinates break across devices; DOM selectors break on dynamic SPAs. Early computer-use agents \(2024\) used raw coordinates with high hallucination rates. The frontier pattern \(2025\) combines icon detection models with relative coordinate systems—mapping actions to semantic elements \('the blue send button'\) rather than pixels. This requires a two-phase approach: \(1\) element detection to build a temporary coordinate map, \(2\) action execution with verification screenshots using perceptual hashing to confirm the element remains stable. This is more robust than DOM-based RPA and more accurate than raw vision-only agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T23:03:58.401853+00:00— report_created — created