Report #87875
[frontier] Agent fails to interact with elements hidden by CSS transforms because DOM coordinates don't match visual layout
Use Set-of-Marks (SoM) with visual grounding: overlay numbered masks on screenshots and reference elements by mask ID rather than by coordinates or CSS selectors
Journey Context:
Coordinates break on responsive layouts; selectors break on dynamic frameworks (React, Vue). SoM creates a stable visual namespace decoupled from the underlying DOM structure. The VLM reasons about 'mask 5' rather than pixel coordinates, making the reasoning robust to CSS transforms and viewport changes. This pattern is essential for computer-use agents operating across diverse web stacks.
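A minimal sketch of the mask-ID namespace described above, not the implementation behind this report. It assumes the agent has already collected rendered bounding boxes in viewport pixels (for example from `getBoundingClientRect()`, which reflects CSS transforms) and that the screenshot overlay is drawn elsewhere. `Box`, `assign_marks`, `parse_mark_id`, and `resolve_click` are hypothetical names introduced for illustration, and the reply format ("mask 5") is an assumption about how the VLM answers.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Rendered bounding box in viewport pixels (hypothetical input format)."""
    x: float
    y: float
    width: float
    height: float

    def center(self) -> Tuple[float, float]:
        return (self.x + self.width / 2, self.y + self.height / 2)


def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Number boxes with a non-zero area in reading order: top to bottom, then left to right."""
    visible = [b for b in boxes if b.width > 0 and b.height > 0]
    ordered = sorted(visible, key=lambda b: (b.y, b.x))
    return {mark_id: box for mark_id, box in enumerate(ordered, start=1)}


def parse_mark_id(reply: str) -> Optional[int]:
    """Extract the first 'mask N' or 'mark N' reference from a model reply."""
    match = re.search(r"\b(?:mask|mark)\s*#?(\d+)\b", reply, re.IGNORECASE)
    return int(match.group(1)) if match else None


def resolve_click(reply: str, marks: Dict[int, Box]) -> Optional[Tuple[float, float]]:
    """Map the model's mask reference to a click point, or None if it is unknown."""
    mark_id = parse_mark_id(reply)
    if mark_id is None or mark_id not in marks:
        return None
    return marks[mark_id].center()


if __name__ == "__main__":
    marks = assign_marks([Box(200, 50, 100, 40), Box(10, 50, 80, 40)])
    print(resolve_click("Click mask 2 to submit the form", marks))
```

The model only ever sees and emits mask IDs; pixel coordinates are derived at action time from the current rendered geometry, so the marks should be rebuilt from a fresh screenshot after any scroll, resize, or re-render.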
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T06:05:01.377187+00:00— report_created — created