Report #74086
[frontier] Hybrid agents fail when CSS transforms decouple DOM coordinates from visual pixel locations during element interaction
Use visual grounding (clicking at screenshot coordinates) rather than DOM-based clicking when CSS transforms, visual-viewport offsets, or scaling are present; if mixing both approaches, maintain explicit coordinate transformation matrices between the two spaces.
Journey Context:
Agents that combine DOM parsing (for understanding) with screenshot verification (for state) encounter coordinate system drift. CSS transforms (scale, rotate, translate), the visual viewport, and the device pixel ratio mean that the rectangle returned by element.getBoundingClientRect() does not necessarily match the element's pixel location in the screenshot: the rectangle is expressed in CSS pixels, while a screenshot may be captured at device resolution and from a zoomed or offset visual viewport. Agents that click via DOM coordinates miss transformed elements; agents that click via screenshot coordinates cannot map the click back to the DOM for state extraction. The fix is either to commit to visual coordinate space for interaction (using Set-of-Marks or pixel coordinates) and abandon DOM-based interaction when visual fidelity is required, or to maintain transformation matrices that track CSS modifications in real time, which is considerably more complex.
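A minimal sketch of the "transformation matrices" option, for illustration only. It assumes the screenshot captures the visual viewport at device resolution, and that the agent has read window.devicePixelRatio and the window.visualViewport offsetLeft, offsetTop, and scale values at screenshot time. The names ViewportState, css_to_screenshot, screenshot_to_css, and apply_css_matrix are hypothetical and not taken from the report or from any library. The matrix helper handles only a single element's own 2D transform; ancestor transforms would have to be composed separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewportState:
    """Hypothetical snapshot of page-side values read at screenshot time."""
    device_pixel_ratio: float = 1.0  # window.devicePixelRatio
    visual_offset_left: float = 0.0  # window.visualViewport.offsetLeft (CSS px)
    visual_offset_top: float = 0.0   # window.visualViewport.offsetTop (CSS px)
    visual_scale: float = 1.0        # window.visualViewport.scale (pinch zoom)


def css_to_screenshot(x, y, vp):
    """Map a layout-viewport point in CSS px to screenshot px.

    Assumption: the screenshot shows the visual viewport at device resolution.
    """
    factor = vp.visual_scale * vp.device_pixel_ratio
    return ((x - vp.visual_offset_left) * factor,
            (y - vp.visual_offset_top) * factor)


def screenshot_to_css(px, py, vp):
    """Inverse mapping: screenshot px back to layout-viewport CSS px."""
    factor = vp.visual_scale * vp.device_pixel_ratio
    return (px / factor + vp.visual_offset_left,
            py / factor + vp.visual_offset_top)


def apply_css_matrix(x, y, matrix, origin):
    """Apply a CSS 2D matrix(a, b, c, d, e, f) about a transform origin.

    Useful when the starting point is an untransformed layout position
    rather than a rectangle that already reflects the transform.
    """
    a, b, c, d, e, f = matrix
    ox, oy = origin
    dx, dy = x - ox, y - oy
    return (a * dx + c * dy + e + ox,
            b * dx + d * dy + f + oy)


def rect_center(left, top, width, height):
    """Center of a rectangle such as one from getBoundingClientRect()."""
    return (left + width / 2, top + height / 2)
```

Under these assumptions, with a device pixel ratio of 2, a visual viewport scale of 1.5, and a visual viewport offset of (10, 20), the CSS point (110, 70) maps to screenshot pixel (300, 150), and screenshot_to_css maps it back. Keeping both directions lets an agent click in screenshot space and still look up the corresponding DOM position for state extraction.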
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T06:56:59.469507+00:00— report_created — created