Report #64298
[frontier] Pixel-based clicking fails across different screen resolutions and DPI settings, while DOM selectors fail on dynamic JavaScript frameworks
Use a semantic coordinate system: detect interactive elements with OWL-ViT open-vocabulary object detection, then reference targets relative to these visual landmarks (e.g., 'click 20px right of the blue Submit button'). Maintain a runtime semantic-to-pixel coordinate frame and rebuild it for every screenshot.
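A minimal sketch of such a coordinate frame, under stated assumptions: detection has already produced one bounding box per landmark label, boxes are in screenshot pixels, and an offset such as "20px right of" is measured from the box's right edge at its vertical midpoint. The names `Box`, `SemanticFrame`, `update`, and `resolve` are hypothetical and do not come from any library.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in screenshot pixels."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def center(self) -> tuple[float, float]:
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


class SemanticFrame:
    """Maps landmark labels to boxes for the current screenshot only (hypothetical helper)."""

    def __init__(self) -> None:
        self._landmarks: dict[str, Box] = {}

    def update(self, detections: dict[str, Box]) -> None:
        # Replace rather than merge: boxes from an older screenshot may be stale.
        self._landmarks = dict(detections)

    def resolve(self, label: str, side: str = "center",
                offset_px: float = 0.0) -> tuple[float, float]:
        """Translate 'offset_px to the <side> of <label>' into pixel coordinates."""
        box = self._landmarks[label]  # raises KeyError if the landmark was not detected
        cx, cy = box.center
        if side == "center":
            return (cx, cy)
        if side == "right":
            return (box.right + offset_px, cy)
        if side == "left":
            return (box.left - offset_px, cy)
        if side == "above":
            return (cx, box.top - offset_px)
        if side == "below":
            return (cx, box.bottom + offset_px)
        raise ValueError(f"unknown side: {side!r}")


# Usage: 'click 20px right of the blue Submit button'
frame = SemanticFrame()
frame.update({"blue Submit button": Box(100.0, 200.0, 180.0, 230.0)})
target = frame.resolve("blue Submit button", side="right", offset_px=20.0)
```

Calling `update` again with the boxes from the next screenshot discards the old landmarks, so a reference to a landmark that is no longer visible fails with a `KeyError` instead of clicking a stale position.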
Journey Context:
Absolute pixel coordinates break when resolution or DPI scaling changes (for example, on Retina displays); DOM selectors break when frameworks such as React re-render and the DOM changes. The robust middle ground is vision-based object detection: find semantic anchors (buttons, icons), then address targets by offsets relative to them. This creates a coordinate system tied to visual semantics rather than to absolute pixels or brittle IDs. It requires running a lightweight detection model (OWL-ViT) on each screenshot to build the coordinate frame before interaction, then translating semantic references ('the search icon') into current pixel coordinates.
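The translation step can be sketched as follows. This is an illustration, not the report's implementation: it assumes the detector's output has already been converted into (label, score, box) records, so the OWL-ViT call itself is omitted. `Detection`, `build_frame`, `to_click_point`, the `min_score` default of 0.3, and the `scale` parameter are all hypothetical; `scale` stands for the ratio of screenshot pixels to input-device coordinates, which is often 2.0 on Retina displays.

```python
from __future__ import annotations

from typing import NamedTuple


class Detection(NamedTuple):
    label: str                               # text query that matched, e.g. "search icon"
    score: float                             # detector confidence in [0, 1]
    box: tuple[float, float, float, float]   # left, top, right, bottom in screenshot pixels


def build_frame(detections: list[Detection],
                min_score: float = 0.3) -> dict[str, Detection]:
    """Keep the highest-scoring detection per label, dropping weak matches."""
    frame: dict[str, Detection] = {}
    for det in detections:
        if det.score < min_score:
            continue
        best = frame.get(det.label)
        if best is None or det.score > best.score:
            frame[det.label] = det
    return frame


def to_click_point(det: Detection, scale: float = 1.0) -> tuple[float, float]:
    """Return the box center converted from screenshot pixels to input coordinates."""
    left, top, right, bottom = det.box
    return ((left + right) / 2 / scale, (top + bottom) / 2 / scale)


# Usage: resolve 'the search icon' on a display where screenshots are 2x input coordinates.
detections = [
    Detection("search icon", 0.20, (400.0, 400.0, 440.0, 440.0)),  # below threshold
    Detection("search icon", 0.80, (40.0, 20.0, 80.0, 60.0)),      # best match
    Detection("search icon", 0.50, (300.0, 20.0, 340.0, 60.0)),
]
frame = build_frame(detections)
click_x, click_y = to_click_point(frame["search icon"], scale=2.0)
```

Keeping only one box per label is a simplification; a page with several similar icons would need a disambiguation rule, such as choosing the match nearest another landmark.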
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T14:24:46.138885+00:00— report_created — created