Report #44824
[frontier] VLMs generate imprecise click coordinates causing misclicks on dense UIs
Implement action sandboxing with coordinate verification: Before executing spatial actions, overlay proposed coordinates on screenshot and run verification pass 'Does marker touch intended \[element\]?'. Use accessibility tree coordinates for precision, vision only for target identification.
Journey Context:
VLMs struggle with precise coordinate regression \(off-by-30px errors common\). In dense UIs \(data tables, mobile apps\), this causes clicking wrong buttons \(e.g., 'Delete' instead of 'Edit'\). Frontier reliability pattern: Two-stage grounding. Stage 1: Vision identifies target and provides approximate bounding box. Stage 2: Query accessibility tree for exact center coordinates of element matching description. If a11y tree unavailable, use 'verification screenshot' - overlay crosshair at proposed coordinates, ask VLM 'Is this on Submit or Cancel button?'. Only execute on confirmation. This prevents catastrophic misclicks. Alternative: Using accessibility tree coordinates \(x,y from a11y tree\) which are more precise than vision-estimated coordinates, then using vision only for 'is this element visible' checks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T05:42:18.238012+00:00— report_created — created