Report #68729
[frontier] Vision-language model agents hallucinate clickable coordinates on screenshots, generating invalid (x, y) pairs that miss buttons or click on background pixels
Use accessibility tree snapshots (AXTree) instead of raw pixels, or implement 'element grounding', where the model outputs element IDs from the accessibility tree rather than coordinates
Journey Context:
Raw screenshot + coordinate prediction fails because a 1:1 pixel mapping ignores responsive scaling, dynamic viewports, and z-index layering. Many teams try 'visual prompting' with coordinates overlaid on the screenshot, but the model can still hallucinate positions. Switching to accessibility trees (such as Playwright's page.accessibility.snapshot()) provides structural grounding: the model chooses among elements that actually exist instead of inventing a position. Tradeoff: the AXTree omits visual styling (colors, icons), so hybrid approaches (AXTree for structure, screenshot for visual verification) are emerging.
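A minimal sketch of element grounding, assuming the accessibility snapshot is a nested dict with "role", "name", and "children" keys (the general shape of a Playwright accessibility snapshot; verify against your version). The names INTERACTIVE_ROLES, flatten_axtree, render_prompt, and resolve_action are hypothetical helpers for illustration, not part of any library. The point is that the model picks an ID from a closed list, so a hallucinated target can be detected and rejected before anything is clicked.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GroundedElement:
    element_id: str
    role: str
    name: str


# Assumption: the roles worth exposing to the model; adjust per application.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox", "menuitem"}

# Hand-written stand-in for a real snapshot, so the sketch runs without a browser.
SAMPLE_SNAPSHOT = {
    "role": "WebArea",
    "name": "Checkout",
    "children": [
        {"role": "heading", "name": "Your cart"},
        {"role": "textbox", "name": "Promo code"},
        {
            "role": "group",
            "name": "",
            "children": [
                {"role": "button", "name": "Apply"},
                {"role": "button", "name": "Place order"},
            ],
        },
    ],
}


def flatten_axtree(
    node: dict, out: Optional[Dict[str, GroundedElement]] = None
) -> Dict[str, GroundedElement]:
    """Walk the tree depth-first and give each interactive node a short ID."""
    if out is None:
        out = {}
    if node.get("role") in INTERACTIVE_ROLES:
        element_id = f"e{len(out) + 1}"
        out[element_id] = GroundedElement(element_id, node["role"], node.get("name", ""))
    for child in node.get("children", []):
        flatten_axtree(child, out)
    return out


def render_prompt(elements: Dict[str, GroundedElement]) -> str:
    """Format the candidate list that is shown to the model instead of pixels."""
    return "\n".join(
        f'[{e.element_id}] {e.role} "{e.name}"' for e in elements.values()
    )


def resolve_action(
    model_output: str, elements: Dict[str, GroundedElement]
) -> Optional[GroundedElement]:
    """Return the chosen element, or None when the model named an unknown ID."""
    return elements.get(model_output.strip())


if __name__ == "__main__":
    elements = flatten_axtree(SAMPLE_SNAPSHOT)
    print(render_prompt(elements))
    print(resolve_action("e3", elements))  # a valid choice
    print(resolve_action("e9", elements))  # a hallucinated ID resolves to None
```

A None result can trigger a re-prompt rather than a blind click. In a hybrid setup, the resolved element could then be located on the page by role and name and checked against a screenshot, which is where the visual information missing from the AXTree comes back in.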
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T21:50:45.854695+00:00— report_created — created