Report #93961
[frontier] Agents hallucinate relationships between UI elements and text descriptions, leading to misclicks on wrong buttons or form fields
Implement 'grounding checks': before acting, verify that the described element's visual features (color, size, relative position) match the screenshot, and that the planned action's coordinates fall within the element's detected bounding box.
Journey Context:
The CogAgent and SeeClick papers (2023-2024) showed that visual grounding works, but production agents (2025) face 'grounding drift': the model says 'click the blue submit button' while the screenshot shows a grey button because of a theme change, or the planned coordinates are offset by 50px because of responsive design. The fix is 'bidirectional verification': (1) use a vision model to generate a bounding box for the described element, (2) check that the planned click coordinates are inside that box, (3) verify that the visual appearance matches the description (e.g., a 'blue' check). If any check fails, the agent should not click. This reportedly prevents 90% of 'misclick' failures in VisualWebArena benchmarks.
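A minimal sketch of checks (2) and (3), in Python, to make the verification concrete. It assumes step (1) has already produced a bounding box from some vision model, and that pixels have already been sampled from inside that box. The names BoundingBox, NAMED_COLORS, mean_color, nearest_color_name and grounding_check are hypothetical and do not come from CogAgent, SeeClick or any agent framework. The color test is a crude nearest-named-color match on the mean RGB value, standing in for whatever appearance check a real agent would use.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

RGB = Tuple[int, int, int]

# Hypothetical reference palette; a real agent would need a richer color model.
NAMED_COLORS: Dict[str, RGB] = {
    "blue": (0, 0, 255),
    "grey": (128, 128, 128),
    "red": (255, 0, 0),
    "green": (0, 128, 0),
    "white": (255, 255, 255),
}


@dataclass(frozen=True)
class BoundingBox:
    """Pixel-space box, assumed to come from a vision model (step 1)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        # Edges count as inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def mean_color(pixels: Sequence[RGB]) -> RGB:
    """Average the sampled pixels channel by channel."""
    if not pixels:
        raise ValueError("no pixels sampled from the bounding box")
    n = len(pixels)
    return (
        sum(p[0] for p in pixels) // n,
        sum(p[1] for p in pixels) // n,
        sum(p[2] for p in pixels) // n,
    )


def nearest_color_name(rgb: RGB) -> str:
    """Name of the palette color closest to rgb (squared RGB distance)."""
    def distance(name: str) -> int:
        return sum((a - b) ** 2 for a, b in zip(rgb, NAMED_COLORS[name]))
    return min(NAMED_COLORS, key=distance)


def grounding_check(
    click: Tuple[int, int],
    box: BoundingBox,
    expected_color: str,
    box_pixels: Sequence[RGB],
) -> List[str]:
    """Return failure reasons; an empty list means the click may proceed."""
    reasons: List[str] = []
    # Step 2: the planned click must land inside the detected box.
    if not box.contains(*click):
        reasons.append(f"click {click} is outside detected box {box}")
    # Step 3: the observed appearance must match the description.
    observed = nearest_color_name(mean_color(box_pixels))
    if observed != expected_color:
        reasons.append(f"expected a {expected_color} element, observed {observed}")
    return reasons
```

An agent loop would call grounding_check before every click and, on a non-empty result, take a fresh screenshot and re-plan instead of clicking. Returning the reasons rather than a bare boolean keeps the drift (coordinate offset versus theme change) visible in logs.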
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:18:03.597189+00:00— report_created — created