Report #52967
[frontier] Vision-language agents hallucinate UI elements that look like familiar patterns but are actually static images or background textures (phantom objects)
Implement affordance-aware verification: before clicking, perform a lightweight 'hover' action and check for a cursor state change, confirming that the element is actually interactive rather than merely looking interactive
Journey Context:
Vision-language models trained on web screenshots learn spurious correlations (e.g., 'rectangles with rounded corners are buttons'). In screenshot-only agents, this leads to attempted clicks on logo images or banner ads. A common mistake is to trust bounding-box detectors without verifying interaction. Affordance verification uses the environment's actual response (cursor change, DOM events) to ground perception in physical interaction, preventing 15-20% of hallucinated clicks in production agents.
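The hover-then-click check described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation. The `Environment` protocol, `HoverResult`, `verified_click`, and `FakePage` are hypothetical names introduced here, and the set of cursor values treated as interactive is an assumption that would need tuning per site. In a real agent, `hover` and `click` would wrap whatever browser driver is in use.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple

# Assumption: CSS cursor values taken as a sign that the element is interactive.
INTERACTIVE_CURSORS = frozenset({"pointer", "text", "grab"})


@dataclass(frozen=True)
class HoverResult:
    """What the environment reported after a hover (hypothetical shape)."""
    cursor: str                       # cursor style under the pointer
    dom_events: Tuple[str, ...] = ()  # handlers that fired, e.g. ("mouseenter",)


class Environment(Protocol):
    """Hypothetical adapter over a browser driver."""
    def hover(self, x: int, y: int) -> HoverResult: ...
    def click(self, x: int, y: int) -> None: ...


def is_interactive(probe: HoverResult) -> bool:
    # Accept either signal: an interactive cursor or any DOM event response.
    return probe.cursor in INTERACTIVE_CURSORS or len(probe.dom_events) > 0


def verified_click(env: Environment, x: int, y: int) -> bool:
    """Click only if a hover at (x, y) shows evidence of interactivity.

    Returns True if the click was issued, False if the target looked like
    a phantom object and the click was skipped.
    """
    probe = env.hover(x, y)
    if not is_interactive(probe):
        return False
    env.click(x, y)
    return True


class FakePage:
    """In-memory stand-in for a real page, for demonstration only."""

    def __init__(self, hotspots):
        self.hotspots = hotspots  # maps (x, y) to a HoverResult
        self.clicks = []          # records every click actually issued

    def hover(self, x, y):
        # Coordinates without a hotspot behave like a static image.
        return self.hotspots.get((x, y), HoverResult(cursor="default"))

    def click(self, x, y):
        self.clicks.append((x, y))
```

With `page = FakePage({(10, 10): HoverResult(cursor="pointer")})`, calling `verified_click(page, 10, 10)` issues the click, while `verified_click(page, 50, 50)` skips it because the hover shows only the default cursor. Note that the cursor heuristic can miss genuinely clickable elements that keep the default cursor, which is why the sketch also accepts DOM events as evidence.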
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T19:24:09.969553+00:00— report_created — created