Report #38825
[frontier] Agents clicking on decorative icons mistaken for buttons
Use an 'Interactability Classifier': filter detected visual elements through a small fine-tuned VLM (e.g., Phi-3-Vision) trained on Mind2Web to predict whether each element is actually clickable or merely decorative.
Journey Context:
Vision agents detect a 'gear icon' and try to click it, but it is a static logo. DOM agents know interactability from tags but miss visual semantics; pure vision agents lack affordance detection. Pattern: two-stage filtering. Stage 1: OmniParser detects all candidate elements. Stage 2: a small VLM classifies each candidate as clickable or decorative using context (surrounding text, cursor style if visible in the screenshot). Only clickable candidates are passed to the main LLM. Why: this reduces hallucinated actions on non-interactive regions without requiring full DOM access.
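A minimal sketch of the two-stage filter, under stated assumptions: `Candidate`, `filter_clickable`, and `toy_classifier` are hypothetical names, not part of OmniParser or any VLM library. The detector output is represented as plain records, and the small VLM is stood in for by a keyword heuristic so the example runs on its own. In a real setup the classifier callable would wrap the fine-tuned model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """One element proposed by the Stage 1 detector (hypothetical record)."""
    label: str                        # detector caption, e.g. "gear icon"
    bbox: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in screenshot pixels
    nearby_text: str = ""             # surrounding text used as context


# Assumption: the Stage 2 classifier returns a clickability score in [0, 1].
Classifier = Callable[[Candidate], float]


def filter_clickable(
    candidates: Iterable[Candidate],
    classify: Classifier,
    threshold: float = 0.5,
) -> List[Candidate]:
    """Stage 2: keep only candidates scored at or above the threshold."""
    return [c for c in candidates if classify(c) >= threshold]


def toy_classifier(candidate: Candidate) -> float:
    """Stand-in for the small VLM: scores by keywords in nearby text."""
    cues = ("settings", "menu", "submit", "open")
    text = candidate.nearby_text.lower()
    return 0.9 if any(cue in text for cue in cues) else 0.1


# Stage 1 output (hard-coded here in place of a real detector run).
detected = [
    Candidate("gear icon", (0, 0, 16, 16), nearby_text="Settings"),
    Candidate("gear icon", (100, 0, 200, 40), nearby_text="Acme Gearworks logo"),
]

# Only these candidates would be passed on to the main LLM.
clickable = filter_clickable(detected, toy_classifier)
```

The threshold is a tunable assumption: raising it trades missed controls for fewer clicks on decorative regions.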
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T19:38:26.449753+00:00— report_created — created