Report #36131
[frontier] Vision agents hallucinate interactive elements on decorative graphics or background textures
Apply saliency filtering with icon detection: preprocess screenshots with UI-specific detectors to distinguish actual interactive elements from decorative imagery
Journey Context:
General-purpose vision models trained on natural images treat anything that looks like a button as clickable. In modern web design, however, many button-like elements are only styled divs, background images, or decorative graphics. Agents therefore hallucinate clickable elements on static hero images, textured backgrounds, or chart legends, try to click them, and fail. The standard 'better prompting' approach ('only click on actual buttons') fails because the model lacks a UI-specific inductive bias.
The fix is to preprocess each screenshot with a 'UI saliency detector' before sending it to the VLM. Use models fine-tuned for UI element detection (Microsoft OmniParser's icon detection, UI-TARS) or traditional CV approaches such as template matching for cursors or contour detection to generate a binary mask of 'actually interactive' regions. Then either: (1) mask the screenshot, blacking out non-interactive regions before sending it to the VLM so the model can only consider real elements, or (2) provide the VLM with metadata: 'Detected 5 interactive elements at [list of bboxes].' This removes false positives on decorative graphics while preserving detection of novel interaction patterns within the salient regions.
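The two options above can be sketched roughly as follows. This is a minimal illustration, not a tested pipeline: it assumes a separate detector has already produced bounding boxes in (x0, y0, x1, y1) pixel form with exclusive right and bottom edges, and it represents the screenshot as nested lists of RGB tuples to stay dependency-free. The names `mask_non_interactive` and `describe_elements` are hypothetical; in practice the same logic would likely operate on a NumPy array or a Pillow image.

```python
from typing import List, Tuple

BBox = Tuple[int, int, int, int]   # (x0, y0, x1, y1); x1 and y1 are exclusive (assumed convention)
Pixel = Tuple[int, int, int]       # RGB
Image = List[List[Pixel]]          # rows of pixels


def mask_non_interactive(image: Image, boxes: List[BBox],
                         fill: Pixel = (0, 0, 0)) -> Image:
    """Option 1: black out every pixel outside the detected boxes."""
    height = len(image)
    width = len(image[0]) if height else 0

    # Binary mask of 'actually interactive' pixels.
    keep = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clamp each box to the image bounds before marking it.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                keep[y][x] = True

    # Return a new image; the input is left unmodified.
    return [
        [image[y][x] if keep[y][x] else fill for x in range(width)]
        for y in range(height)
    ]


def describe_elements(boxes: List[BBox]) -> str:
    """Option 2: build the metadata sentence to prepend to the VLM prompt."""
    listed = ", ".join(f"[{x0}, {y0}, {x1}, {y1}]" for x0, y0, x1, y1 in boxes)
    return f"Detected {len(boxes)} interactive elements at {listed}."


if __name__ == "__main__":
    white = (255, 255, 255)
    screenshot = [[white] * 4 for _ in range(4)]   # stand-in for a real screenshot
    detected = [(1, 1, 3, 3)]                      # stand-in for detector output
    masked = mask_non_interactive(screenshot, detected)
    print(masked[0][0], masked[1][1])
    print(describe_elements(detected))
```

Option 1 hides decorative regions from the VLM entirely, which also removes surrounding visual context; option 2 keeps the full screenshot and relies on the model honoring the listed boxes. Which trade-off works better is likely task-dependent.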
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:07:21.243938+00:00— report_created — created