Report #36312
[frontier] Vision-language models hallucinate non-existent UI elements (phantom buttons, wrong text), causing coordinate-based action failures
Implement constrained grounding: use UI detection models (YOLO-World, OmniParser) to generate a validated set of elements with IDs, then constrain the VLM to choose actions only from the detected element IDs rather than emitting free-form coordinates.
Journey Context:
Free-form VLMs (GPT-4V, Claude) hallucinate UI elements at ~5-15% rates on complex screens: they 'see' submit buttons that are actually cancel buttons, or invent icons. When agents act on the hallucinated coordinates, they click the wrong element or empty space and enter error cascades. Unlike text hallucinations, which can be self-corrected by re-reading, visual hallucinations are 'locked': the model insists the element exists. The emerging pattern is 'detect-then-describe'. First run a constrained detection model (one fine-tuned for UI elements, such as Microsoft's OmniParser or IconNet) to extract a structured list of interactable elements with bounding boxes and types. Then prompt the VLM with this structured context: 'Available elements: [0] Submit button at (x,y), [1] Cancel link...'. The VLM must output element IDs, not raw coordinates. This grounds the high-level reasoning in verified visual facts, reducing hallucination rates by an order of magnitude.
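A minimal sketch of the ID-constrained step described above, in Python. It assumes the detector has already run; the `UIElement` schema, the `CLICK <id>` reply format, and the helper names `build_prompt` and `resolve_action` are hypothetical illustrations, not the output format or API of OmniParser, YOLO-World, or any VLM provider.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """One detected element. Hypothetical schema, not a real detector's format."""
    element_id: int
    kind: str                        # e.g. "button", "link"
    label: str                       # text or caption reported by the detector
    box: tuple[int, int, int, int]   # (left, top, right, bottom) in pixels

    def center(self) -> tuple[int, int]:
        left, top, right, bottom = self.box
        return ((left + right) // 2, (top + bottom) // 2)


def build_prompt(task: str, elements: list[UIElement]) -> str:
    """List the detected elements so the VLM can only refer to them by ID."""
    lines = [f"[{e.element_id}] {e.label} {e.kind} at {e.center()}" for e in elements]
    return (
        f"Task: {task}\n"
        "Available elements:\n" + "\n".join(lines) + "\n"
        "Reply with exactly one line: CLICK <id>. Do not output coordinates."
    )


# Assumed reply format: the whole reply must be "CLICK <integer>".
_ACTION = re.compile(r"^\s*CLICK\s+(\d+)\s*$")


def resolve_action(reply: str, elements: list[UIElement]) -> tuple[int, int]:
    """Map a VLM reply to click coordinates, rejecting anything not detected."""
    match = _ACTION.match(reply)
    if match is None:
        raise ValueError(f"reply is not an ID-based action: {reply!r}")
    by_id = {e.element_id: e for e in elements}
    element_id = int(match.group(1))
    if element_id not in by_id:
        raise ValueError(f"element ID {element_id} was not detected on this screen")
    # Coordinates come from the detector's box, never from the VLM's text.
    return by_id[element_id].center()


if __name__ == "__main__":
    detected = [
        UIElement(0, "button", "Submit", (100, 200, 180, 240)),
        UIElement(1, "link", "Cancel", (200, 200, 260, 240)),
    ]
    print(build_prompt("Submit the form", detected))
    print(resolve_action("CLICK 0", detected))
```

Because the click point is derived from the detector's bounding box, a reply that names an undetected ID or supplies raw coordinates is rejected before any action runs, and the agent can re-prompt instead of clicking empty space.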
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T15:25:25.897885+00:00— report_created — created