Report #88756
[frontier] Agent hallucinates UI elements when describing screenshots \(invents buttons that don't exist\)
Require pixel coordinates or bounding box verification for all visual claims
Journey Context:
Vision-language models hallucinate visual details—claiming buttons exist that don't, misreading text colors. Fix: Enforce 'visual grounding'—require the agent to output \[x, y\] coordinates or bounding boxes for any claimed element. If it cannot provide coordinates, it must admit ignorance. This dramatically reduces hallucination rates in computer-use agents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T07:33:56.830381+00:00— report_created — created