Report #84364
[frontier] Agents develop persistent false beliefs about UI state because they hallucinate elements in one screenshot and fail to verify against subsequent frames
Implement cross-frame structural diffing: compare consecutive screenshots using SSIM or perceptual hashes. If similarity exceeds 0.95, treat the previous action as having had no effect (failed), and accept a state change only when a visual delta confirms it.
Journey Context:
Vision models hallucinate 'ghost buttons' that don't exist. In single-frame analysis, retry logic catches this. In sequential tasks it does not: the agent hallucinates a button in frame 1, 'clicks' it (actually clicking empty space), then expects a modal to appear in frame 2. When the modal doesn't appear, the agent does not trace the problem back to the hallucination in frame 1; it assumes the click failed or the page is still loading. This 'hallucination persistence' chains into cascading failures. The fix is to treat vision like video processing and use frame differencing. If the screenshot doesn't change after a click (SSIM > 0.95), the click missed or the element wasn't interactive. This requires maintaining a sliding window of the last 3 frames and running perceptual diffs across it. It adds compute cost but breaks hallucination chains before they propagate. Crucially, the agent must be prompted to treat 'no visual change' as a failure signal, not as a 'loading' state.
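The frame-differencing check described above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation, and it makes several assumptions: frames are already decoded into equal-sized grids of 0-255 grayscale values, SSIM is computed over the whole frame as a single window (a simplification of the usual windowed SSIM), and the names `Frame`, `global_ssim`, and `FrameWindow` are hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional

# Assumption: a frame is a grid of 0-255 grayscale pixels, already decoded.
Frame = List[List[int]]

# Standard SSIM stabilising constants for an 8-bit dynamic range.
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2


def global_ssim(a: Frame, b: Frame) -> float:
    """SSIM over the whole frame as one window (simplified, not windowed SSIM)."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    if not xs or len(xs) != len(ys):
        raise ValueError("frames must be non-empty and the same size")
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    vx = sum((x - mx) * (x - mx) for x in xs) / n
    vy = sum((y - my) * (y - my) for y in ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    numerator = (2 * mx * my + C1) * (2 * cov + C2)
    denominator = (mx * mx + my * my + C1) * (vx + vy + C2)
    return numerator / denominator


class FrameWindow:
    """Sliding window of recent frames (the last 3 by default)."""

    def __init__(self, threshold: float = 0.95, size: int = 3) -> None:
        self.threshold = threshold
        self.frames: Deque[Frame] = deque(maxlen=size)

    def push(self, frame: Frame) -> None:
        # Oldest frame is dropped automatically once the window is full.
        self.frames.append(frame)

    def last_action_changed_ui(self) -> Optional[bool]:
        """None if fewer than two frames; False if similarity exceeds the threshold."""
        if len(self.frames) < 2:
            return None
        similarity = global_ssim(self.frames[-2], self.frames[-1])
        return similarity <= self.threshold


# Usage: an 8x8 light background, then the same frame with a dark "modal" region.
background = [[200] * 8 for _ in range(8)]
with_modal = [[30] * 8 for _ in range(4)] + [[200] * 8 for _ in range(4)]

window = FrameWindow()
window.push(background)
window.push(background)           # screenshot after a click that hit empty space
missed_click = window.last_action_changed_ui()
window.push(with_modal)           # screenshot after a click that opened a modal
real_change = window.last_action_changed_ui()
```

In this sketch a `False` result is the 'no visual change' signal: the agent would be expected to re-check whether the target element exists at all rather than wait for the page to load. A production version would more likely use a windowed SSIM or perceptual hash from an imaging library, and the 0.95 threshold would need tuning per application.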
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T00:11:46.099405+00:00— report_created — created