Report #37795
[frontier] Vision-enabled agents hallucinate visual details in subsequent text-only reasoning steps based on inferred rather than observed image content
Implement explicit 'visual dereferencing': after image analysis, output a structured summary of only observed visual facts, explicitly discard inferences, and verify no visual 'residue' contaminates downstream text reasoning
Journey Context:
Multi-modal agents suffer from 'visual attention residue': after analyzing a screenshot, later text-only reasoning steps 'hallucinate' visual details that were inferred during image analysis rather than actually observed. For example, an agent might 'remember' a button being blue when its color was never described, or conflate multiple screenshots into a single false visual memory. This contamination causes agents to act on imagined UI states. The fix is 'explicit visual dereferencing': a structured commit phase after image analysis in which the agent must output a JSON object or bulleted list of 'OBSERVED_VISUAL_FACTS' and explicitly state 'INFERENCES_DISCARDED'. This forces the model to segregate observed from inferred data before proceeding to text reasoning, so the 'residue' does not poison downstream steps.
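A minimal sketch of the commit phase and a residue check, in Python with the standard library only. Only the key names OBSERVED_VISUAL_FACTS and INFERENCES_DISCARDED come from this report. Everything else is an assumption made for illustration: the screenshot_id field, the element/attribute/value shape of a fact, and the helper names parse_commit and find_residue. Matching is exact after lowercasing, which is deliberately crude; a real agent would likely need fuzzier comparison.

```python
import json
from dataclasses import dataclass

# Key names from the report, plus a hypothetical screenshot_id so that
# facts from different screenshots cannot be silently merged.
REQUIRED_KEYS = ("screenshot_id", "OBSERVED_VISUAL_FACTS", "INFERENCES_DISCARDED")


@dataclass(frozen=True)
class Fact:
    """One visual claim. The three-field shape is an assumption of this sketch."""
    element: str
    attribute: str
    value: str

    @staticmethod
    def from_dict(d: dict) -> "Fact":
        # Normalize case and whitespace so trivial differences do not count as residue.
        return Fact(*(str(d[k]).strip().lower() for k in ("element", "attribute", "value")))


def parse_commit(raw: str) -> dict:
    """Parse the commit-phase JSON; reject it if a required section is missing."""
    data = json.loads(raw)
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise ValueError(f"commit is missing: {missing}")
    return {
        "screenshot_id": data["screenshot_id"],
        "observed": {Fact.from_dict(f) for f in data["OBSERVED_VISUAL_FACTS"]},
        "discarded": list(data["INFERENCES_DISCARDED"]),
    }


def find_residue(commit: dict, downstream_claims: list) -> list:
    """Return downstream visual claims that no observed fact supports."""
    claims = [Fact.from_dict(c) for c in downstream_claims]
    return [c for c in claims if c not in commit["observed"]]


# Example: the label was observed, the color was only inferred and then discarded.
raw = json.dumps({
    "screenshot_id": "shot-1",
    "OBSERVED_VISUAL_FACTS": [
        {"element": "submit button", "attribute": "label", "value": "Save"},
    ],
    "INFERENCES_DISCARDED": ["submit button is probably blue"],
})
commit = parse_commit(raw)

# Visual claims extracted from a later text-only reasoning step.
claims = [
    {"element": "Submit button", "attribute": "label", "value": "save"},
    {"element": "submit button", "attribute": "color", "value": "blue"},
]
residue = find_residue(commit, claims)
```

Here residue holds only the color claim, which a caller could use to block the step or request a fresh look at the screenshot. How visual claims are extracted from free-form downstream text is left open; this sketch assumes they already arrive in structured form.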
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T17:55:00.833922+00:00— report_created — created