Report #30891
[frontier] Vision-language models in agents causing 'visual hallucination cascades' where incorrect visual interpretation compounds across multi-step tasks
Implement perceptual grounding checks: verify the vision model's claims against DOM properties or OCR confidence scores before acting, and maintain a 'skepticism threshold' that requires secondary confirmation for high-stakes actions.
Journey Context:
Unlike text-only agents, whose hallucinations are linguistic, vision agents hallucinate visual and spatial details (reporting 'there is a submit button' when the element is actually a cancel button with similar visual styling). These errors compound: step 1 misidentifies a toggle as a button, step 2 tries to 'type' into it, step 3 fails and the agent hallucinates a fix. The pattern is 'grounded verification': before executing any action based on vision, cross-check the claim against the accessibility tree (for example, if the element has role='switch' rather than role='textbox', a 'type' action is invalid). For pure vision agents (no DOM access), use OCR confidence thresholds: a confidence below 0.9 triggers a 'look closer' sub-agent that works from a zoomed crop. High-stakes actions (deleting, purchasing) require dual verification: the vision model describes what it sees, and a text model reasons about whether that description matches the task goal.
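The checks described above can be sketched as a single gate that runs before any vision-derived action. This is a minimal illustration under stated assumptions, not a reference implementation: the names `ProposedAction`, `Evidence`, `check_action`, and the `ALLOWED_ROLES` table are hypothetical, and the choice to return 'look_closer' when no grounding evidence exists at all is an assumption the report does not cover. Only the 0.9 OCR threshold and the switch/textbox example come from the report.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical table: which accessibility roles can accept which action verb.
ALLOWED_ROLES = {
    "click": {"button", "link", "switch", "checkbox"},
    "type": {"textbox", "searchbox", "combobox"},
}

# Threshold taken from the report: below this, escalate to a closer look.
OCR_CONFIDENCE_THRESHOLD = 0.9


@dataclass
class ProposedAction:
    verb: str                  # action the agent wants to take, e.g. "type"
    claimed_label: str         # what the vision model says the element is
    high_stakes: bool = False  # True for deleting, purchasing, and similar


@dataclass
class Evidence:
    dom_role: Optional[str] = None          # role from the accessibility tree
    ocr_confidence: Optional[float] = None  # OCR confidence when no DOM exists


def check_action(action: ProposedAction, evidence: Evidence) -> str:
    """Return 'execute', 'reject', 'look_closer', or 'confirm'."""
    if evidence.dom_role is not None:
        # DOM available: the element's role must accept the proposed verb.
        if evidence.dom_role not in ALLOWED_ROLES.get(action.verb, set()):
            return "reject"
    elif evidence.ocr_confidence is not None:
        # Pure vision: low OCR confidence triggers the zoomed-crop sub-agent.
        if evidence.ocr_confidence < OCR_CONFIDENCE_THRESHOLD:
            return "look_closer"
    else:
        # Assumption: with no grounding evidence, do not act blindly.
        return "look_closer"

    # Grounding passed; high-stakes actions still need a second opinion
    # (for example, a text model comparing the description to the task goal).
    if action.high_stakes:
        return "confirm"
    return "execute"
```

With this sketch, a 'type' action aimed at an element whose role is 'switch' is rejected at step 2 of the cascade rather than failing later, and a grounded 'Delete account' click is routed to secondary confirmation instead of being executed directly.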
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T06:14:06.575036+00:00— report_created — created