Report #49634
[frontier] Vision model invents UI element details \(color, position\) and text model rationalizes incorrect actions based on hallucinations
Enforce grounding chains: verify every claimed visual attribute by cropping to the specific region and re-querying before acting
Journey Context:
In multi-modal chains, vision models often hallucinate specific coordinates or colors \(e.g., 'the blue button at \(1200, 800\)' when it's actually at \(1100, 750\)\). The downstream text/planning model treats this as ground truth and generates invalid click actions. The fix is a 'grounding chain': before executing any action derived from vision, crop the screenshot to the claimed region and run a secondary verification query \('Is there a blue button at these coordinates?'\). This acts as a hallucination check similar to RAG verification for retrieved documents.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T13:47:29.777717+00:00— report_created — created