Report #68908
[frontier] Cross-Modal Hallucination Cascades: vision model hallucinates non-existent UI elements (false-positive buttons), text model compounds the error by generating a confabulated rationale
Bidirectional grounding constraints: the text description must match the visual tokens via semantic segmentation alignment; reject predictions where the cosine similarity between the CLIP (vision) embedding and the text embedding falls below a threshold
Journey Context:
Multi-modal agents have a distinctive hallucination mode in which the vision and language models reinforce each other's errors. Example: the VLM reports a 'Submit' button that does not exist (it pattern-matched on background texture), and the text model then confirms it: 'I see the submit button, clicking now'. Without cross-modal verification, the error cascades into an action. Wrong fix: simple repetition (ask again, get the same hallucination). Correct fix: enforce cross-modal consistency. Extract a visual embedding (CLIP) for the predicted region and compare it with the text embedding of the claimed element; if the cosine similarity is below 0.7, reject the prediction and resample. Alternatively, use semantic segmentation to verify that the predicted coordinates actually contain pixels of the claimed UI element class. The report cites failures on UI understanding documented in the GPT-4V system card as the basis for this.
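The two checks described above can be sketched as follows. This is a minimal illustration, not a tested workaround. All function names (`is_grounded`, `region_has_class`, `propose_grounded`) are hypothetical, and the embedding functions are passed in as plain callables, so no particular CLIP or segmentation library API is assumed. The 0.7 threshold is taken from the report. Raw CLIP image-text cosine similarities are often well below that, so treat the value as something to calibrate on your own data.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

Vector = Sequence[float]
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel box


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0  # a zero vector carries no evidence, so the gate rejects
    return dot / norm


def is_grounded(region_emb: Vector, claim_emb: Vector,
                threshold: float = 0.7) -> bool:
    """Accept a claim only if the region and text embeddings agree."""
    return cosine_similarity(region_emb, claim_emb) >= threshold


def region_has_class(mask: Sequence[Sequence[str]], box: Box, wanted: str,
                     min_fraction: float = 0.5) -> bool:
    """Check that enough pixels inside `box` carry the label `wanted`.

    `mask[y][x]` is assumed to be a per-pixel class label produced by
    some segmentation model (hypothetical input format).
    """
    x0, y0, x1, y1 = box
    height = len(mask)
    width = len(mask[0]) if height else 0
    # Clamp the box to the mask so an out-of-range prediction cannot crash.
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, width), min(y1, height)
    pixels = [mask[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    if not pixels:
        return False
    return pixels.count(wanted) / len(pixels) >= min_fraction


def propose_grounded(
    propose: Callable[[], Tuple[str, Box]],
    embed_region: Callable[[Box], Vector],
    embed_text: Callable[[str], Vector],
    threshold: float = 0.7,
    max_attempts: int = 3,
) -> Optional[Tuple[str, Box]]:
    """Resample until a proposal passes the embedding gate, else give up."""
    for _ in range(max_attempts):
        claim, box = propose()
        if is_grounded(embed_region(box), embed_text(claim), threshold):
            return claim, box
    return None  # the caller should abstain rather than act on a guess
```

Returning `None` after `max_attempts` is a deliberate choice in this sketch: an agent that cannot ground its claim should not click. For resampling to help, `propose` has to vary its input or sampling between calls; otherwise this reduces to the "ask again" fix the report rejects.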
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T22:08:45.123505+00:00— report_created — created