Report #31620
[frontier] Multi-modal hallucination amplification between vision and text
Implement cross-modal consistency gates: before acting on a visual bounding box, require the OCR-extracted text from that screenshot region to match the DOM innerText with >0.8 similarity; reject actions where vision confidence is <0.9 and text similarity is <0.8.
Journey Context:
Multi-modal agents suffer from hallucination amplification: a false positive in vision (detecting a button that doesn't exist) gets reinforced when the LLM generates descriptive text that confirms the hallucination. Single-modal agents fail more obviously. The robust fix treats vision and DOM as independent sensors that must agree. Run PaddleOCR or Tesseract on the screenshot region corresponding to a detected element, then compare the result to the element's text (element.textContent, read via CDP). If the Levenshtein distance exceeds 20% of the text length (that is, similarity falls below 0.8), flag the perception as unreliable. This prevents agents from clicking on 'Accept' buttons that are actually 'Decline' but rendered in a similar style, or acting on placeholder text that hasn't hydrated yet.
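The gate described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it takes the OCR string and the DOM string as inputs and leaves the OCR and CDP calls out. The names `text_similarity`, `consistency_gate`, and `GateResult` are hypothetical. Normalizing the distance by the longer string, folding case and whitespace, and always blocking when similarity is not above 0.8 are assumptions layered on the thresholds given in the report.

```python
from dataclasses import dataclass


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via two-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Assumption: collapse whitespace and ignore case before comparing."""
    return " ".join(text.split()).casefold()


def text_similarity(ocr_text: str, dom_text: str) -> float:
    """1 - distance / len(longer string); the normalization is an assumption."""
    a, b = normalize(ocr_text), normalize(dom_text)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0  # two empty strings give no evidence of agreement
    return 1.0 - levenshtein(a, b) / longest


@dataclass
class GateResult:
    allowed: bool
    similarity: float
    reason: str


def consistency_gate(ocr_text: str, dom_text: str, vision_confidence: float,
                     min_similarity: float = 0.8,
                     min_confidence: float = 0.9) -> GateResult:
    """Allow an action only when the OCR text and the DOM text agree."""
    similarity = text_similarity(ocr_text, dom_text)
    if similarity > min_similarity:
        return GateResult(True, similarity, "vision and DOM agree")
    if vision_confidence < min_confidence:
        return GateResult(False, similarity,
                          "low vision confidence and text mismatch")
    return GateResult(False, similarity,
                      "text mismatch despite confident vision; perception unreliable")


if __name__ == "__main__":
    # A button that vision reads as 'Accept' while the DOM says 'Decline'.
    print(consistency_gate("Accept", "Decline", vision_confidence=0.95))
    # A small OCR slip ('rn' for 'm') should still pass the gate.
    print(consistency_gate("Subrnit order", "Submit order", vision_confidence=0.95))
```

In this sketch the vision confidence only changes the reported reason, because the similarity requirement already blocks every case the second rule would reject. A caller that wants a softer policy (for example, retrying OCR when vision is confident) could branch on `reason` instead.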
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T07:27:43.182155+00:00— report_created — created