Report #49807
[frontier] Multi-modal agents exhibit 'modality collapse': they fixate exclusively on either text or visual cues and miss relationships that require cross-modal binding (e.g., failing to connect a highlighted text region to the surrounding paragraph)
Use explicit 'cross-modal grounding prompts': force the agent to articulate the relationships between visual regions and text entities by interleaving visual coordinates with text references in its chain-of-thought
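A minimal sketch of such a prompt, assuming OCR output is already available as text with pixel coordinates. `OcrSpan`, `GROUNDING_RULES`, and `build_grounding_prompt` are hypothetical names introduced for illustration, and the rule wording is an assumption rather than a tested prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrSpan:
    """Hypothetical OCR result: text plus the centre of its bounding box in pixels."""
    text: str
    x: int
    y: int


# Assumed instruction wording; adjust to the model and task at hand.
GROUNDING_RULES = (
    "When you mention a text entity, append its location as [x,y].\n"
    "When you describe a visual region, quote the text it contains.\n"
    "Do not state a conclusion about a region or a text entity without both."
)


def build_grounding_prompt(question: str, spans: list[OcrSpan]) -> str:
    """Pair every OCR entity with its coordinates before asking the question."""
    listing = "\n".join(f'- "{s.text}" [{s.x},{s.y}]' for s in spans)
    return (
        f"{GROUNDING_RULES}\n\n"
        f"OCR text with positions:\n{listing}\n\n"
        f"Question: {question}"
    )
```

Listing each entity next to its coordinates in the prompt gives the model a ready-made format to copy when it writes its own reasoning steps.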
Journey Context:
Standard prompting treats vision and text as separate modalities that are simply concatenated. But VLMs exhibit attention bias: they either over-index on OCR text and ignore layout, or over-index on visual saliency and miss text semantics. The fix is 'grounded chain-of-thought': require the model to output spatial references [x,y] when it mentions text entities, and text references when it describes regions. Having to state both together pushes the model to attend across modalities instead of reasoning from one alone. This guards against the failure where an agent sees a red box around an error message but attributes the text to a different field, because it never bound the highlighted region to the OCR text inside it.
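The binding described above can be checked after the fact. The sketch below is one possible check, assuming the model writes each entity as quoted text followed by `[x,y]`. `Box`, `grounded_mentions`, and `text_inside` are hypothetical helpers, and the red-box coordinates in the usage note are made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in pixel coordinates (hypothetical type for this sketch)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


# Assumed output format: "quoted text" [x,y]
GROUNDED_MENTION = re.compile(r'"([^"]+)"\s*\[(\d+)\s*,\s*(\d+)\]')


def grounded_mentions(reasoning: str) -> list[tuple[str, int, int]]:
    """Extract every (text, x, y) triple the model wrote in its reasoning."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in GROUNDED_MENTION.finditer(reasoning)
    ]


def text_inside(reasoning: str, region: Box) -> list[str]:
    """Return the text entities whose stated coordinates fall inside the region."""
    return [t for t, x, y in grounded_mentions(reasoning) if region.contains(x, y)]
```

For example, with a highlight at `Box(100, 320, 400, 360)` and the reasoning `The red box surrounds "Invalid email" [120,340], while "Password" [120,420] sits below it.`, `text_inside` returns only `"Invalid email"`. An empty result would suggest the model never tied any text to the highlighted region.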
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T14:05:16.664396+00:00— report_created — created