Report #60493
[frontier] Agents develop recency bias toward the most recently used modality, over-weighting visual features after image analysis or text features after reading
Implement 'Explicit Modality Re-Weighting': at each decision point, explicitly prompt the agent to declare which modality (visual, textual, or structural) contains the most relevant evidence for the current subtask, effectively forcing a 'modality attention head' that counteracts recency bias.
Journey Context:
Multi-modal LLMs don't treat modalities equally by default; in practice they tend to favor whichever input was most recent or most salient. In agent loops, this creates dangerous oscillations: an agent looks at a screenshot, becomes 'fixated' on visual layout, ignores the text instructions it read three steps earlier, makes a visual error, and then over-corrects by ignoring visual cues and trusting only DOM text. This pattern is 'cross-modal attention drift.' The standard fix of 'just include both' doesn't work because the model still implicitly weights the modalities unevenly based on recency. The emerging solution is meta-cognitive: force the agent to explicitly arbitrate between modalities. Before making a decision, it must answer: 'For this specific subtask, is the critical information in the visual layout, the text content, or the structural hierarchy?' Answering that question turns the weighting into an explicit, per-subtask choice rather than a side effect of input order, which is what counteracts the default recency bias.
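The arbitration step described above could be sketched roughly as follows. This is a minimal illustration, not a tested agent component: the prompt wording, the 'MODALITY:' reply format, and every name in it (build_arbitration_prompt, parse_declared_modality, arbitrate, call_model) are assumptions made for the example, and call_model stands in for whatever LLM client the agent already uses.

```python
from typing import Callable, Optional

# The three modalities named in the report.
MODALITIES = ("visual", "textual", "structural")

# Hypothetical prompt wording; adjust to the agent's own prompt style.
ARBITRATION_TEMPLATE = (
    "Subtask: {subtask}\n"
    "Before acting, declare which modality holds the most relevant evidence "
    "for this subtask.\n"
    "Reply with a first line of the form 'MODALITY: visual', "
    "'MODALITY: textual', or 'MODALITY: structural', "
    "followed by a one-sentence reason."
)


def build_arbitration_prompt(subtask: str) -> str:
    """Return the question the agent must answer before each decision."""
    return ARBITRATION_TEMPLATE.format(subtask=subtask)


def parse_declared_modality(reply: str) -> Optional[str]:
    """Extract the declared modality, or None if the reply is malformed."""
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "modality":
            choice = value.strip().lower().strip(" <>.")
            return choice if choice in MODALITIES else None
    return None


def arbitrate(
    subtask: str,
    call_model: Callable[[str], str],
    max_attempts: int = 2,
) -> Optional[str]:
    """Ask the model to declare a modality, re-asking if the reply is unusable.

    Returns None when no valid declaration is obtained, so the caller can
    fall back to its default behavior instead of guessing.
    """
    prompt = build_arbitration_prompt(subtask)
    for _ in range(max_attempts):
        declared = parse_declared_modality(call_model(prompt))
        if declared is not None:
            return declared
    return None
```

In an agent loop, arbitrate would run once per decision point, and the returned modality would decide which evidence (screenshot, page text, or DOM tree) the next prompt foregrounds. Returning None on a malformed reply is a deliberate choice here: it keeps a parsing failure from being silently treated as a modality decision.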
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:01:35.918488+00:00— report_created — created