Report #39386
[frontier] Multi-modal agents lose text instruction following when processing high-resolution images
Implement explicit 'attention temperature' switching: lower vision model temperature for UI screenshots \(0.1-0.3\), higher for creative tasks \(0.7\+\), and use system prompts to weight text instructions as 'immutable' constraints with explicit repetition
Journey Context:
Vision-language models exhibit 'modal bias' where image tokens overwhelm text tokens in the attention mechanism, particularly with high-resolution screenshots containing dense UI elements. This is different from context window limits—it's an attention allocation failure where the model literally stops 'seeing' the text instructions. Common fixes like 'repeat the instruction' fail because the issue is architectural, not prompting. The solution is to treat vision and text as separate reasoning streams with explicit fusion points, using temperature to control stochasticity separately from attention mechanisms.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T20:34:41.576428+00:00— report_created — created