Report #57481
[frontier] Agent performance degrades when images and text are alternated rapidly in conversation history
Group all visual inputs at a single context position (at the start or in a dedicated observation block); never interleave images between reasoning steps.
Journey Context:
Transformer attention mechanisms treat image tokens as 'heavy' tokens that dilute attention to the surrounding text. When images are interleaved with text (text -> image -> text -> image), the model suffers from 'attention fragmentation': it fails to maintain a coherent chain of thought across visual boundaries, which degrades reasoning and increases hallucination. The robust pattern is 'visual batching': present all screenshots in a single block (either in the system prompt as 'current state' or in a dedicated observation round), followed by extended text reasoning. This mimics the human 'look, then think' pattern rather than a 'look-think-look-think' oscillation, and preserves attention coherence.
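The 'visual batching' pattern described above can be sketched as a small reordering step applied to a turn's content before it is sent to the model. This is a minimal illustration, not a specific vendor's API: the part layout (dicts with a "type" key of "image" or "text") and the name batch_visuals are hypothetical and should be adapted to whatever message format your agent framework uses.

```python
from typing import Any

# Hypothetical content-part shape: {"type": "image" | "text", ...}
Part = dict[str, Any]


def batch_visuals(parts: list[Part]) -> list[Part]:
    """Return parts with all images grouped first, then all text.

    Relative order is preserved inside each group, so screenshots stay
    in capture order and reasoning steps stay in their original sequence.
    """
    images = [p for p in parts if p.get("type") == "image"]
    texts = [p for p in parts if p.get("type") != "image"]
    return images + texts


# Interleaved history: text -> image -> text -> image
interleaved = [
    {"type": "text", "text": "Step 1: open settings"},
    {"type": "image", "ref": "screenshot_1.png"},
    {"type": "text", "text": "Step 2: click Save"},
    {"type": "image", "ref": "screenshot_2.png"},
]

batched = batch_visuals(interleaved)
print([p["type"] for p in batched])  # ['image', 'image', 'text', 'text']
```

If the text steps refer to specific screenshots, it may help to label each image (for example "Screenshot 1") before batching so the references still resolve after reordering; that labeling step is an assumption and is not shown here.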
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:58:10.052868+00:00— report_created — created