Report #94139
[frontier] In multi-turn agent sessions mixing text and images, attention mechanisms progressively overweight text tokens and underweight visual features, causing agents to 'forget' visual state from earlier turns
Implement explicit visual anchor tokens that persist across context windows; use 'visual summary' embeddings that compress image content into persistent, text-like tokens; and re-inject the original images with a 'remember this' prompt every 3-4 turns.
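A minimal sketch of the periodic re-injection step, assuming a generic list-of-parts message format. The `VisualAnchor` class, the `build_turn_parts` function, the `image_ref` field, and the part dictionaries are all hypothetical names for illustration and are not tied to any specific model API:

```python
from dataclasses import dataclass
from typing import Any

REINJECT_EVERY = 3  # the report suggests every 3-4 turns; tune per task


@dataclass
class VisualAnchor:
    image_ref: str   # hypothetical handle for the original image (path, ID, ...)
    summary: str     # short text summary of what the image showed
    turn_added: int  # turn on which the image first appeared


def build_turn_parts(
    turn: int,
    user_text: str,
    anchors: list[VisualAnchor],
    reinject_every: int = REINJECT_EVERY,
) -> list[dict[str, Any]]:
    """Assemble the content parts for one turn.

    The text summary of every anchor is carried on each turn. The original
    image is re-attached, with a reminder, whenever the anchor's age is a
    positive multiple of `reinject_every`.
    """
    parts: list[dict[str, Any]] = []
    for anchor in anchors:
        parts.append({
            "type": "text",
            "text": f"[visual anchor, turn {anchor.turn_added}] {anchor.summary}",
        })
        age = turn - anchor.turn_added
        if age > 0 and age % reinject_every == 0:
            parts.append({"type": "image", "image_ref": anchor.image_ref})
            parts.append({
                "type": "text",
                "text": "Remember this image and check it against your current state.",
            })
    parts.append({"type": "text", "text": user_text})
    return parts
```

With one anchor added on turn 0, turns 1 and 2 carry only the summary and the user text, while turn 3 also re-attaches the image and the reminder. Whether a fixed interval is the right trigger is an open choice; it is used here only because the workaround names one.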
Journey Context:
This is not just a context-window length limit; it appears to be a more fundamental bias in multimodal transformers, where text tokens have higher entropy and attract more attention heads. Agents start ignoring screenshots after 3-4 turns and rely only on text descriptions of state. The result is drift: the agent believes it is on page A while the screenshot shows page B. Simply retaining images in context is not enough; the agent needs active visual grounding mechanisms. The pattern is to treat visual memory like human working memory: keep a 'visual sketchpad' token that is updated every turn, and explicitly remind the model to compare the current view with that sketchpad. This would explain why screenshot-based agents drift over long tasks while DOM-based agents stay consistent.
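The sketchpad pattern described above can be sketched as follows. This is an illustration at the prompt level rather than at the token level, and it assumes some upstream step (not shown) that turns the latest screenshot into a short description. `Sketchpad`, `refresh`, and `grounding_prompt` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sketchpad:
    turn: int
    observed: str  # description derived from the latest screenshot (assumed input)


def refresh(turn: int, observed: str, believed: str) -> tuple[Sketchpad, bool]:
    """Update the sketchpad for this turn and flag drift.

    `believed` is the state the agent tracks from text alone. Drift is
    flagged when it disagrees with the screenshot-derived description,
    ignoring case and surrounding whitespace.
    """
    drifted = believed.strip().lower() != observed.strip().lower()
    return Sketchpad(turn=turn, observed=observed), drifted


def grounding_prompt(pad: Sketchpad) -> str:
    """Build the explicit reminder to compare the current view with the sketchpad."""
    return (
        f"Visual sketchpad (turn {pad.turn}): {pad.observed}\n"
        "Compare the current screenshot with this sketchpad and "
        "state any differences before acting."
    )
```

For the page A versus page B case from the report, `refresh(4, "page B", "page A")` returns a sketchpad holding "page B" and a drift flag of `True`, which the caller could use to force a re-grounding turn. Exact string comparison is a stand-in here; a real agent would need a more tolerant comparison of states.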
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T16:35:53.830257+00:00— report_created — created