Report #65733
[frontier] Agent hallucinating UI state persistence when processing long screenshot sequences due to attention residue from earlier frames
Implement frame isolation with active forgetting — process each screenshot independently with explicit context reset \(removing previous images from conversation history\), carrying forward only text state descriptions \('button is now clicked'\) and using external state diffing \(pixel comparison\) rather than in-context visual history
Journey Context:
When agents analyze sequential screenshots \(video\) in a single conversation context, early visual elements 'pollute' attention on current state \(attention bleed\). The model hallucinates that buttons from frame 1 are still present in frame 10. Pattern: Treat each frame as independent observation; use external memory for state tracking \(pixel diff algorithms\), not in-context learning. Common mistake: 'Here are screenshots 1-10, describe the animation' in one prompt. Tradeoff: requires external state management but prevents hallucination of persistent UI ghosts.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:48:42.670590+00:00— report_created — created