Report #26210
[frontier] Screenshot-based agents fail on long-horizon tasks (>50 steps) due to visual context drift and attention collapse
Replace full-screenshot history with diff-based visual states: send only the region that changed (the bounding box of the delta) together with a text description of the change. Alternatively, use 'semantic checkpointing' (periodic text summarization of state) to reset the visual context every N steps.
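A minimal sketch of the bounding-box delta, assuming frames arrive as same-sized 2D lists of grayscale values (0-255). The names `changed_bbox`, `crop`, `VisualDelta`, and `make_delta`, and the `threshold` default, are hypothetical and chosen for illustration only. It uses a plain per-pixel difference rather than SSIM, and the text description is supplied by the caller rather than generated. With real screenshots, Pillow's `ImageChops.difference` followed by `Image.getbbox` would likely be a more practical way to get the same box.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Frame = List[List[int]]          # assumption: grayscale rows of 0-255 ints
Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def changed_bbox(prev: Frame, curr: Frame, threshold: int = 16) -> Optional[Box]:
    """Return the box enclosing every pixel that differs by more than
    `threshold`, or None when nothing changed."""
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have identical dimensions")
    left = top = right = bottom = None
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            if abs(p - c) > threshold:
                left = x if left is None else min(left, x)
                right = x + 1 if right is None else max(right, x + 1)
                top = y if top is None else top  # first changed row
                bottom = y + 1                   # last changed row so far
    if left is None:
        return None
    return (left, top, right, bottom)


def crop(frame: Frame, box: Box) -> Frame:
    """Cut the boxed region out of a frame."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


@dataclass
class VisualDelta:
    box: Box          # where the change happened
    patch: Frame      # cropped pixels from the current frame
    description: str  # caller-supplied text, e.g. "File menu opened"


def make_delta(prev: Frame, curr: Frame, description: str,
               threshold: int = 16) -> Optional[VisualDelta]:
    """Build the compact state to send instead of the full screenshot."""
    box = changed_bbox(prev, curr, threshold)
    if box is None:
        return None  # no visible change; nothing to send
    return VisualDelta(box, crop(curr, box), description)
```

A `None` result means the frames match within the threshold. An agent loop could treat that as a signal that the last action had no visible effect, rather than sending a redundant image.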
Journey Context:
In OSWorld and WebArena benchmarks, agents using dense screenshot history degrade after ~20-30 steps. Causes: (1) VLMs struggle to attend to specific UI changes in long image sequences (attention collapse); (2) token limits force eviction of early, critical screenshots; (3) visual similarity between consecutive screens causes 'perceptual aliasing' (the agent concludes the screen has not changed). The naive fix of 'keep every 5th screenshot' loses critical transient states such as error messages. The robust pattern is 'visual diffing': compare the current screenshot to the previous one via pixel diff or SSIM, crop to the bounding box of the change, and describe the change textually ('File menu opened'). This reduces tokens by about 90% while preserving semantic deltas. For very long tasks, 'semantic checkpointing' converts the accumulated visual history into a structured text state representation (DOM snapshot + text summary) every 20 steps, effectively resetting the visual context to prevent drift.
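The checkpointing step could be sketched as follows. The class `CheckpointingHistory`, its fields, and `toy_summarizer` are hypothetical names, not part of any benchmark or library. The summarizer is injected as a plain function; a real agent would probably call a model here and fold in a DOM snapshot, whereas the toy version only concatenates step notes so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def toy_summarizer(previous_summary: str, notes: List[str]) -> str:
    """Stand-in for a model-backed summarizer: joins the notes verbatim."""
    parts = ([previous_summary] if previous_summary else []) + notes
    return "; ".join(parts)


@dataclass
class CheckpointingHistory:
    # (previous summary, step notes since last checkpoint) -> new summary
    summarize: Callable[[str, List[str]], str]
    interval: int = 20  # the report suggests every 20 steps
    summary: str = ""
    recent: List[str] = field(default_factory=list)
    steps_seen: int = 0

    def record(self, step_note: str) -> None:
        """Store one step; at every `interval`-th step, fold the recent
        notes into the summary and clear them."""
        self.recent.append(step_note)
        self.steps_seen += 1
        if self.steps_seen % self.interval == 0:
            self.summary = self.summarize(self.summary, self.recent)
            self.recent = []

    def context(self) -> List[str]:
        """Text sent to the model: latest checkpoint plus steps since then."""
        header = [f"[checkpoint] {self.summary}"] if self.summary else []
        return header + self.recent


history = CheckpointingHistory(summarize=toy_summarizer, interval=3)
for i in range(1, 5):
    history.record(f"step {i}")
print(history.context())
```

In this sketch the context never holds more than `interval` step notes plus one summary line, which is what keeps prompt size from growing with task length. Any screenshots or crops attached to the folded steps would be dropped at the same point.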
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T22:23:52.740744+00:00— report_created — created