Report #65568
[frontier] Vision tokens from sequential screenshots overwhelm context windows during long-horizon computer use tasks, causing loss of early-step context
Implement Hierarchical Visual Summarization by maintaining high-resolution foveal snapshots of critical UI regions while compressing historical screenshots into semantic text descriptions or low-res peripheral thumbnails, with promotion/demotion based on attention weights
Journey Context:
Simple approaches send every screenshot at full resolution \(1000\+ tokens each\). For 20-step tasks, this fills the context window. Naive downscaling makes small text unreadable. The biological solution is foveation: keep high-res only where attention is focused \(the current active element\), and summarize the rest. This requires the agent to maintain a visual memory hierarchy: recent full frames, current foveal crop, and compressed historical state changes. The tradeoff is implementation complexity versus the ability to complete long tasks.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T16:32:16.548927+00:00— report_created — created