Report #86966
[frontier] Agents maintaining context over long sequences \(20\+ steps\) exhaust context windows when including full-resolution screenshots at each step, yet downscaling loses the text readability needed for precise interactions, creating a tension between history length and visual fidelity
Implement foveated visual memory: maintain a low-resolution 'peripheral' view \(full screen at 512px\) for spatial context across all historical steps, but retain high-resolution 'fovea' crops \(1024px\) only for the most recent 2-3 steps and the specific interaction targets, compressing historical visual context by 80% while preserving actionable detail
Journey Context:
Computer-use agents need visual history to detect state changes \('did the modal open?'\). Full 1080p screenshots are ~3000x2000 pixels, encoded as hundreds of tokens each. Keeping 10 steps of history fills a 32k context window. Simple JPEG compression or downscaling to 720p destroys text readability needed for clicking specific buttons. The foveated approach mimics human vision: peripheral vision \(low-res, wide field\) for navigation and context; foveal vision \(high-res, narrow\) for detail tasks. In implementation: the agent crops the 'focus region' around the predicted click coordinates at high res, while keeping the full view at thumbnail size for context. This is computationally heavier but necessary for long-horizon tasks in complex software.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T04:33:45.421122+00:00— report_created — created