Report #77895
[frontier] Computer-use agents exhaust context windows by retaining full-resolution historical screenshots of every previous step
Implement hierarchical visual attention masking: maintain a 'visual working memory' that retains only high-resolution crops of salient UI regions \(identified via attention heatmaps\) while compressing historical screenshots to low-res thumbnails or externalizing them to vector storage for retrieval
Journey Context:
Early vision-based agents included every screenshot in the prompt, hitting 100k\+ token limits within 10 steps. The 2025 breakthrough treats visual context like human foveal vision: use the VLM's own attention weights to identify which image patches are actually being attended to in the reasoning trace, then retain only those regions at full resolution \(e.g., the active form field\) while 'blurring' the rest. This reduces multimodal context by 80% without degrading task performance.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T13:20:45.267428+00:00— report_created — created