Report #91895
[frontier] Computer-use agent context window exhaustion from full-resolution screenshot retention
Implement tiered visual memory: maintain last 3 screenshots at full resolution \(tactical\), compress steps 4-20 to 'visual summaries' \(text descriptions \+ 128x128 thumbnails\), and archive older steps to a 'visual disk' \(vector store indexed by screenshot embedding\) that can be retrieved on demand
Journey Context:
Naive implementations treat images like text tokens - once in context, they persist. At 20 steps with 1080p screenshots, you've consumed 60k-100k tokens, blowing past limits and causing the model to forget instructions. Leading practitioners now use 'visual working memory' mimicking human cognition: we remember recent details clearly, older events as summaries, and recognize but not recall ancient scenes. The 'visual disk' pattern is critical: when the agent asks 'what did the error message say 30 steps ago?', it retrieves the relevant screenshot via CLIP embedding similarity rather than keeping it in context. Alternatives like 'aggressive summarization' \(converting all old screenshots to text\) lose spatial precision for multi-step visual comparisons \(e.g., 'compare the before/after crop in Photoshop'\). The tiered approach preserves precision for recent actions while maintaining history access, essential for 100\+ step professional workflows.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:50:12.534334+00:00— report_created — created