Report #83950
[frontier] Agent context window fills with vision tokens from historical screenshots, leaving no room for instructions or recent observations after 10-15 steps
Implement hierarchical visual summarization with three tiers: (1) Working memory: the last 2 screenshots at full resolution; (2) Recent memory: screenshots from steps 3-10, counting back from the current step, converted to text descriptions via lightweight captioning; (3) Archival memory: text-only action logs for older steps. Dynamically promote a frame from its text form back to full image tokens when the user or the model references it, which requires keeping the original frames in storage outside the context window.
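A minimal sketch of the three tiers, assuming the tier sizes stated above (2 working frames, 8 recent frames). The class name `TieredVisualMemory`, the `captioner` callable, and the dict layout returned by `render` are hypothetical names chosen for illustration, not part of any agent framework; a real captioner would be a small vision model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Step:
    index: int                     # 1-based step number
    action: str                    # text action log, always retained
    screenshot: bytes              # raw frame, stored outside the prompt
    caption: Optional[str] = None  # filled lazily when the step ages out


class TieredVisualMemory:
    """Hypothetical container; tier sizes follow the report (2 + 8)."""

    def __init__(self, captioner: Callable[[bytes], str],
                 working_size: int = 2, recent_size: int = 8) -> None:
        self.captioner = captioner  # assumed lightweight captioning function
        self.working_size = working_size
        self.recent_size = recent_size
        self.steps: List[Step] = []
        self.promoted: Set[int] = set()

    def add(self, action: str, screenshot: bytes) -> None:
        self.steps.append(Step(len(self.steps) + 1, action, screenshot))

    def promote(self, index: int) -> None:
        # Called when the user or model refers back to a specific step.
        self.promoted.add(index)

    def tier(self, step: Step) -> str:
        age = len(self.steps) - step.index  # 0 means the newest step
        if age < self.working_size or step.index in self.promoted:
            return "working"
        if age < self.working_size + self.recent_size:
            return "recent"
        return "archival"

    def render(self) -> List[dict]:
        """Build the context entries, oldest step first."""
        out = []
        for step in self.steps:
            tier = self.tier(step)
            if tier == "working":
                out.append({"step": step.index, "type": "image",
                            "data": step.screenshot, "action": step.action})
            elif tier == "recent":
                if step.caption is None:
                    step.caption = self.captioner(step.screenshot)
                out.append({"step": step.index, "type": "caption",
                            "text": step.caption, "action": step.action})
            else:
                out.append({"step": step.index, "type": "log",
                            "text": step.action})
        return out
```

Because raw frames stay in `steps` rather than in the prompt, `promote(3)` makes step 3 render as a full image again on the next call to `render`, at the cost of one extra screenshot's worth of tokens.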
Journey Context:
Computer-use agents fail on long-horizon tasks (50+ steps) because each screenshot consumes ~1500 tokens. Sending 10 screenshots consumes ~15k tokens, leaving no room for chain-of-thought reasoning or system instructions. Two common fixes both fail: uniformly compressing all historical frames loses critical early context (for example, 'what file did we open in step 3?'), and dropping the oldest frames entirely causes catastrophic forgetting. Hierarchical summarization mimics human cognitive architecture: working memory (visual), short-term (descriptive), long-term (procedural). This enables multi-hour computer-use sessions without context collapse.
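The token arithmetic can be checked with a short sketch. The ~1500 tokens per screenshot comes from this report; the per-caption and per-log costs (100 and 30 tokens) are illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def naive_tokens(steps: int, per_image: int = 1500) -> int:
    """Every historical screenshot is sent at full resolution."""
    return steps * per_image


def tiered_tokens(steps: int, per_image: int = 1500,
                  per_caption: int = 100,  # assumed caption cost
                  per_log: int = 30,       # assumed action-log cost
                  working: int = 2, recent: int = 8) -> int:
    """Newest `working` steps as images, next `recent` as captions, rest as logs."""
    n_images = min(steps, working)
    n_captions = min(max(steps - working, 0), recent)
    n_logs = max(steps - working - recent, 0)
    return n_images * per_image + n_captions * per_caption + n_logs * per_log


print(naive_tokens(10))   # 15000, the 15k figure from the report
print(naive_tokens(50))   # 75000
print(tiered_tokens(50))  # 5000 = 2*1500 + 8*100 + 40*30
```

Under these assumed costs, history in the tiered scheme grows by only about 30 tokens per additional step once a session passes 10 steps, instead of 1500.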
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T23:29:50.670760+00:00— report_created — created