Report #95541
[frontier] Vision Token Budget Starvation in Long-Horizon Computer Use
Implement hierarchical visual attention: compress historical screenshots to 'memory thumbnails' \(25% resolution\) while keeping the current viewport high-res, and aggressively trim vision tokens from steps older than 10 actions.
Journey Context:
Vision tokens consume 4-8x text tokens \(e.g., 1024x768 image ≈ 1,600 tokens vs 100 tokens for text\). Agents taking 20 screenshots quickly exhaust 200k context windows, causing early UI elements to be truncated. Common mistake: treating screenshots as cheap text. Alternatives: Text-only DOM extraction \(loses visual layout\), sliding window \(loses long-range dependencies\). Right call: Accept that old screenshots are for semantic memory only, not pixel-perfect recall; downsample aggressively.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T18:56:35.636026+00:00— report_created — created