Report #73504
[frontier] Visual token budget collapse in long-horizon computer-use tasks
Implement hierarchical visual compression. Maintain a 'visual state log' that uses thumbnail overviews (low-res, high-compression) for historical context, combined with high-resolution 'attention crops' for only the active region of interest. For state updates between steps, use text-based 'visual diffing' (descriptions of what changed) rather than full screenshots.
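The text-based diffing step can be sketched as below. This is a minimal illustration, not a reference implementation: it assumes each step's UI state is already available as a dictionary of elements (for example from an accessibility tree), and the `UIElement` type, the element keys, and the `diff_ui_state` function are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UIElement:
    """Hypothetical snapshot of one on-screen element."""
    role: str                        # e.g. "button", "textbox"
    label: str                       # visible text or accessible name
    bbox: tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def diff_ui_state(before: dict[str, UIElement],
                  after: dict[str, UIElement]) -> list[str]:
    """Describe what changed between two UI snapshots as short text lines."""
    changes = []
    # Elements present only in the new snapshot.
    for key in sorted(after.keys() - before.keys()):
        el = after[key]
        changes.append(f"appeared: {el.role} '{el.label}' at {el.bbox}")
    # Elements present only in the old snapshot.
    for key in sorted(before.keys() - after.keys()):
        el = before[key]
        changes.append(f"disappeared: {el.role} '{el.label}'")
    # Elements present in both: report label changes, then position changes.
    for key in sorted(before.keys() & after.keys()):
        old, new = before[key], after[key]
        if old.label != new.label:
            changes.append(f"changed: {old.role} '{old.label}' -> '{new.label}'")
        elif old.bbox != new.bbox:
            changes.append(f"moved: {new.role} '{new.label}' to {new.bbox}")
    return changes or ["no visible change"]


before = {"save": UIElement("button", "Save", (10, 10, 60, 30))}
after = {
    "save": UIElement("button", "Saved", (10, 10, 60, 30)),
    "toast": UIElement("alert", "File saved", (200, 0, 400, 40)),
}
print("\n".join(diff_ui_state(before, after)))
```

Each returned line is a few tokens of text, so appending it to the state log is far cheaper than resending a screenshot. How elements are keyed (and how stable those keys are across steps) is left open here and would need to match whatever UI inspection the agent actually has.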
Journey Context:
Standard computer-use implementations send every screenshot at full resolution (roughly 1,100 tokens for a typical 1920x1080 screenshot in GPT-4o's high-detail mode, which bills about 170 tokens per 512x512 tile plus a fixed base), consuming 80% of the context window within 10 steps and pushing out tool schemas and reasoning history. Downsampling everything loses critical detail (small buttons, text legibility). The working pattern treats visual context like a memory hierarchy: L1 is a running text log of UI state changes, L2 is low-res 'overview' screenshots for spatial memory, and L3 is high-res crops of only the specific element being interacted with. This requires the agent to manage 'visual attention' explicitly, deciding which region needs detail and which needs only context. This is distinct from video compression: it is semantic visual diffing based on UI element stability.
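A rough cost comparison helps show why the hierarchy matters. The sketch below assumes the tile-based image pricing published for GPT-4o (a fixed 85-token base, 170 tokens per 512x512 tile after rescaling, and a flat 85 tokens for a low-detail image); these constants may not hold for other models. The function names and the 40-token-per-step size of a text diff are illustrative assumptions, not measurements.

```python
import math

BASE_TOKENS = 85        # assumed fixed cost per high-detail image
TOKENS_PER_TILE = 170   # assumed cost per 512x512 tile in high-detail mode
LOW_DETAIL_TOKENS = 85  # assumed flat cost of a low-detail image


def high_detail_tokens(width: int, height: int) -> int:
    """Estimate high-detail image tokens under the assumed tile rule."""
    # Step 1: shrink to fit inside 2048x2048, preserving aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # Step 3: count the 512x512 tiles needed to cover the result.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


def naive_history_tokens(steps: int, screenshot=(1920, 1080)) -> int:
    """Every step keeps a full-resolution screenshot in context."""
    return steps * high_detail_tokens(*screenshot)


def tiered_history_tokens(steps: int, crop=(512, 512), overviews: int = 1,
                          diff_tokens_per_step: int = 40) -> int:
    """Text log for history, low-res overviews, one high-res active crop."""
    text_log = steps * diff_tokens_per_step      # tier 1: running text diffs
    overview = overviews * LOW_DETAIL_TOKENS     # tier 2: low-res overviews
    active_crop = high_detail_tokens(*crop)      # tier 3: active region only
    return text_log + overview + active_crop


print(high_detail_tokens(1920, 1080))  # 1105
print(naive_history_tokens(10))        # 11050
print(tiered_history_tokens(10))       # 740
```

Under these assumptions a ten-step history drops from about 11,000 image tokens to under 1,000. The real saving depends on how many overviews are retained and how verbose the text diffs are.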
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T05:58:21.469013+00:00— report_created — created