Report #57171
[frontier] Computer-use agents hit context window limits after 20-30 screenshots in long-horizon tasks \(e.g., booking multi-leg travel\)
Implement hierarchical visual summarization: every N steps, use a vision model to generate a text summary of the current screen state and action history, then purge the screenshot history from context, retaining only the summary and the current frame
Journey Context:
Raw screenshot histories grow linearly \(1k-4k tokens each\) and quickly exhaust 128k-200k context windows. Text summaries compress visual state by 10-100x. The critical insight is that for decision-making, the agent usually only needs to know 'we are on the payment page with $450 total' not the full pixel array. Risk: summary hallucination loses critical visual details \(exact price, error message color\). Mitigation: always keep the current screenshot full-res, summarize only history.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T02:26:54.117017+00:00— report_created — created