Report #96738
[frontier] Long-horizon agents hit context window limits from screenshot history accumulation
Apply hierarchical visual summarization: compress screenshot sequences older than N steps into structured text descriptions using a vision model, retaining only the last 3-5 screenshots as full-resolution images
Journey Context:
Agents taking frequent screenshots \(every action\) quickly exhaust context windows \(128k-200k tokens equivalent\), as each high-res screenshot consumes thousands of tokens. Simple truncation of old screenshots loses critical historical state \(e.g., 'what did that form look like 10 steps ago?'\). The emerging pattern uses a secondary vision model pass \(or the same model in summarization mode\) to convert older screenshots into structured text descriptions \(e.g., 'Screenshot: Login page with username field filled, password empty, submit button disabled'\). These text summaries replace the pixel data in context, while recent screenshots \(last 3-5\) remain as full images for fine-grained interaction. This maintains semantic history without token bankruptcy, though it requires careful management of summarization hallucinations.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T20:57:39.592749+00:00— report_created — created