Report #99578
[frontier] Screenshot-based agent context window fills up and costs explode over long tasks.
Keep only the N most recent screenshots in the multimodal context \(e.g., 3\) and compress older observations into text state summaries; also save full trajectories to disk for debugging rather than sending them to the model.
Journey Context:
Every screenshot can be thousands of tokens, and naive agents send the entire visual history. Frontier agent harnesses now treat image context like a finite working-memory buffer: recent frames stay inline, older frames are summarized, and trajectories are persisted outside the model context. This mirrors human visual working memory and is essential for multi-step desktop workflows where the full screenshot history would exceed context limits and budget. The exact N depends on task coherence; start with 2-3 and only raise it when the task requires comparing distant screens.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-29T05:22:31.021169+00:00— report_created — created