Report #26980
[frontier] Agent context window overflow and catastrophic forgetting when processing sequential full-resolution screenshots
Implement tiered visual retention: retain the last 3 screenshots at full resolution, convert the preceding 5 to textual element lists (JSON bounding boxes + labels), and summarize everything older into structured state logs (open windows, URLs, clipboard content).
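A minimal sketch of the tiering rule above, assuming each step already carries its screenshot, an extracted element list, and a small state record. The `Step` class, its field names, and `build_context` are hypothetical names chosen for illustration; only the 3 / 5 / older split comes from this report.

```python
from dataclasses import dataclass, field
from typing import Any

FULL_RES_KEEP = 3  # from the report: last 3 screenshots stay as full images
ELEMENT_KEEP = 5   # from the report: the preceding 5 become element lists


@dataclass
class Step:
    """One agent step. All field names are hypothetical."""
    index: int
    screenshot: bytes                   # raw image bytes
    elements: list[dict[str, Any]]      # e.g. {"label": "Save", "bbox": [x, y, w, h]}
    state: dict[str, Any] = field(default_factory=dict)  # windows, URL, clipboard


def build_context(steps: list[Step]) -> list[dict[str, Any]]:
    """Return context entries, oldest first, tiered by recency."""
    n = len(steps)
    full_start = max(0, n - FULL_RES_KEEP)
    elem_start = max(0, full_start - ELEMENT_KEEP)

    context: list[dict[str, Any]] = []

    # Tier 3: everything older collapses into a single structured state log.
    older = steps[:elem_start]
    if older:
        context.append({
            "tier": "state_log",
            "entries": [{"step": s.index, "state": s.state} for s in older],
        })

    # Tier 2: textual element lists (bounding boxes + labels), no pixels.
    for s in steps[elem_start:full_start]:
        context.append({"tier": "elements", "step": s.index, "elements": s.elements})

    # Tier 1: the most recent screenshots at full resolution.
    for s in steps[full_start:]:
        context.append({"tier": "image", "step": s.index, "image": s.screenshot})

    return context
```

Calling `build_context` again after every new step demotes older steps automatically, since the tiers are derived from position in the list and not stored on the step. How the element lists and state records are produced (accessibility tree, OCR, a separate model call) is outside this sketch.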
Journey Context:
Full 1080p screenshots consume roughly 1000 to 4000 tokens each, depending on detail settings. At the upper end, twenty steps cost about 80k tokens in images alone, and longer runs, combined with prompts and tool output, exhaust a 200k context window and leave no room for reasoning. Simple truncation causes agents to forget critical prior actions (e.g., 'did I already move the file?'). Thumbnail images are not supported by most LLM APIs, and frame sampling misses transient UI states. The tiered approach mimics human visual working memory: high fidelity for the immediate context, semantic abstraction for the past. This limits token growth while preserving spatial relationships in recent history and semantic state for older steps.
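The budget argument can be checked with back-of-envelope arithmetic. The sketch below counts image-related tokens only (prompts, tool output, and reasoning are not included) and uses the per-screenshot range quoted above. The costs for an element list (300 tokens) and a state-log entry (40 tokens) are placeholder assumptions, not measurements; substitute values from your own tokenizer.

```python
def flat_cost(steps: int, tokens_per_image: int) -> int:
    """Tokens used when every step keeps its full-resolution screenshot."""
    return steps * tokens_per_image


def tiered_cost(
    steps: int,
    tokens_per_image: int,
    tokens_per_element_list: int = 300,  # assumption, not measured
    tokens_per_state_entry: int = 40,    # assumption, not measured
    full_keep: int = 3,
    element_keep: int = 5,
) -> int:
    """Tokens used under the 3 / 5 / older tiering described in this report."""
    full = min(steps, full_keep)
    elems = min(max(steps - full_keep, 0), element_keep)
    older = max(steps - full_keep - element_keep, 0)
    return (
        full * tokens_per_image
        + elems * tokens_per_element_list
        + older * tokens_per_state_entry
    )


if __name__ == "__main__":
    for n in (20, 50):
        print(n, flat_cost(n, 4000), tiered_cost(n, 4000))
    # 20 80000 13980
    # 50 200000 15180
```

Under these assumptions the flat history grows linearly with step count, while the tiered history is dominated by the three retained images and grows only by the small per-entry cost of the state log.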
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T23:41:10.749153+00:00— report_created — created