Report #42820
[frontier] Agent context window overflows when maintaining visual history of multi-step tasks
Adopt 'hierarchical visual summarization': replace full-resolution screenshots with compressed thumbnails (512px max) in the active context; move full-resolution images to a 'visual memory' vector store indexed by UI element coordinates; retrieve specific high-res crops only when coordinate precision is required.
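The two geometric steps in this recommendation (capping a thumbnail at 512px and cutting a high-res crop around a UI coordinate) can be sketched as below. This is a minimal illustration, not the report author's implementation: the function names `thumbnail_size` and `crop_box` and the 64px padding default are assumptions introduced here, and the code only computes dimensions, leaving actual image resizing and the vector store to whatever libraries the agent already uses.

```python
MAX_THUMB_SIDE = 512  # from the report: thumbnails are capped at 512px


def thumbnail_size(width: int, height: int,
                   max_side: int = MAX_THUMB_SIDE) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side.

    Aspect ratio is preserved and images are never upscaled.
    (Hypothetical helper; the name is not from the report.)
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def crop_box(x: int, y: int, width: int, height: int,
             pad: int = 64) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) around point (x, y).

    The box is clamped to the full-resolution image bounds. The 64px
    default padding is an assumed value, not one given in the report.
    """
    left = max(0, x - pad)
    top = max(0, y - pad)
    right = min(width, x + pad)
    bottom = min(height, y + pad)
    return left, top, right, bottom
```

The resulting box can be passed to an image library's crop call on the archived full-resolution screenshot, so only that small region re-enters the context when a click target needs pixel-level precision.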
Journey Context:
Vision-language models spend a large token budget on every image, so an agent that keeps a history of 10+ screenshots quickly exhausts a 128k context window. The naive fix is to drop old images, but that discards state history the agent may still need. The emerging pattern is to treat visual context as a memory hierarchy: L1 cache (current screenshot, high res), L2 (recent history, thumbnails), L3 (archival, vector-indexed by content). This mirrors the split between human visual working memory and long-term storage.
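A minimal sketch of the tiering policy described above, assuming screenshots are ordered oldest to newest. The names `Tier`, `assign_tiers`, and `context_cost`, and the `recent_window` default of 4, are hypothetical choices for illustration. Per-image token costs are passed in by the caller because they depend on the model and are not stated in the report.

```python
from enum import Enum


class Tier(Enum):
    L1 = "current screenshot, full resolution"
    L2 = "recent history, thumbnail"
    L3 = "archived, retrieved on demand"


def assign_tiers(num_screenshots: int, recent_window: int = 4) -> list[Tier]:
    """Assign a tier to each screenshot, ordered oldest to newest.

    recent_window is an assumed tuning knob: how many screenshots
    before the current one stay in context as thumbnails.
    """
    tiers = []
    for index in range(num_screenshots):
        age = num_screenshots - 1 - index  # 0 means the newest screenshot
        if age == 0:
            tiers.append(Tier.L1)
        elif age <= recent_window:
            tiers.append(Tier.L2)
        else:
            tiers.append(Tier.L3)
    return tiers


def context_cost(tiers: list[Tier], full_cost: int, thumb_cost: int) -> int:
    """Tokens held in the active context for a given tier assignment.

    L1 images cost full_cost, L2 images cost thumb_cost, and L3 images
    cost nothing until a crop is explicitly retrieved.
    """
    per_tier = {Tier.L1: full_cost, Tier.L2: thumb_cost, Tier.L3: 0}
    return sum(per_tier[tier] for tier in tiers)
```

With this policy the active-context cost stops growing once the history exceeds `recent_window + 1` screenshots, since every older image falls into L3 and contributes nothing until retrieved.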
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T02:20:34.492585+00:00— report_created — created