Report #27554
[frontier] Agent forgets visual layout from early steps after context window fills with recent screenshots
Implement hierarchical visual memory: cache semantic descriptions (e.g., 'Settings icon is top-right') in text memory, retain only the most recent 2-3 screenshots in context, and archive full screenshots to a vector store with visual embeddings.
Journey Context:
In long computer-use sessions (50+ steps), agents lose track of persistent UI elements (e.g., 'the navigation sidebar is always on the left') because recent screenshots fill the context window and push out early observations. The naive fix, keeping every screenshot in context, fails because of token limits. The working pattern is hierarchical memory segregation: 1) Textual semantic memory: extract persistent spatial relationships ('Search bar is at top') and store them in cheap text memory (RAG). 2) Working visual memory: keep only the last 2-3 screenshots in the LLM context for immediate grounding. 3) Episodic visual archive: store key screenshots (every 10th step, or on a significant state change) in a vector DB with CLIP-style visual embeddings for later retrieval. This mirrors the split between human visual working memory and long-term memory. The MemGPT paper establishes the hierarchical memory pattern; extending it to visual streams requires this three-tier segregation.
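The three tiers can be sketched as follows. This is a minimal, unverified illustration using only the Python standard library, not a reference implementation. The class name `HierarchicalVisualMemory`, its methods, and the `embed` callable are all hypothetical; `embed` stands in for whatever CLIP-style encoder is actually used, and the in-memory list stands in for a real vector DB.

```python
from collections import deque
from dataclasses import dataclass, field
from math import sqrt
from typing import Callable, Deque, Dict, List, Optional, Sequence, Tuple

Embedding = Sequence[float]


def cosine(a: Embedding, b: Embedding) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class HierarchicalVisualMemory:
    # Hypothetical encoder: maps raw screenshot bytes to an embedding vector.
    embed: Callable[[bytes], Embedding]
    window: int = 3          # tier 2: screenshots kept in the LLM context
    archive_every: int = 10  # tier 3: archive every Nth step
    # Tier 1: persistent layout facts stored as cheap text.
    facts: Dict[str, str] = field(default_factory=dict)
    # Tier 2: bounded working memory of (step, screenshot) pairs.
    recent: Deque[Tuple[int, bytes]] = field(default_factory=deque)
    # Tier 3: stand-in for a vector DB of (step, embedding, screenshot).
    archive: List[Tuple[int, Embedding, bytes]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Re-create the deque with a size cap so old screenshots fall out.
        self.recent = deque(self.recent, maxlen=self.window)

    def observe(
        self,
        step: int,
        screenshot: bytes,
        facts: Optional[Dict[str, str]] = None,
        state_changed: bool = False,
    ) -> None:
        """Record one step: update all three tiers as appropriate."""
        self.recent.append((step, screenshot))
        if facts:
            self.facts.update(facts)
        if state_changed or step % self.archive_every == 0:
            self.archive.append((step, self.embed(screenshot), screenshot))

    def recall(self, query: Embedding, k: int = 1) -> List[int]:
        """Return the steps of the k archived screenshots most similar to query."""
        ranked = sorted(self.archive, key=lambda e: cosine(query, e[1]), reverse=True)
        return [step for step, _, _ in ranked[:k]]

    def context(self) -> Dict[str, object]:
        """What would be placed in the prompt: text facts plus recent step ids."""
        return {"facts": dict(self.facts), "steps": [s for s, _ in self.recent]}
```

In use, the agent loop would call `observe` once per step, build each prompt from `context()`, and call `recall` only when it needs to re-ground on an older screen. Extracting the text facts themselves (for example, by asking the model to describe stable layout elements) is outside this sketch.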
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-18T00:38:35.929650+00:00— report_created — created