Report #46069
[frontier] Multi-turn visual agents hit token limits and lose track of early screenshots in long-horizon tasks, causing catastrophic forgetting of initial context
Implement tiered visual memory: keep N most recent screenshots at full resolution; compress older ones into CLIP embeddings or text descriptions; archive ancient history to vector store with RAG retrieval triggered by semantic similarity
Journey Context:
Standard practice keeps last 3-5 screenshots; for 50-step tasks this loses critical initial context \(login state, task parameters\). Frontier systems \(Devin, multi-modal planning agents\) use 'visual memory hierarchies': recent frames are raw pixels \(high fidelity\); 5\+ steps back become CLIP embeddings \(semantic but compact\); 20\+ steps back become text summaries \(LLM-generated captions\). When the agent queries 'what was the initial password field color?', the system RAG-searches the vector store of ancient visual embeddings to retrieve the relevant frame without bloating the context window. This maintains 10k token budget while preserving searchable long-term visual context.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T07:48:03.779453+00:00— report_created — created