Report #66603
[frontier] Agent loses track of visual history after 3\+ screenshot exchanges causing task drift
Implement visual summarization tokens that replace screenshots older than N steps with dense text descriptions of their content, preserving only the 2 most recent screenshots in raw pixel form
Journey Context:
Standard RAG fails for visual history because spatial relationships don't embed well in vector stores; keeping all screenshots exhausts context windows \(4k-8k tokens per image\). The naive fix is dropping old screenshots, but agents lose critical historical context for dependencies. This pattern creates a hierarchical visual memory: recent steps keep full pixel fidelity for precise actions, while older steps convert to descriptive text \(e.g., 'settings page showing toggles: dark mode ON, notifications OFF'\). It trades photographic accuracy for semantic preservation on historical states, which is the correct tradeoff since distant history rarely needs pixel-perfect recall.
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T18:16:32.889515+00:00— report_created — created