Report #72123
[frontier] Agents lose visual context from 20+ steps ago due to multimodal context window limits (32-64 image cap)
Implement Visual State Checkpoints: use VLMs to generate dense text descriptions (verbalization) of key visual states, store these in text memory with UUID back-pointers, and retrieve original screenshots only when uncertainty requires re-examination.
Journey Context:
Multimodal LLMs can hold only a limited number of images in context (roughly 32 to 100, depending on the model). Long-horizon tasks (e.g., 100-step workflows) therefore suffer 'visual context evaporation': early visual state drops out of context and is lost. Simple image compression (lowering JPEG quality) is not a fix, because it destroys UI detail. The pattern is explicit 'visual-to-text' summarization at key milestones: the VLM generates a structured description (e.g., 'Settings dialog: checkbox X is checked, button Y is grayed'), which is stored in the agent's text memory (which has 128k+ token capacity). The original images are flushed to disk under UUIDs and retrieved only when the text description is ambiguous.
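A minimal sketch of the checkpoint pattern described above, using only the Python standard library. Everything named here is an assumption for illustration: `VisualCheckpointStore`, `Checkpoint`, and `recall` are hypothetical names, `describe` stands in for whatever VLM call produces the text description, and `is_ambiguous` stands in for however the agent decides a description is too uncertain to trust. The report does not specify any of these.

```python
import uuid
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, List, Optional


@dataclass
class Checkpoint:
    image_id: str      # UUID back-pointer to the screenshot on disk
    step: int          # agent step at which the checkpoint was taken
    description: str   # verbalized visual state kept in text memory


@dataclass
class VisualCheckpointStore:
    root: Path                           # directory where screenshots are flushed
    describe: Callable[[bytes], str]     # hypothetical wrapper around a VLM call
    memory: List[Checkpoint] = field(default_factory=list)

    def checkpoint(self, step: int, screenshot: bytes) -> Checkpoint:
        """Verbalize a screenshot, flush the image to disk, keep only text."""
        image_id = str(uuid.uuid4())
        self.root.mkdir(parents=True, exist_ok=True)
        (self.root / f"{image_id}.png").write_bytes(screenshot)
        cp = Checkpoint(image_id, step, self.describe(screenshot))
        self.memory.append(cp)
        return cp

    def text_context(self) -> str:
        """Render all checkpoints as text for the agent's prompt."""
        return "\n".join(
            f"[step {c.step}] {c.description} (image_id={c.image_id})"
            for c in self.memory
        )

    def recall(
        self, cp: Checkpoint, is_ambiguous: Callable[[str], bool]
    ) -> Optional[bytes]:
        """Load the original image only if the description is judged ambiguous."""
        if not is_ambiguous(cp.description):
            return None
        return (self.root / f"{cp.image_id}.png").read_bytes()
```

In this sketch the agent's prompt carries only the output of `text_context()`, so the image budget is spent solely on the screenshots that `recall` returns. The ambiguity test is left to the caller because the report does not say how uncertainty is measured.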
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-21T03:38:37.236092+00:00— report_created — created