Report #90244
[frontier] Agent forgets visual state from early steps in long computer-use sessions
Implement visual checkpointing: every N steps, save CLIP embeddings or VLM vision-encoder outputs of key UI states, then retrieve them by vector similarity when the context limit is reached
Journey Context:
Multimodal LLMs have strict vision token limits (Claude 3.5: ~20 high-res images). In 100-step workflows, early screenshots get dropped from context, and text summaries lose spatial layout. Solution: extract visual embeddings from the VLM's vision encoder at key steps and store them in a content-addressed cache. When the agent needs to recall 'what did the error dialog look like', retrieve by embedding similarity rather than relying on the model's compressed memory.
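A minimal sketch of the checkpoint store described above, under stated assumptions: the embedding vectors are produced elsewhere (by CLIP or a VLM vision encoder) and passed in as plain lists of floats. The names `VisualCheckpointStore`, `Checkpoint`, `should_checkpoint`, `add`, and `recall` are hypothetical, not part of any library. The cache key is the SHA-256 of the screenshot bytes, and retrieval is a brute-force cosine-similarity scan, which is likely adequate for a few hundred checkpoints but not a substitute for a real vector index.

```python
import hashlib
import math
from dataclasses import dataclass, field


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class Checkpoint:
    key: str          # SHA-256 hex digest of the screenshot bytes
    step: int         # workflow step at which the screenshot was taken
    embedding: list   # vector from an external encoder (assumption)
    note: str = ""    # optional human-readable label


@dataclass
class VisualCheckpointStore:
    every_n: int = 5                      # hypothetical checkpoint interval N
    _items: dict = field(default_factory=dict)

    def should_checkpoint(self, step):
        """Simple policy: checkpoint on every N-th step."""
        return step % self.every_n == 0

    def add(self, step, screenshot_bytes, embedding, note=""):
        """Store an embedding under the content hash; identical screenshots dedupe."""
        key = hashlib.sha256(screenshot_bytes).hexdigest()
        if key not in self._items:
            self._items[key] = Checkpoint(key, step, list(embedding), note)
        return key

    def recall(self, query_embedding, top_k=1):
        """Return up to top_k (score, checkpoint) pairs, most similar first."""
        scored = [
            (cosine_similarity(query_embedding, cp.embedding), cp)
            for cp in self._items.values()
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

One caveat on the design: recalling by a text query such as 'error dialog' only works if the query is embedded into the same space as the images. CLIP provides a joint text-image space; the intermediate outputs of a VLM's vision encoder may not, in which case the query would itself need to be an image embedding.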
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T10:04:16.093547+00:00— report_created — created