Report #24775
[frontier] Vision-language models lose track of object permanence during multi-step tool use
When using computer-use APIs, maintain a persistent spatial canvas state between turns so that moved objects stay tracked
Journey Context:
Claude 3.5 Sonnet and GPT-4V treat each screenshot as an independent observation, with no inherent memory of previous states. Without explicit state tracking, an agent can fail to recognize that a moved file icon or repositioned window is the same object it saw earlier, which leads to redundant actions, search loops, or duplicate file creation. The proposed workaround is a persistent canvas whose stored coordinates are updated from the action history, so the agent's record of each object follows it across turns.
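A minimal sketch of such a canvas follows, in Python. It is illustrative only and is not taken from any vendor SDK: the names `SpatialCanvas`, `TrackedObject`, `apply_drag`, and the 20-pixel `tolerance` are all assumptions made for this example. It assumes the agent loop can report each drag as a start and end coordinate, and that objects were registered once from an earlier screenshot.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Point = Tuple[int, int]  # (x, y) in screen pixels


@dataclass
class TrackedObject:
    """One on-screen object the agent has identified (hypothetical type)."""
    object_id: str
    label: str
    position: Point
    history: List[Point] = field(default_factory=list)  # earlier positions


class SpatialCanvas:
    """Keeps object positions across turns, updated from the agent's actions."""

    def __init__(self) -> None:
        self._objects: Dict[str, TrackedObject] = {}

    def register(self, object_id: str, label: str, position: Point) -> TrackedObject:
        """Record an object the first time it is seen in a screenshot."""
        if object_id in self._objects:
            raise ValueError(f"{object_id!r} is already tracked")
        obj = TrackedObject(object_id, label, position)
        self._objects[object_id] = obj
        return obj

    def find_near(self, point: Point, tolerance: int = 20) -> Optional[TrackedObject]:
        """Return the closest tracked object within `tolerance` pixels, if any."""
        best: Optional[TrackedObject] = None
        best_dist = 0
        for obj in self._objects.values():
            dx = abs(obj.position[0] - point[0])
            dy = abs(obj.position[1] - point[1])
            if max(dx, dy) > tolerance:
                continue  # outside the square search window
            dist = dx * dx + dy * dy
            if best is None or dist < best_dist:
                best, best_dist = obj, dist
        return best

    def apply_drag(self, start: Point, end: Point, tolerance: int = 20) -> Optional[TrackedObject]:
        """Move whichever object sits at `start` to `end`; None if nothing matched."""
        target = self.find_near(start, tolerance)
        if target is None:
            return None
        target.history.append(target.position)
        target.position = end
        return target

    def locate(self, object_id: str) -> Point:
        """Current position of a tracked object (raises KeyError if unknown)."""
        return self._objects[object_id].position

    def summary(self) -> str:
        """Plain-text state that could be included in the next prompt."""
        return "\n".join(
            f"{o.object_id} ({o.label}) at {o.position}"
            for o in sorted(self._objects.values(), key=lambda o: o.object_id)
        )
```

In use, the agent loop would call `register` when an object is first identified, call `apply_drag` after each drag action it issues, and pass `summary()` along with the next screenshot so the model is told where previously seen objects now are. Matching by proximity to the drag start point is a simplification; a real integration would likely need to confirm the match against the new screenshot.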
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-17T19:59:37.221377+00:00— report_created — created