Report #60723
[frontier] Interleaved Image-Text Context Decay: In long-horizon tasks (20+ steps), agents interleave screenshots and text actions; by step 20, the VLM has effectively 'forgotten' the initial UI layout and constraints due to attention dilution across modalities
Implement an explicit Visual Memory Bank: compress early screenshots into structured spatial maps (JSON of element positions and states) that are re-injected as text context every N steps, rather than relying on the model to remember pixels from 20 steps earlier
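A minimal sketch of the compression step, assuming the elements have already been detected by some UI parser. The `UIElement` schema and the `to_spatial_map` helper are hypothetical names chosen for illustration, not part of any library:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class UIElement:
    """Hypothetical schema for one detected element (not a parser's real API)."""
    kind: str    # e.g. "button", "sidebar"
    label: str   # visible text or accessible name
    x: float     # normalized center x in [0, 1]
    y: float     # normalized center y in [0, 1]
    state: str   # e.g. "visible", "collapsed"


def to_spatial_map(step: int, elements: list[UIElement]) -> str:
    """Serialize one frame's elements as compact JSON for re-injection as text."""
    payload = {"step": step, "elements": [asdict(e) for e in elements]}
    return json.dumps(payload, separators=(",", ":"))


# Illustrative data only: a first frame with a collapsed sidebar and a button.
frame0 = [
    UIElement("sidebar", "Navigation", 0.05, 0.5, "collapsed"),
    UIElement("button", "Submit", 0.5, 0.2, "visible"),
]
spatial_map = to_spatial_map(0, frame0)
```

The resulting string can be placed in a text message in place of the original screenshot, so a constraint such as the collapsed sidebar stays readable without the image.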
Journey Context:
Current Computer Use APIs send the full screenshot history in the conversation. For a 50-step task, that is 50 images. VLMs have limited effective context for images, and attention mechanisms prioritize recent tokens, so early visual constraints (e.g., 'sidebar is collapsed', 'notification badge present') are lost by step 15. The fix is not to send more pixels, which runs into context limits, but to extract structured state from early frames: use OmniParser to extract an element tree with bounding boxes, then convert it to a text description ('Button: Submit, at (0.5, 0.2), visible'). This representation is more compact than raw pixels. Refresh this structured memory every 5 steps. Recent screenshots provide 'working memory'; structured text provides 'long-term visual memory'. This 'Hierarchical Visual Context Management' pattern is emerging in production agents in 2025 to handle tasks of 100+ steps.
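The two-tier scheme described above could be sketched roughly as follows. Everything here is an assumption for illustration: `describe`, `VisualMemoryBank`, and `keep_recent=3` are hypothetical, the element lines are assumed to come from a parser such as OmniParser (its API is not shown), and keeping the first-frame layout separately from the latest refresh is one possible reading of the report, not a confirmed design:

```python
from dataclasses import dataclass, field


def describe(kind: str, label: str, x: float, y: float, state: str) -> str:
    """Format one element in the text style quoted in the report."""
    return f"{kind}: {label}, at ({x:g}, {y:g}), {state}"


@dataclass
class VisualMemoryBank:
    refresh_every: int = 5   # from the report: refresh structured memory every 5 steps
    keep_recent: int = 3     # assumption: raw screenshots kept as working memory
    step: int = 0
    initial_layout: str = ""  # long-term memory: first-frame constraints
    latest_layout: str = ""   # long-term memory: most recent refresh
    recent_screenshots: list = field(default_factory=list)

    def observe(self, screenshot, element_lines: list[str]) -> None:
        """Record one step: keep the screenshot, refresh text memory on schedule."""
        self.step += 1
        self.recent_screenshots = (
            self.recent_screenshots + [screenshot]
        )[-self.keep_recent:]
        if self.step == 1:
            self.initial_layout = "\n".join(element_lines)
            self.latest_layout = self.initial_layout
        elif self.step % self.refresh_every == 0:
            self.latest_layout = "\n".join(element_lines)

    def build_context(self) -> dict:
        """Return what would be sent to the model for the next step."""
        return {
            "initial_layout": self.initial_layout,
            "latest_layout": self.latest_layout,
            "recent_screenshots": list(self.recent_screenshots),
        }


# Simulate 12 steps with placeholder screenshots and one element line per step.
bank = VisualMemoryBank()
for i in range(1, 13):
    bank.observe(f"screenshot_{i}", [describe("Button", f"Step{i}", 0.5, 0.2, "visible")])
context = bank.build_context()
```

With these settings the context after 12 steps holds the step-1 layout, the step-10 refresh, and only the last three screenshots, instead of all 12 images.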
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-20T08:24:39.972343+00:00— report_created — created