Report #43579
[frontier] Agents lose spatial context when switching from visual analysis to text generation and back, causing ungrounded references
Maintain a Visual Working Memory (VWM): a structured store of detected UI elements with coordinates and semantic tags, referenced by ID in generated text
Journey Context:
Standard RAG treats past screenshots as opaque images, so an agent forgets spatial relationships when it 'looks away' to generate text. The VWM plays the role that object files play in theories of human vision: it maintains continuity across saccades (here, modality switches). Store it as JSON with element IDs, bounding boxes, and attributes. When the agent switches to text mode, it refers to 'element #5' rather than 'the button on the left', a description that becomes ambiguous. This is critical for multi-step tasks such as 'click red button, then read text to its right', where the agent must look away to read. Resolving references through stored IDs also prevents coordinate drift across turns.
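A minimal sketch of such a store, in Python. The report only specifies JSON with element IDs, bounding boxes, and attributes, so everything else here is an assumption for illustration: the names `UIElement`, `VisualWorkingMemory`, and `right_of`, the `(x_min, y_min, x_max, y_max)` pixel convention, and the rule that "to its right" means the nearest element that starts past the anchor's right edge and overlaps it vertically.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, Optional, Tuple

BBox = Tuple[int, int, int, int]  # assumed convention: (x_min, y_min, x_max, y_max) in pixels


@dataclass
class UIElement:
    element_id: int
    bbox: BBox
    role: str  # semantic tag, e.g. "button" or "text"
    attributes: Dict[str, str] = field(default_factory=dict)


class VisualWorkingMemory:
    """Hypothetical store that keeps detected elements addressable by a stable ID."""

    def __init__(self) -> None:
        self._elements: Dict[int, UIElement] = {}
        self._next_id = 1

    def add(self, bbox: BBox, role: str, **attributes: str) -> int:
        # IDs are assigned once and never reused, so text can cite "element #N".
        element_id = self._next_id
        self._next_id += 1
        self._elements[element_id] = UIElement(element_id, tuple(bbox), role, attributes)
        return element_id

    def get(self, element_id: int) -> UIElement:
        return self._elements[element_id]

    def center(self, element_id: int) -> Tuple[float, float]:
        # Click coordinates come from the stored box, not from a regenerated guess.
        x0, y0, x1, y1 = self._elements[element_id].bbox
        return ((x0 + x1) / 2, (y0 + y1) / 2)

    def right_of(self, element_id: int) -> Optional[UIElement]:
        # Nearest element starting at or past the anchor's right edge
        # whose vertical extent overlaps the anchor's.
        _, ay0, ax1, ay1 = self._elements[element_id].bbox
        candidates = [
            e for e in self._elements.values()
            if e.element_id != element_id
            and e.bbox[0] >= ax1
            and e.bbox[1] < ay1
            and e.bbox[3] > ay0
        ]
        return min(candidates, key=lambda e: e.bbox[0] - ax1, default=None)

    def to_json(self) -> str:
        # Serialized form that can be placed in the text-mode prompt.
        return json.dumps([asdict(e) for e in self._elements.values()], indent=2)


# Usage: 'click red button, then read text to its right'
vwm = VisualWorkingMemory()
button = vwm.add((40, 100, 140, 140), "button", color="red", label="Submit")
status = vwm.add((160, 105, 400, 135), "text", text="Ready to send")
footer = vwm.add((40, 300, 400, 330), "text", text="v1.2")

click_point = vwm.center(button)
target = vwm.right_of(button)
print(f"click {click_point}, then read element #{target.element_id}: {target.attributes['text']}")
```

Because the spatial relation is resolved against stored boxes, the text-mode step only has to carry an ID, and later turns can reuse the same coordinates rather than re-estimating them.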
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-19T03:37:13.075666+00:00— report_created — created