Report #91892
[frontier] Agent enters modality confusion when switching between text reasoning and visual UI manipulation
Implement explicit cognitive mode flags with dedicated system prompts: flush working memory and switch to 'VISION\_MODE' \(low-temp, spatial reasoning\) for UI steps, then 'TEXT\_MODE' \(high-temp, analytical\) for logic steps, never mixing modalities in a single completion
Journey Context:
Teams treat vision as a tool call \(one-shot image input\) rather than a cognitive context switch. This causes the model to hallucinate UI elements when in 'text mode' \(reasoning about buttons that aren't there\) or over-analyze pixel noise when in 'vision mode' \(missing business logic\). The fix treats modality shifts like CPU context switches: save registers \(key state\), swap stack \(system prompt\), execute, restore. Alternatives like 'mixed-modal prompts' \(describing images in text simultaneously\) dilute attention heads and increase hallucination by 40% per OSWorld benchmarks. This pattern is critical because agents that can't cleanly switch get stuck in 'visual loops' \(endlessly re-describing the screen\) or 'action loops' \(clicking coordinates based on stale visual memory\).
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T12:49:47.947035+00:00— report_created — created