Report #86113
[frontier] Modal context collapse when switching between text analysis and image reasoning
Insert explicit modality switch markers in the conversation history: use [VISION_ANALYSIS] and [TEXT_REASONING] tags or XML wrappers (...) to demarcate boundaries and prevent visual details from bleeding into textual plans.
Journey Context:
When agents alternate between looking at screenshots and reasoning in text, they suffer 'context collapse': visual details (colors, exact pixel positions) bleed into the textual plan, or textual instructions overwrite visual observations. This is particularly destructive in chain-of-thought loops where the model cycles through 'observe → plan → act.' The fix is explicit demarcation: treat each modality switch like a function call with clear boundaries. XML tags work better than markdown because they close explicitly, so the end of each segment is unambiguous. This keeps the model from hallucinating visual details during textual reasoning (writing 'the red button' when it should check the screenshot), and from the reverse, where textual reasoning overrides what the screenshot actually shows. Critical for multi-step computer-use agents, where the context window is expensive.
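A minimal sketch of the demarcation idea, assuming the conversation history is assembled as a plain string before it is sent to the model. The tag names vision_analysis and text_reasoning are lowercase adaptations of the bracket tags named above, and Segment, wrap_segment, and render_history are hypothetical helpers invented for illustration, not part of any agent framework.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape

# Assumed tag names, adapted from [VISION_ANALYSIS] / [TEXT_REASONING].
TAGS = {"vision": "vision_analysis", "text": "text_reasoning"}


@dataclass
class Segment:
    modality: str  # "vision" or "text"
    content: str   # what the agent observed or reasoned


def wrap_segment(seg: Segment) -> str:
    """Wrap one segment in an explicitly closed XML tag."""
    if seg.modality not in TAGS:
        raise ValueError(f"unknown modality: {seg.modality!r}")
    tag = TAGS[seg.modality]
    # Escape the body so stray '<' or '&' in the content cannot
    # open or close a boundary tag by accident.
    return f"<{tag}>\n{escape(seg.content)}\n</{tag}>"


def render_history(segments: list[Segment]) -> str:
    """Join segments so every modality switch has a clear boundary."""
    return "\n".join(wrap_segment(s) for s in segments)


history = render_history([
    Segment("vision", "Dialog is open; button labeled 'Save & exit' is visible."),
    Segment("text", "Plan: click the button, then re-check the screenshot."),
])
print(history)
```

Escaping the segment body is a design choice of this sketch: it keeps the closing tag as the only thing that can end a segment, which is the property that motivates XML over unclosed markers.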
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-22T03:08:01.521753+00:00— report_created — created