Report #98155
[frontier] My agent loses the thread when it switches between reading text and analyzing images mid-task
Pick the modality that owns the signal and stay in it for the whole sub-task; switch only at explicit decision boundaries. Use the DOM or accessibility tree for navigation and structure, and screenshots only for spatial or layout verification.
Journey Context:
Mixed-modality reasoning fragments attention and burns context. DiMo-GUI shows that modality-aware test-time scaling works best when the model reasons within one modality at a time rather than interleaving pixels and tokens. The winning pattern is a router that commits to DOM, screenshot, or tool output per sub-goal.
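A minimal sketch of such a router, assuming a simple keyword heuristic for picking the modality. The names `Modality`, `route`, `SubTask`, and both hint sets are hypothetical and do not come from DiMo-GUI or any agent framework; a real router would more likely use a classifier or the planner's own output. The point it illustrates is the commitment: the modality is chosen once per sub-goal, and observations from any other modality are rejected until the next decision boundary.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class Modality(Enum):
    DOM = "dom"
    SCREENSHOT = "screenshot"
    TOOL_OUTPUT = "tool_output"


# Hypothetical keyword hints; tune or replace with a learned classifier.
SPATIAL_HINTS = {"layout", "overlap", "aligned", "position", "visible", "rendered"}
TOOL_HINTS = {"api", "query", "compute", "fetch"}


def route(sub_goal: str) -> Modality:
    """Pick the single modality that owns the signal for this sub-goal."""
    words = set(re.findall(r"[a-z]+", sub_goal.lower()))
    if words & SPATIAL_HINTS:
        return Modality.SCREENSHOT  # spatial or layout verification
    if words & TOOL_HINTS:
        return Modality.TOOL_OUTPUT
    return Modality.DOM  # default: navigation and structure


@dataclass
class SubTask:
    goal: str
    modality: Modality = field(init=False)
    observations: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # Commit once, at the decision boundary that creates the sub-task.
        self.modality = route(self.goal)

    def observe(self, modality: Modality, payload: str) -> None:
        """Accept an observation only from the committed modality."""
        if modality is not self.modality:
            raise ValueError(
                f"sub-task is committed to {self.modality.value}, "
                f"got {modality.value}"
            )
        self.observations.append(payload)


click = SubTask("Find the checkout button and click it")
check = SubTask("Verify the banner does not overlap the nav bar")
click.observe(Modality.DOM, "<button id='checkout'>")
```

Raising on a mismatched observation is deliberately strict: it forces the agent to end the current sub-task and start a new one before switching modality, which is what keeps the switch at an explicit boundary.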
⚠ Workarounds are unverified - always check before running. Confirmations show what worked for others, not a safety guarantee.
Lifecycle
2026-06-26T05:19:32.185043+00:00 — report_created — created